
Hyungi Ahn b09687d41d feat(scripts): Phase 1D Round 2 — controlled backfill stratification

Augments the existing phase1d_pilot.py (a simple ai_domain × file_size
3-bucket split) with the four axes, the sample_source split, and
forced_include from the plan ~/.claude/plans/stratified-mingling-otter.md.

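Limitations of Round 1 (ai_domain × file_size 3-bucket):
  It reflects only the natural distribution of pending PDFs, so known
  weak spots (handwriting / scans / Korean-Chinese-Japanese mixed OCR)
  never make it into the sample. In the 1C visual check, doc 4809
  (Note_240805_용접교육 필기) showed exactly that pattern, and leaving
  selection to chance risks missing the same case again next round.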

Round 2 design:
  - 4 축 stratification: doc_type × file_size_band × text_density_band
    × handwritten_hint
  - sample_source ∈ {existing_success(5), controlled_backfill(25)}
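  - forced_include doc 4809: known bad anchor. After the next tuning
    pass or the introduction of an alternative, the same document can be
    reconverted and the results compared 1:1.
  - text_density = LENGTH(extracted_text) / (file_size / 1024) chars/KB.
    The cleanest single proxy. Validated at both extremes:
    0.17 (handwritten 4809) ↔ 94 (born-digital 3759).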
  - script_mix proxy: Hangul/CJK/Hiragana/Katakana/Latin Unicode block
    ratio → korean_dominant / mixed_korean_cjk / mixed_korean_latin /
    cjk_dominant / latin_dominant / unknown.
  - page_count_estimate: existing_success uses
    md_extraction_quality.metrics.source_page_count. controlled_backfill
    is NULL (marker re-reads the file with PyMuPDF anyway).
  - Seed fixed at SAMPLE_SEED=20260502 to guarantee reproducibility.
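
The two proxies above (text_density and script_mix) could be sketched roughly as follows. This is a minimal illustration, not the code in phase1d_pilot.py: the function names, the Unicode ranges chosen for each script, and the 0.7 / 0.1 thresholds are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional


def text_density(extracted_text: str, file_size_bytes: int) -> float:
    """Characters of extracted text per KB of file size (chars/KB)."""
    if file_size_bytes <= 0:
        return 0.0
    return len(extracted_text) / (file_size_bytes / 1024)


def script_of(ch: str) -> Optional[str]:
    """Map one character to a coarse script bucket by Unicode block."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if "a" <= ch <= "z" or "A" <= ch <= "Z":
        return "latin"
    return None  # digits, punctuation and whitespace are ignored


def script_mix(text: str, dominant: float = 0.7, minor: float = 0.1) -> str:
    """Classify text by script ratios (thresholds are illustrative assumptions)."""
    counts = Counter(s for s in map(script_of, text) if s)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    hangul = counts["hangul"] / total
    cjk = (counts["cjk"] + counts["hiragana"] + counts["katakana"]) / total
    latin = counts["latin"] / total
    if hangul >= dominant:
        return "korean_dominant"
    if cjk >= dominant:
        return "cjk_dominant"
    if latin >= dominant:
        return "latin_dominant"
    if hangul >= minor and cjk >= latin:
        return "mixed_korean_cjk"
    if hangul >= minor:
        return "mixed_korean_latin"
    return "unknown"
```

Note that Python's len() counts code points, which matches a character-based LENGTH() only for text without combining sequences; for band assignment that difference is unlikely to matter.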

Sample distribution (measured 2026-05-02):
  bucket_label: born_digital=12, mixed=5, existing_calibration=4,
                handwritten=3, scan_likely=3, large=2, existing_anchor=1
  doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4,
            Reference=3, Manual=3, Drawing=1, Report=1
  file_size_band: M=14, S=12, L=4
  text_density_band: born-digital=15, scan-likely=9, mixed=6
  handwritten_hint: lo=26, hi=4 (about 13x over-sampled vs. 1.1% of the population)
  forced anchor doc 4809 = density 0.17 (the document from the user's visual check)
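
A distribution like the one above, with rare strata over-sampled and one anchor always present, could come from a seeded stratified draw along these lines. This is a sketch under assumptions: the round-robin allocation, the dict-based document records, and the function name are hypothetical, and only the seed value comes from this commit message.

```python
import random
from collections import defaultdict

SAMPLE_SEED = 20260502  # seed value quoted in the commit message


def stratified_sample(docs, strata_keys, n_total, forced_ids=(), seed=SAMPLE_SEED):
    """Pick n_total docs: forced ids first, then round-robin across strata."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    by_id = {d["id"]: d for d in docs}
    picked = [by_id[i] for i in forced_ids if i in by_id]
    picked_ids = {d["id"] for d in picked}

    strata = defaultdict(list)
    # Sort by id first so the result does not depend on input order.
    for d in sorted(docs, key=lambda d: d["id"]):
        if d["id"] not in picked_ids:
            strata[tuple(d[k] for k in strata_keys)].append(d)
    keys = sorted(strata)
    for key in keys:
        rng.shuffle(strata[key])

    # Round-robin so sparse strata (e.g. handwritten) are not crowded out.
    while len(picked) < n_total and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(picked) < n_total:
                picked.append(strata[k].pop())
    return picked
```

Taking one document per stratum per pass is what produces the over-sampling: a stratum holding 1% of the population still gets the same number of early picks as a stratum holding 50%.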

New subcommand:
  eval_template: pilot_1d_eval.csv skeleton (5 rubric axes scored 1-5 +
  overall_pass + notes). The user fills in the scores while toggling
  between the MarkdownDoc and PDF views.
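
As an illustration of what such a skeleton might look like, the sketch below writes one blank row per sampled document and checks a filled-in row. The rubric axis names are placeholders (the commit message does not list them), and the column order and helper names are assumptions.

```python
import csv
import io

# Placeholder axis names: the commit only says "5 rubric axes scored 1-5".
RUBRIC_AXES = ["axis_1", "axis_2", "axis_3", "axis_4", "axis_5"]
EVAL_COLUMNS = ["doc_id", *RUBRIC_AXES, "overall_pass", "notes"]


def write_eval_template(doc_ids, out):
    """Write a header plus one row per doc_id; score columns stay blank."""
    writer = csv.DictWriter(out, fieldnames=EVAL_COLUMNS)
    writer.writeheader()
    for doc_id in doc_ids:
        writer.writerow({"doc_id": doc_id})  # missing keys default to ""


def row_problems(row):
    """Return the rubric axes whose value is not an integer from 1 to 5."""
    bad = []
    for axis in RUBRIC_AXES:
        value = (row.get(axis) or "").strip()
        if not (value.isdigit() and 1 <= int(value) <= 5):
            bad.append(axis)
    return bad


if __name__ == "__main__":
    buf = io.StringIO()
    write_eval_template([4809, 3759], buf)
    print(buf.getvalue())
```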

The existing cmd_enqueue (snapshot/backup/dedup) and cmd_report (quality
metrics) are kept.

Deliverables:
  scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include +
    eval_template subcommand. CSV+JSON dual output.
  evals/markdown/README.md: rubric + decision matrix + workflow guide.
  evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (seeded result,
    kept for reproducibility).
  evals/markdown/pilot_1d_eval.csv: empty skeleton (filled in after the
    user's evaluation).

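Execution boundary:
  Steps 1-3 (selection / template / dry-run) = completed by this PR.
  Step 4 (--yes enqueue, actually queuing the 30 documents for markdown
  conversion) = run separately, after the user approves the timing,
  inside the nightly one-off sweep window (23:00-03:00 KST).
  marker-service BATCH_SIZE=1; 30 documents at an average of 5 min each ≈ 2.5 h.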
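
The runtime estimate and the window constraint above can be checked with a few lines of arithmetic. The helper names below are hypothetical and the sketch assumes strictly sequential processing (BATCH_SIZE=1) at a flat 5 minutes per document.

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))


def estimated_hours(n_docs, minutes_per_doc=5.0):
    """Sequential runtime in hours at a flat per-document cost."""
    return n_docs * minutes_per_doc / 60


def fits_window(start, n_docs, minutes_per_doc=5.0, open_hour=23, close_hour=3):
    """True if a run started at `start` (KST) begins and ends inside the window."""
    in_window = start.hour >= open_hour or start.hour < close_hour
    close = start.replace(hour=close_hour, minute=0, second=0, microsecond=0)
    if start.hour >= open_hour:
        close += timedelta(days=1)  # window closes after midnight
    end = start + timedelta(minutes=n_docs * minutes_per_doc)
    return in_window and end <= close


if __name__ == "__main__":
    start = datetime(2026, 5, 2, 23, 0, tzinfo=KST)
    print(estimated_hours(30), fits_window(start, 30))
```

With these assumptions the 30-document sweep must start by 00:30 KST to finish before the window closes.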

Verify:
  Ran select in the fastapi container on the GPU server → 30-row sample
  CSV generated. Confirmed the eval_template subcommand works. Ran
  enqueue as a dry-run: printed the 30 doc_ids + snapshot, then confirmed
  the user-cancel branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-02 16:15:09 +09:00

app

feat(study): Phase 4-A explanation_md length cap + stronger prompt

2026-05-02 08:33:18 +09:00

docs

feat(dashboard): §4 - category/suggestion/queue lag cards + docs/categories.md

2026-04-24 07:09:37 +09:00

evals

feat(scripts): Phase 1D Round 2 — controlled backfill stratification

2026-05-02 16:15:09 +09:00

frontend

feat(frontend): Phase 1C - markdown viewer complete (PDF integration + status badge + image placeholder)

2026-05-02 15:38:45 +09:00

gpu-server

infra: migrate application from Mac mini to GPU server

2026-04-03 07:47:09 +09:00

import_reports

fix(study): Phase 1 migrations 222-225 → 226-229 - avoid collision with markdown canonical layer 222

2026-05-01 09:32:16 +09:00

migrations

ops(guardrails): activate migration 142 ask_events.source NOT NULL

2026-05-02 16:12:38 +09:00

reports

docs(search): Phase 2 final measurement report (phase2_final.md + csv A/B)

2026-04-08 15:52:21 +09:00

scripts

feat(scripts): Phase 1D Round 2 — controlled backfill stratification

2026-05-02 16:15:09 +09:00

services

fix(canonical): marker engine_version via importlib.metadata

2026-05-01 00:19:46 +00:00

tests

fix(tests): explanation cap test setup - compensate for Korean chunks that were too short

2026-05-02 08:35:34 +09:00

.gitignore

ops(repo): results/ artifacts/ gitignore (eval calibration outputs)

2026-04-17 08:11:06 +09:00

Caddyfile

fix(ui): configure document-caddy trusted_proxies (resolves mixed-content)

2026-04-24 07:29:45 +09:00

CLAUDE.md

fix(search): weaken soft_filter boost (domain 0.01, doctype removed)

2026-04-08 15:40:04 +09:00

config.yaml

ops(infra): move STT to Mac mini + restore classifier section (gemma4:e4b)

2026-04-24 10:08:00 +09:00

credentials.env.example

feat(verifier): Phase 3.5 B2 — numeric_conflict promote (env flag) + Tier 4

2026-04-17 08:11:06 +09:00

docker-compose.yml

feat(canonical): Phase 1B marker-service + marker_worker for PDF→markdown (222)

2026-05-01 00:06:23 +00:00

domain_policy.yaml

feat(policy): domain_policy.yaml v1 (safety_health + news)

2026-04-24 09:34:48 +09:00

README.md

infra: migrate application from Mac mini to GPU server

2026-04-03 07:47:09 +09:00

THIRD_PARTY_LICENSES.md

feat(study): iPad handwriting study session frontend (Phase 1)

2026-04-27 08:30:28 +09:00

README.md

hyungi_Document_Server

Self-hosted personal knowledge management (PKM) web application

Tech stack

Backend: FastAPI + SQLAlchemy (async)
Database: PostgreSQL 16 + pgvector + pg_trgm
Frontend: SvelteKit
Document parsing: kordoc (HWP/HWPX/PDF → Markdown)
AI: Qwen3.5-35B-A3B (MLX), nomic-embed-text, Claude API (fallback)
Infrastructure: Docker Compose, Caddy, Synology NAS

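Key features

Automatic document classification/tagging/summarization (AI-based)
Full-text search + vector similarity search
HWP/PDF/Markdown document viewer
Monitoring of legal amendments (Occupational Safety and Health Act, etc.)
Automatic email collection (MailPlus IMAP)
Daily digest
CalDAV task integration (Synology Calendar)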

Quick Start

git clone https://git.hyungi.net/hyungi/hyungi_document_server.git hyungi_Document_Server
cd hyungi_Document_Server

# Configure credentials
cp credentials.env.example credentials.env
nano credentials.env  # fill in the real values

# Run
docker compose up -d

API documentation is available at http://localhost:8000/docs

Directory structure

├── app/              FastAPI backend (API, workers, AI clients)
├── frontend/         SvelteKit frontend
├── services/kordoc/  Document parsing microservice (Node.js)
├── gpu-server/       GPU server deployment (AI Gateway)
├── migrations/       PostgreSQL schema
├── docs/             Design documents, deployment guide
└── tests/            Test code

Infrastructure

Server	Role
Mac mini M4 Pro	Docker Compose (FastAPI, PostgreSQL, kordoc, Caddy) + MLX AI
Synology NAS	Original file storage, Synology Office/Drive/Calendar/MailPlus
GPU server	AI Gateway, vector embeddings, OCR, reranking

Documentation

Architecture: overall system design
Deployment guide: how to deploy with Docker Compose
Development phases: Phase 0-5 development plan

Languages

Python 67%

Svelte 23.1%

Swift 5.3%

TypeScript 3.2%

Shell 0.5%

Other 0.9%