hyungi_document_server

Author	SHA1	Message	Date
Hyungi Ahn	b09687d41d	feat(scripts): Phase 1D Round 2: controlled backfill stratification. Augments the existing phase1d_pilot.py (simple ai_domain × file_size 3-bucket) with the 4 axes, the sample_source split, and forced_include from plan ~/.claude/plans/stratified-mingling-otter.md. Limitation of Round 1 (ai_domain × file_size 3-bucket): it reflects only the natural distribution of pending PDFs, so known weak spots (handwriting / scans / mixed CJK OCR) never enter the sample. In the 1C visual check, doc 4809 (Note_240805_용접교육 필기) actually showed that pattern; if selection is left to the natural distribution, the same case risks being missed again in the next round. Round 2 design: - 4-axis stratification: doc_type × file_size_band × text_density_band × handwritten_hint - sample_source ∈ {existing_success(5), controlled_backfill(25)} - forced_include doc 4809: known bad anchor. Enables a 1:1 comparison against the re-conversion of the same document after later tuning or the introduction of an alternative. - text_density = LENGTH(extracted_text) / (file_size / 1024) chars/KB, the cleanest single proxy. Validated at both extremes: 0.17 (handwritten 4809) ↔ 94 (born-digital 3759). - script_mix proxy: Hangul/CJK/Hiragana/Katakana/Latin Unicode block ratio → korean_dominant / mixed_korean_cjk / mixed_korean_latin / cjk_dominant / latin_dominant / unknown. - page_count_estimate: existing_success uses md_extraction_quality.metrics.source_page_count. controlled_backfill is NULL (marker re-reads the file with PyMuPDF anyway). - Seed fixed at SAMPLE_SEED=20260502 to guarantee reproducibility. Sample distribution (measured 2026-05-02): bucket_label: born_digital=12, mixed=5, existing_calibration=4, handwritten=3, scan_likely=3, large=2, existing_anchor=1 doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4, Reference=3, Manual=3, Drawing=1, Report=1 file_size_band: M=14, S=12, L=4 text_density_band: born-digital=15, scan-likely=9, mixed=6 handwritten_hint: lo=26, hi=4 (13× over-sample relative to 1.1% of the population) forced anchor doc 4809 = density 0.17 (the document from the user's visual check) New subcommand: eval_template: pilot_1d_eval.csv skeleton (5-axis rubric scored 1~5 + overall_pass + notes). The user fills in scores while toggling between MarkdownDoc and the PDF. The existing cmd_enqueue (snapshot/backup/dedup) and cmd_report (quality metrics) are kept. Deliverables: scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include + eval_template subcommand. CSV+JSON dual output. evals/markdown/README.md: rubric + decision matrix + workflow guide. evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (seeded result, kept for reproducibility). evals/markdown/pilot_1d_eval.csv: empty skeleton (filled in after the user's evaluation). Execution boundary: Steps 1~3 (selection / template / dry-run) = completed by this PR. Step 4 (--yes enqueue, actually putting the 30 docs on the markdown queue) = run separately, after the user approves the timing, inside the nightly one-shot sweep window (23:00~03:00 KST). marker-service BATCH_SIZE=1; 30 docs at an average of 5 min/doc ≈ 2.5h. Verify: ran select in the fastapi container on the GPU server → 30-doc sample CSV generated. Confirmed the eval_template subcommand works. Confirmed with an enqueue dry-run that the 30 doc_ids + snapshot are printed and that the user-cancel branch is taken. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 16:15:09 +09:00
Hyungi Ahn	7cab78e490	ops(canonical): backup + targets + md_status snapshot before Phase 1D enqueue. Leaves three traces immediately before enqueue starts: (1) a timestamped copy of /tmp/phase1d_pilot.json (in case of a re-run) (2) a one-line printout of the 30 target document_ids (3) a saved JSON snapshot of the documents.md_status distribution Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:00:33 +09:00
Hyungi Ahn	3e831a2dc7	fix(canonical): Phase 1D script sys.path: /app/scripts/.. is the PYTHONPATH root. The fastapi container has WORKDIR=/app with the code unpacked directly into it, so there is no app/ directory. The ../app pattern from backfill_category.py resolves to /app/app (nonexistent) inside the container and raises ModuleNotFoundError. The script now adds the .. of its own directory to sys.path, exposing the /app root. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 09:50:23 +09:00
Hyungi Ahn	f98cf2e505	ops(canonical): Phase 1D marker pilot one-shot script (select/enqueue/report). A stratified pilot limited to 30 docs. Measures baseline markdown quality before deciding on the Phase 2 full backfill. Not a permanent worker path. Target WHERE: deleted_at IS NULL AND file_format='pdf' AND md_status='pending' AND category='document' AND document_type NOT IN SKIP_DOC_TYPES (consistent with marker_worker) Stratification: ai_domain × file_size_bucket (small<500KB / medium<5MB / large). documents has no page_count column (marker_worker measures it dynamically with PyMuPDF) → file_size is used as a length proxy. Within each cell, small and large file_size values are mixed to observe differences between short and long documents. Subcommands: select: 30-doc dry-run + JSON save (/tmp/phase1d_pilot.json); enqueue: enqueue onto the markdown queue (skip on uq_queue_active conflict); report: md_status / average elapsed / top-5 failures / heading anchor candidates / KaTeX candidates / success ratio per file_size bucket / UI review URL report. Note: markdown_image_count is currently 0, which is expected because server.py discards _images. It activates automatically once Phase 1B.5 outputs _images. Run: docker compose exec fastapi python /app/scripts/phase1d_pilot.py select docker compose exec fastapi python /app/scripts/phase1d_pilot.py enqueue --yes docker compose exec fastapi python /app/scripts/phase1d_pilot.py report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 09:49:17 +09:00
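
The Round 1 stratification described in commit f98cf2e505 (ai_domain × file_size_bucket, seeded selection) could look roughly like the sketch below. This is not the code in scripts/phase1d_pilot.py: the function names, the dict-shaped rows, the round-robin draw, and the reading of "500KB" / "5MB" as binary units (500 × 1024 and 5 × 1024² bytes) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

KB = 1024          # assumption: binary units; the commit only says "500KB" / "5MB"
MB = 1024 * KB


def file_size_bucket(size_bytes: int) -> str:
    """Map a file size to the small / medium / large buckets named in the commit."""
    if size_bytes < 500 * KB:
        return "small"
    if size_bytes < 5 * MB:
        return "medium"
    return "large"


def stratified_sample(docs, total, seed):
    """Hypothetical helper: draw `total` docs round-robin across
    (ai_domain, file_size_bucket) cells, reproducibly for a given seed."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for doc in docs:
        cells[(doc["ai_domain"], file_size_bucket(doc["file_size"]))].append(doc)
    for members in cells.values():
        rng.shuffle(members)

    picked = []
    keys = sorted(cells)  # fixed cell order keeps the draw deterministic
    while len(picked) < total and any(cells[k] for k in keys):
        for k in keys:
            if cells[k] and len(picked) < total:
                picked.append(cells[k].pop())
    return picked
```

A round-robin draw gives every non-empty cell a chance to contribute before any cell contributes twice, which is one simple way to keep a 30-document sample from being dominated by the largest cell.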

4 Commits
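
The two Round 2 proxies from commit b09687d41d, text_density (characters per KB of file size) and script_mix (Unicode block ratios), can be sketched as follows. Only the density formula and the six script_mix labels come from the commit message; the function names, the Unicode ranges chosen for each block, and the 0.7 / 0.2 thresholds are hypothetical and are not taken from phase1d_pilot.py.

```python
from collections import Counter

DOMINANT = 0.7  # hypothetical threshold: share needed to call a script "dominant"
MINOR = 0.2     # hypothetical threshold: share needed to count toward a "mixed" label


def text_density(extracted_text: str, file_size_bytes: int) -> float:
    """chars/KB, i.e. LENGTH(extracted_text) / (file_size / 1024)."""
    if file_size_bytes <= 0:
        return 0.0
    return len(extracted_text) / (file_size_bytes / 1024)


def _script_of(ch: str):
    """Coarse Unicode-block lookup; returns None for digits, spaces, punctuation."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:                      # CJK Unified Ideographs
        return "cjk"
    if 0x3040 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF:
        return "cjk"                                # Hiragana / Katakana folded into CJK
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def script_mix(text: str) -> str:
    """Classify text into one of the six labels listed in the commit message."""
    counts = Counter(s for s in map(_script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    hangul = counts["hangul"] / total
    cjk = counts["cjk"] / total
    latin = counts["latin"] / total
    if hangul >= DOMINANT:
        return "korean_dominant"
    if cjk >= DOMINANT:
        return "cjk_dominant"
    if latin >= DOMINANT:
        return "latin_dominant"
    if hangul >= MINOR and cjk >= MINOR and cjk >= latin:
        return "mixed_korean_cjk"
    if hangul >= MINOR and latin >= MINOR:
        return "mixed_korean_latin"
    return "unknown"
```

For example, `text_density("a" * 1024, 1024 * 1024)` gives 1.0 chars/KB. Where the text_density_band cutoffs between scan-likely, mixed, and born-digital lie is not stated in the commit, so the sketch stops at the raw ratio.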