feat(scripts): Phase 1D Round 2 — controlled backfill stratification

Augment the existing phase1d_pilot.py (plain ai_domain × file_size 3-bucket) with the 4 axes, sample_source split, and forced_include from plan ~/.claude/plans/stratified-mingling-otter.md.

Limitation of Round 1 (ai_domain × file_size 3-bucket): it reflects only the natural distribution of pending PDFs, so the known weak spots (handwriting / scans / mixed Korean-Chinese-Japanese OCR) never enter the sample. In the 1C visual check, doc 4809 (Note_240805_용접교육 필기) actually showed that pattern; if selection is left to chance, the same case risks being missed again next round.

Round 2 design:
- 4-axis stratification: doc_type × file_size_band × text_density_band × handwritten_hint
- sample_source ∈ {existing_success(5), controlled_backfill(25)}
- forced_include doc 4809: known bad anchor. Allows a 1:1 comparison with a re-conversion of the same document after the next tuning pass or alternative is introduced.
- text_density = LENGTH(extracted_text) / (file_size / 1024) chars/KB: the cleanest single proxy. Verified at both ends: 0.17 (handwritten 4809) ↔ 94 (born-digital 3759).
- script_mix proxy: Hangul/CJK/Hiragana/Katakana/Latin Unicode block ratio → hangul_dominant / cjk_dominant / latin_dominant / mixed_cjk_hangul / mixed_hangul_latin / unknown (labels as emitted by _script_mix).
- page_count_estimate: existing_success uses md_extraction_quality.metrics.source_page_count; controlled_backfill is NULL (marker re-reads the file with PyMuPDF anyway).
- Seed fixed at SAMPLE_SEED=20260502 for reproducibility.

Sample distribution (measured 2026-05-02):
  bucket_label: born_digital=12, mixed=5, existing_calibration=4, handwritten=3, scan_likely=3, large=2, existing_anchor=1
  doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4, Reference=3, Manual=3, Drawing=1, Report=1
  file_size_band: M=14, S=12, L=4
  text_density_band: born-digital=15, scan-likely=9, mixed=6
  handwritten_hint: lo=26, hi=4 (4/30 ≈ 13% vs. 1.1% of the population, roughly a 12x over-sample)
  forced anchor doc 4809 = density 0.17 (the document from the user's visual check)

New subcommand: eval_template writes the pilot_1d_eval.csv skeleton (5 rubric axes scored 1~5 + overall_pass + notes). The user fills in scores while toggling between MarkdownDoc and the PDF. The existing cmd_enqueue (snapshot/backup/dedup) and cmd_report (quality metrics) are kept.

Deliverables:
- scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include + eval_template subcommand. CSV+JSON dual output.
- evals/markdown/README.md: rubric + decision matrix + workflow guide.
- evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (seeded result, kept for reproducibility).
- evals/markdown/pilot_1d_eval.csv: empty skeleton (filled in after the user evaluation).

Execution boundary: Steps 1~3 (selection / template / dry-run) are completed by this PR. Step 4 (--yes enqueue, actually pushing the 30 documents into the markdown queue) runs separately, after the user approves the timing and inside the nightly one-shot sweep window (23:00~03:00 KST). marker-service BATCH_SIZE=1; 30 documents at an average of 5 min each ≈ 2.5h.

Verify: running select in the fastapi container on the GPU server produced the 30-row sample CSV. The eval_template subcommand was confirmed working. The enqueue dry-run printed 30 doc_ids + snapshot, and the user-cancel branch was confirmed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
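
As a worked check of the text_density definition in the message above, here is a minimal Python sketch. It mirrors the band edges that appear in the diff (scan-likely below 5, mixed from 5 up to 50, born-digital from 50 chars/KB); the function names `text_density` and `density_band` are illustrative and are not the script's own helpers. The input numbers are the text_len and file_size of doc 4809 from the sample CSV.

```python
from __future__ import annotations

# Band edges as they appear in the diff: half-open intervals [lo, hi) in chars/KB.
BANDS = [
    ("scan-likely", 0.0, 5.0),
    ("mixed", 5.0, 50.0),
    ("born-digital", 50.0, float("inf")),
]


def text_density(text_len: int, file_size: int | None) -> float | None:
    """Characters of extracted text per KB of file; None if the size is missing or zero."""
    if not file_size or file_size <= 0:
        return None
    return text_len / (file_size / 1024.0)


def density_band(density: float | None) -> str:
    """Map a density to its band label; 'unknown' when the density is unavailable."""
    if density is None:
        return "unknown"
    for name, lo, hi in BANDS:
        if lo <= density < hi:
            return name
    return "unknown"


# doc 4809 in the sample CSV: text_len=177, file_size=1089182 bytes.
anchor = text_density(177, 1089182)
print(round(anchor, 3), density_band(anchor))
```

Because the intervals are half-open, a density of exactly 50 falls in born-digital, not mixed.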
2026-05-02 16:15:09 +09:00
parent 91e7a64713
commit b09687d41d
4 changed files with 616 additions and 79 deletions
@@ -0,0 +1,119 @@
+# Phase 1D: Markdown Conversion Pilot Evaluation
+
+> Plan: `~/.claude/plans/stratified-mingling-otter.md`
+> Script: `scripts/phase1d_pilot.py` (subcommands: select / enqueue / report / eval_template)
+
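+## Purpose
+
+Use a 30-document stratified sample to discover the **kinds of failure modes** marker-pdf has. The sample is a **diagnostic tool**, not a statistically representative one. The results decide the next branch:
+- Can we move on to the Phase 2 full backfill?
+- Do the SKIP rules need to be extended?
+- Should Marker tuning or an alternative (kordoc / OCR-preprocessing hybrid) come first?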
+
+## Sample composition
+
+`pilot_1d_sample.csv`: 30 rows × 15 columns. Seed fixed at `20260502`.
+
+### sample_source split
+
+| sample_source | n | Meaning |
+|---|---|---|
+| `existing_success` | 5 | PDFs that were already converted successfully. Forced anchor (doc 4809 `Note_240805_용접교육 필기`) + 4 calibration documents. **Anchor for judging improvement by comparing against a re-conversion of the same documents after the pilot.** |
+| `controlled_backfill` | 25 | New conversions drawn from the 262 pending documents by 4-axis stratification. Distribution: handwritten 3 / scan_likely 2~3 / mixed 5 / born_digital 12 / large 2 |
+
+### 4-axis stratification
+
+| Axis | Buckets |
+|---|---|
+| `doc_type` | study_note / Academic_Paper / Reference / Note / Manual / Standard / Specification / NULL |
+| `file_size_band` | S (<1MB) / M (1MB to <10MB) / L (≥10MB) |
+| `text_density_band` | scan-likely (<5 chars/KB) / mixed (5 to <50) / born-digital (≥50) |
+| `handwritten_hint` | hi (title/path matches, case-insensitive: 필기/노트/handwritten/scan/스캔/note) / lo |
+
+Auxiliary columns: `script_mix` (Hangul/CJK/Latin ratio label), `page_count_estimate` (filled only for existing_success), `forced_include_reason`.
+
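+## Rubric (user evaluation, scored 1~5)
+
+For each sample document, compare the **MarkdownDoc viewer against the original PDF via the toggle** and record 5 axis scores + a boolean + notes:
+
+| Axis | Definition | Score 1 | Score 5 |
+|---|---|---|---|
+| **text_accuracy** | OCR/extraction accuracy | Hard to read, frequent ghost text | Nearly identical to the original, 1~2 OCR typos |
+| **structure** | Preservation of heading/list/table structure | Structure completely lost, one block of text | Heading hierarchy and table rows match the original |
+| **noise_rate** | Meaningless repetition / garbage tokens | 30%+ of the body is noise | Almost no noise |
+| **multi_script** | Accuracy on mixed Korean/Chinese/Japanese and special characters | Mojibake from the wrong script | Original scripts preserved as-is |
+| **completeness** | Missing body content | More than half of the pages missing | Nothing missing |
+
+`overall_pass` (true/false): an intuitive judgment of "is this markdown good enough for search and reference?" Kept separately from the sum of rubric scores.
+
+`notes`: free-form text. In particular, state explicitly when a known failure pattern (e.g. repeated `TO STAND 12/4`, Korean/Chinese/Japanese mojibake) is reproduced.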
+
+## Evaluation workflow
+
+### 0. Pre-eval
+
+If `evals/markdown/pilot_1d_eval.csv` is empty (or a new round is starting), generate the skeleton:
+
+```bash
+ssh hyungi@100.111.160.84 \
+  "docker compose -f ~/Documents/code/hyungi_Document_Server/docker-compose.yml \
+   exec fastapi python /app/scripts/phase1d_pilot.py eval_template \
+   --in /tmp/phase1d_pilot.json \
+   --csv /app/evals/markdown/pilot_1d_eval.csv"
+```
+
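+### 1. Evaluate one document at a time
+
+Open `https://document.hyungi.net/documents/<doc_id>` in a browser:
+1. Check the default view (Markdown or PDF iframe, depending on `canShowMarkdown`)
+2. Click the PDF-original toggle and compare against the PDF
+3. Score the 5 axes (1~5)
+4. Decide `overall_pass` true/false
+5. Record any failure patterns found in notes
+6. Enter the results in `evals/markdown/pilot_1d_eval.csv`
+
+Splitting the work into 3 sessions of 10 documents is recommended (about 2.5h of reviewer time in total).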
+
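+### 2. Decision matrix
+
+Based on the distribution across the 30 evaluated documents:
+
+| Result pattern | Next action |
+|---|---|
+| overall_pass ≥ 25/30 (83%+) across all areas | Write the main plan for the Phase 2 full backfill. No SKIP rule extension needed. |
+| overall_pass 20~24 and only a specific area (e.g. handwriting) fails | Exclude the weak area via SKIP_DOC_TYPES / a source_kind heuristic → full backfill of the rest |
+| overall_pass < 20 or a systemic defect (e.g. multi_script fails across the board) | Tune Marker settings or try an alternative (kordoc vs marker comparison, added OCR preprocessing): redesign Phase 1B |
+| Backfill itself fails > 10% of the time (failed/timeout) | Stabilize marker-service first. Put the 1D evaluation on hold. |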
+
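+### 3. Anchor comparison
+
+The evaluation of `existing_anchor` (doc 4809) is compared 1:1 with a re-conversion of the same document in the next round (after Marker tuning or an alternative is introduced). Whether its scores improve is the cleanest signal of the tuning effect.
+
+### 4. Cross-check against Marker's own metrics
+
+`md_extraction_quality.metrics` (markdown_heading_count / markdown_table_row_count / text_length_ratio, etc.) is Marker's self-diagnosis. Compare it with the human evaluation:
+- Marker reports "tables=237" but the human structure score is 1 → self-diagnosis false positive
+- text_length_ratio < 1 but the human completeness score is 5 → the ratio may not be a good proxy
+
+Mismatches like these are the starting point for defining `md_extraction_quality.score` (currently always null).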
+
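+## Files
+
+| File | Role | Updated when |
+|---|---|---|
+| `pilot_1d_sample.csv` | Definition of the 30-document sample (selection result). Reproducible with seed `20260502`. | Commit of the select result (once) |
+| `pilot_1d_eval.csv` | User evaluation results (rubric scores + overall_pass + notes) | Commit when the user evaluation is finished |
+| `README.md` | This guide | Initial commit |
+
+## Execution environment
+
+Run inside the fastapi container on the GPU server; access to the DB, the NAS NFS mount, and the md_extraction_quality JSONB is required: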
+
+```bash
+ssh hyungi@100.111.160.84
+cd ~/Documents/code/hyungi_Document_Server
+docker compose exec fastapi python /app/scripts/phase1d_pilot.py select \
+  --csv /app/evals/markdown/pilot_1d_sample.csv
+```
+
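+**Running enqueue with `--yes` or `--no-dry-run`-style flags requires separate user approval and must happen only inside the nightly one-shot sweep window (23:00~03:00 KST).** A 30-document backfill = marker-service BATCH_SIZE=1 × an average of 5 min per document ≈ 2.5h.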
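
The decision matrix in the README above can be read as a small rule table. The following is a rough Python sketch of one possible reading, not part of the committed script: the function name `next_action`, its parameters, and the returned strings are all hypothetical, and `weak_areas` / `systemic` stand for the reviewer's own judgment, which the README does not formalize.

```python
from __future__ import annotations


def next_action(
    passes: int,
    total: int,
    backfill_failures: int,
    weak_areas: list[str],
    systemic: bool,
) -> str:
    """Map pilot results to one row of the decision matrix (hypothetical helper)."""
    # Backfill failure rate above 10%: stabilize the service before evaluating.
    if total and backfill_failures / total > 0.10:
        return "stabilize marker-service"
    # Fewer than 20 passes, or a defect that cuts across all areas.
    if passes < 20 or systemic:
        return "tune Marker or evaluate alternatives"
    # 25 or more passes with no weak area singled out.
    if passes >= 25 and not weak_areas:
        return "plan Phase 2 full backfill"
    # Otherwise, failures concentrated in specific areas.
    if weak_areas:
        return "skip weak areas, backfill the rest"
    # 20~24 passes with no identifiable weak area is not covered by the matrix.
    return "no matching row; decide manually"


print(next_action(22, 30, 1, ["handwritten"], False))
```

The failure-rate check comes first because the README puts the 1D evaluation on hold in that case; the order of the remaining checks is an assumption.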
@@ -0,0 +1,31 @@
+doc_id,title,sample_source,bucket_label,text_accuracy,structure,noise_rate,multi_script,completeness,overall_pass,notes
+4809,Note_240805_용접교육 필기,existing_success,existing_anchor,,,,,,,
+5248,작업자 재난안전사고 예방을 위한 위험성평가 기법 연구,existing_success,existing_calibration,,,,,,,
+4068,공업역학 동역학(제13판)_Chapter 21 3차원 강체 운동역학,existing_success,existing_calibration,,,,,,,
+5189,VIII-1_08-UB,existing_success,existing_calibration,,,,,,,
+5141,Structural Analysiss and Design of Process Equipment_00_Contents,existing_success,existing_calibration,,,,,,,
+4815,Note_240830_소음진동교육 필기,controlled_backfill,handwritten,,,,,,,
+4798,Note_240528_다이아프람워크숍,controlled_backfill,handwritten,,,,,,,
+4813,Note_240827_필기,controlled_backfill,handwritten,,,,,,,
+5151,THE PIPE FABRICATORS BLUE BOOK,controlled_backfill,scan_likely,,,,,,,
+5268,황현필의 진보를 위한 역사_6장 제주4-3사건의 왜국을 멈추라,controlled_backfill,scan_likely,,,,,,,
+5127,표준기계설계(KS)_08_핀,controlled_backfill,scan_likely,,,,,,,
+8855,2월 26일,controlled_backfill,mixed,,,,,,,
+4061,공업역학 동역학(제13판)_Chapter 14 질점의 운동역학_일과 에너지,controlled_backfill,mixed,,,,,,,
+3782,"Safety and Health for Engineers_02_5 Local, International, and Voluntary Laws, Regulations, and Standards",controlled_backfill,mixed,,,,,,,
+5179,Hydrogen-Embrittlement,controlled_backfill,mixed,,,,,,,
+5133,압력용기 핸드북_기타,controlled_backfill,mixed,,,,,,,
+3757,Industrial Safety and Health Management(7-ED)_2 Development of the safety and Health Function,controlled_backfill,born_digital,,,,,,,
+3758,Industrial Safety and Health Management(7-ED)_3 Concepts of Hazard Avoidance,controlled_backfill,born_digital,,,,,,,
+5163,국내 지속가능경영보고서의 노동인권 분야에 대한 실태 분석,controlled_backfill,born_digital,,,,,,,
+5167,우리나라 기업의 환경정보 공시 현황과 제도적 개선방안,controlled_backfill,born_digital,,,,,,,
+5154,국내 금속가공 중소기업의 스마트팩토리 활용 정도에 대한 실증적 연구,controlled_backfill,born_digital,,,,,,,
+5155,스마트 팩토리의 전략적 활용 연구,controlled_backfill,born_digital,,,,,,,
+5137,Pressure Vessel Design Manual_01 General Topics,controlled_backfill,born_digital,,,,,,,
+5211,PTB-4-2013_00_Foreword,controlled_backfill,born_digital,,,,,,,
+5178,Hydrogen_Piping_and_Pipelines_ASME_Code,controlled_backfill,born_digital,,,,,,,
+5168,TCoYourPaperlessOffice-4.0,controlled_backfill,born_digital,,,,,,,
+3765,Industrial Safety and Health Management(7-ED)_10 Environmental Control and Noise,controlled_backfill,born_digital,,,,,,,
+3769,Industrial Safety and Health Management(7-ED)_14 Materials Handling and Storage,controlled_backfill,born_digital,,,,,,,
+5274,황현필의 진보를 위한 역사_12장 대한민국의 정신을 훼손하지 말라,controlled_backfill,large,,,,,,,
+5180,ASME Sec I 2025,controlled_backfill,large,,,,,,,
@@ -0,0 +1,31 @@
+doc_id,title,sample_source,forced_include_reason,bucket_label,doc_type,file_size,file_size_band,text_len,text_density,text_density_band,handwritten_hint,scan_likely,script_mix,page_count_estimate
+4809,Note_240805_용접교육 필기,existing_success,known_bad_handwritten_anchor,existing_anchor,Note,1089182,M,177,0.166,scan-likely,hi,true,latin_dominant,
+5248,작업자 재난안전사고 예방을 위한 위험성평가 기법 연구,existing_success,,existing_calibration,Academic_Paper,942262,S,3191,3.468,scan-likely,lo,true,unknown,
+4068,공업역학 동역학(제13판)_Chapter 21 3차원 강체 운동역학,existing_success,,existing_calibration,Academic_Paper,5429661,M,45253,8.534,mixed,lo,false,mixed_hangul_latin,
+5189,VIII-1_08-UB,existing_success,,existing_calibration,Standard,140322,S,40457,295.235,born-digital,lo,false,unknown,
+5141,Structural Analysiss and Design of Process Equipment_00_Contents,existing_success,,existing_calibration,Reference,520220,S,31883,62.758,born-digital,lo,false,latin_dominant,
+4815,Note_240830_소음진동교육 필기,controlled_backfill,,handwritten,Drawing,12659094,L,3524,0.285,scan-likely,hi,true,unknown,
+4798,Note_240528_다이아프람워크숍,controlled_backfill,,handwritten,Note,236840,S,1030,4.453,scan-likely,hi,true,hangul_dominant,
+4813,Note_240827_필기,controlled_backfill,,handwritten,Note,710770,S,43,0.062,scan-likely,hi,true,unknown,
+5151,THE PIPE FABRICATORS BLUE BOOK,controlled_backfill,,scan_likely,Manual,40063084,L,136448,3.488,scan-likely,lo,true,unknown,
+5268,황현필의 진보를 위한 역사_6장 제주4-3사건의 왜국을 멈추라,controlled_backfill,,scan_likely,Note,6188759,M,8746,1.447,scan-likely,lo,true,hangul_dominant,
+5127,표준기계설계(KS)_08_핀,controlled_backfill,,scan_likely,Standard,6703655,M,10423,1.592,scan-likely,lo,true,unknown,
+8855,2월 26일,controlled_backfill,,mixed,Report,121611,S,2048,17.245,mixed,lo,false,unknown,
+4061,공업역학 동역학(제13판)_Chapter 14 질점의 운동역학_일과 에너지,controlled_backfill,,mixed,Academic_Paper,5850755,M,44811,7.843,mixed,lo,false,mixed_hangul_latin,
+3782,"Safety and Health for Engineers_02_5 Local, International, and Voluntary Laws, Regulations, and Standards",controlled_backfill,,mixed,study_note,4822580,M,46808,9.939,mixed,lo,false,latin_dominant,
+5179,Hydrogen-Embrittlement,controlled_backfill,,mixed,Reference,430502,S,9400,22.359,mixed,lo,false,mixed_hangul_latin,
+5133,압력용기 핸드북_기타,controlled_backfill,,mixed,Reference,1813221,M,51754,29.228,mixed,lo,false,mixed_hangul_latin,
+3757,Industrial Safety and Health Management(7-ED)_2 Development of the safety and Health Function,controlled_backfill,,born_digital,study_note,2849250,M,139372,50.089,born-digital,lo,false,latin_dominant,
+3758,Industrial Safety and Health Management(7-ED)_3 Concepts of Hazard Avoidance,controlled_backfill,,born_digital,study_note,1506926,M,106008,72.036,born-digital,lo,false,latin_dominant,
+5163,국내 지속가능경영보고서의 노동인권 분야에 대한 실태 분석,controlled_backfill,,born_digital,study_note,640161,S,54423,87.055,born-digital,lo,false,unknown,
+5167,우리나라 기업의 환경정보 공시 현황과 제도적 개선방안,controlled_backfill,,born_digital,Academic_Paper,718354,S,44395,63.284,born-digital,lo,false,hangul_dominant,
+5154,국내 금속가공 중소기업의 스마트팩토리 활용 정도에 대한 실증적 연구,controlled_backfill,,born_digital,Academic_Paper,257730,S,23896,94.942,born-digital,lo,false,mixed_hangul_latin,
+5155,스마트 팩토리의 전략적 활용 연구,controlled_backfill,,born_digital,Academic_Paper,693276,S,75466,111.467,born-digital,lo,false,mixed_hangul_latin,
+5137,Pressure Vessel Design Manual_01 General Topics,controlled_backfill,,born_digital,Manual,2421078,M,123902,52.405,born-digital,lo,false,latin_dominant,
+5211,PTB-4-2013_00_Foreword,controlled_backfill,,born_digital,Standard,416742,S,29162,71.656,born-digital,lo,false,unknown,
+5178,Hydrogen_Piping_and_Pipelines_ASME_Code,controlled_backfill,,born_digital,Standard,3131091,M,1162861,380.305,born-digital,lo,false,unknown,
+5168,TCoYourPaperlessOffice-4.0,controlled_backfill,,born_digital,Manual,1526520,M,253397,169.98,born-digital,lo,false,unknown,
+3765,Industrial Safety and Health Management(7-ED)_10 Environmental Control and Noise,controlled_backfill,,born_digital,study_note,1408586,M,80003,58.16,born-digital,lo,false,latin_dominant,
+3769,Industrial Safety and Health Management(7-ED)_14 Materials Handling and Storage,controlled_backfill,,born_digital,study_note,1785536,M,91902,52.706,born-digital,lo,false,latin_dominant,
+5274,황현필의 진보를 위한 역사_12장 대한민국의 정신을 훼손하지 말라,controlled_backfill,,large,Academic_Paper,14996152,L,35398,2.417,scan-likely,lo,true,hangul_dominant,
+5180,ASME Sec I 2025,controlled_backfill,,large,Standard,14890413,L,1603702,110.285,born-digital,lo,false,unknown,
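
The sample above is described as reproducible from the fixed seed. A seed only pins down the shuffle itself, though: if the candidate rows can arrive in a different order (for example from a query without an explicit ordering), the same seed can pick different documents. The sketch below illustrates that point under stated assumptions; `pick` is a hypothetical helper and the id list is illustrative, not the script's actual candidate pool.

```python
import random

SAMPLE_SEED = 20260502  # seed value from the commit


def pick(pool: list[int], n: int, rng: random.Random) -> list[int]:
    """Sort a copy of the pool, shuffle it with the given RNG, and keep the first n ids."""
    avail = sorted(pool)  # fixed starting order, so the result depends only on the seed
    rng.shuffle(avail)
    return avail[:n]


pool = [4798, 4813, 4815, 5127, 5151, 5268]  # illustrative ids only

# Same seed, different input order: sorting first makes the two picks identical.
first = pick(pool, 3, random.Random(SAMPLE_SEED))
second = pick(list(reversed(pool)), 3, random.Random(SAMPLE_SEED))
print(first == second)
```

Committing the resulting CSV, as this change does, sidesteps the issue for the current round; sorting candidates by id before shuffling would be one way to keep later re-runs stable as well.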
@@ -4,26 +4,35 @@
 * The Phase 2 full-backfill decision follows after the 1D results are reviewed.
 * 1B.5 (image extraction / _meta preservation) is a separate PR, outside this script's scope.

-Stratification:
-  ai_domain × file_size_bucket  (documents has no page_count column → file_size as proxy)
-  Secondary: mix of small and large file_size within each cell.
-  Excludes document_type ∈ SKIP_DOC_TYPES (same as marker_worker's SKIP rule).
+Stratification (Round 2 refined, plan: ~/.claude/plans/stratified-mingling-otter.md):
+  4 axes: doc_type × file_size_band × text_density_band × handwritten_hint
+  + sample_source ∈ {existing_success, controlled_backfill}
+    - existing_success: 5 documents (anchor 1 + calibration 4)
+    - controlled_backfill: 25 documents (handwritten 3 / scan_likely 2~3 / mixed 5 / born_digital 12 / large 2)
+  + forced_include: doc 4809 (Note_240805_용접교육 필기), known bad handwritten anchor.
+  Excludes document_type ∈ SKIP_DOC_TYPES (mirrors the marker_worker rule).

 Subcommands:
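-  select   stratified 30-document dry-run + save JSON
-  enqueue  enqueue the select result into the markdown queue (avoids uq_queue_active violations)
-  report   print md_status distribution, failure reasons, quality metrics, and UI review URLs
+  select         stratified 30-document dry-run + save CSV+JSON
+  enqueue        enqueue the select result into the markdown queue (avoids uq_queue_active violations)
+  report         print md_status distribution, failure reasons, quality metrics, and UI review URLs
+  eval_template  print the pilot_1d_eval.csv skeleton (the user fills in the 5 rubric axis scores)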

 Run (on the GPU server):
-  docker compose exec fastapi python /app/scripts/phase1d_pilot.py select
+  docker compose exec fastapi python /app/scripts/phase1d_pilot.py select \
+    --csv /app/evals/markdown/pilot_1d_sample.csv
  docker compose exec fastapi python /app/scripts/phase1d_pilot.py enqueue --yes
  docker compose exec fastapi python /app/scripts/phase1d_pilot.py report
+  docker compose exec fastapi python /app/scripts/phase1d_pilot.py eval_template \
+    --csv /app/evals/markdown/pilot_1d_eval.csv
 """

 import argparse
 import asyncio
+import csv
 import json
 import os
+import random
 import re
 import sys
 from collections import Counter, defaultdict
@@ -49,11 +58,43 @@ SIZE_BUCKETS = [
    ("large",  5 * 1024 * 1024, 10**12),              # > 5MB
 ]

+# file_size_band for the 4-axis stratification (Round 2 plan)
+FILE_SIZE_BAND_THRESHOLDS = [
+    ("S", 0, 1 * 1024 * 1024),                        # < 1MB
+    ("M", 1 * 1024 * 1024, 10 * 1024 * 1024),         # 1~10MB
+    ("L", 10 * 1024 * 1024, 10**12),                  # > 10MB
+]
+
+# text_density (chars per KB of file): the single cleanest proxy for born-digital vs scan.
+# Verified at both ends: 0.17 (handwritten 4809) ↔ 94 (born-digital 3759).
+TEXT_DENSITY_BANDS = [
+    ("scan-likely",  0.0,  5.0),
+    ("mixed",        5.0,  50.0),
+    ("born-digital", 50.0, float("inf")),
+]
+
+HANDWRITTEN_HINT_REGEX = re.compile(r"필기|노트|handwritten|scan|스캔|note", re.IGNORECASE)
+
+# Forced include: known bad anchor found during the user's visual check.
+# When tuning in the next round, re-convert the same document to judge whether results improved.
+FORCED_INCLUDES: dict[int, str] = {
+    4809: "known_bad_handwritten_anchor",
+}
+
+# Reproducibility seed: keeps a generated sample CSV identical across runs.
+SAMPLE_SEED = 20260502
+
 PILOT_TARGET = 30
+EXISTING_SUCCESS_TARGET = 5
+CONTROLLED_BACKFILL_TARGET = PILOT_TARGET - EXISTING_SUCCESS_TARGET  # 25
+
 DEFAULT_OUT = Path("/tmp/phase1d_pilot.json")
+DEFAULT_CSV = Path("/tmp/phase1d_pilot.csv")
+DEFAULT_EVAL_CSV = Path("/tmp/phase1d_eval.csv")


 def _bucket(file_size: int | None) -> str:
+    """Legacy 3-bucket, kept for compatibility with cmd_report's file_size bucket."""
    if file_size is None:
        return "unknown"
    for name, lo, hi in SIZE_BUCKETS:
@@ -62,6 +103,111 @@ def _bucket(file_size: int | None) -> str:
    return "outlier"


+def _file_size_band(file_size: int | None) -> str:
+    """Round 2 refined band: S / M / L."""
+    if file_size is None:
+        return "unknown"
+    for name, lo, hi in FILE_SIZE_BAND_THRESHOLDS:
+        if lo <= file_size < hi:
+            return name
+    return "L"
+
+
+def _text_density(text_len: int, file_size: int | None) -> float | None:
+    """Chars per KB of file. Returns None when file_size is 0 or None."""
+    if not file_size or file_size <= 0:
+        return None
+    return text_len / (file_size / 1024.0)
+
+
+def _text_density_band(density: float | None) -> str:
+    if density is None:
+        return "unknown"
+    for name, lo, hi in TEXT_DENSITY_BANDS:
+        if lo <= density < hi:
+            return name
+    return "unknown"
+
+
+def _handwritten_hint(title: str | None, file_path: str | None) -> str:
+    """Return 'hi' if title or file_path matches 필기/노트/handwritten/scan/스캔/note (case-insensitive), else 'lo'."""
+    blob = " ".join(filter(None, [title or "", file_path or ""]))
+    return "hi" if HANDWRITTEN_HINT_REGEX.search(blob) else "lo"
+
+
+def _scan_likely(text_len: int, file_size: int | None, density: float | None) -> bool:
+    """text_density < 5 or no extracted_text → likely a scan."""
+    if text_len == 0:
+        return True
+    if density is not None and density < 5.0:
+        return True
+    return False
+
+
+def _script_mix(extracted_text: str | None, sample_chars: int = 10000) -> str:
+    """Label by the Hangul/CJK/Hiragana/Katakana/Latin ratio in the first N characters.
+    One script ≥ 0.7 → '<script>_dominant'
+    Two scripts each ≥ 0.1 → 'mixed_<a>_<b>'
+    Otherwise → 'unknown'.
+    Heavy mojibake/OCR noise skews the ratios, which is itself a signal.
+    """
+    if not extracted_text:
+        return "unknown"
+    sample = extracted_text[:sample_chars]
+    counts = {"hangul": 0, "cjk": 0, "kana": 0, "latin": 0, "other": 0}
+    total = 0
+    for ch in sample:
+        cp = ord(ch)
+        if ch.isspace():
+            continue
+        total += 1
+        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
+            counts["hangul"] += 1
+        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
+            counts["cjk"] += 1
+        elif 0x3040 <= cp <= 0x30FF:
+            counts["kana"] += 1
+        elif (0x0041 <= cp <= 0x005A) or (0x0061 <= cp <= 0x007A) or (0x00C0 <= cp <= 0x024F):
+            counts["latin"] += 1
+        else:
+            counts["other"] += 1
+    if total == 0:
+        return "unknown"
+    ratios = {k: v / total for k, v in counts.items() if k != "other"}
+    primary = sorted(ratios.items(), key=lambda x: -x[1])
+    if primary[0][1] >= 0.7:
+        return f"{primary[0][0]}_dominant"
+    significant = [name for name, r in primary if r >= 0.1]
+    if len(significant) >= 2:
+        return "mixed_" + "_".join(sorted(significant[:2]))
+    return "unknown"
+
+
+def _page_count_estimate(md_extraction_quality: dict | None) -> int | None:
+    """Use the marker quality.metrics.source_page_count of existing_success rows when present.
+    controlled_backfill stays NULL; marker fills it in via PyMuPDF at conversion time.
+    Auxiliary value for interpreting the evaluation afterwards."""
+    if not md_extraction_quality or not isinstance(md_extraction_quality, dict):
+        return None
+    metrics = md_extraction_quality.get("metrics")
+    if not isinstance(metrics, dict):
+        return None
+    pc = metrics.get("source_page_count")
+    if isinstance(pc, int):
+        return pc
+    return None
+
+
+# Sample CSV column order, same as plan §"Sample CSV 컬럼".
+CSV_COLUMNS = [
+    "doc_id", "title", "sample_source", "forced_include_reason", "bucket_label",
+    "doc_type", "file_size", "file_size_band",
+    "text_len", "text_density", "text_density_band",
+    "handwritten_hint", "scan_likely", "script_mix",
+    "page_count_estimate",
+]
+
+
 def _build_engine() -> "AsyncEngine":
    db_url = os.environ["DATABASE_URL"]
    return create_async_engine(db_url, pool_pre_ping=True)
@@ -69,25 +215,200 @@ def _build_engine() -> "AsyncEngine":

 # ─── select ───

-async def cmd_select(out_path: Path) -> None:
+def _enrich_row(r) -> dict:
+    """Document row → sample dict (compute proxies)."""
+    text_len = len(r.extracted_text or "")
+    density = _text_density(text_len, r.file_size)
+    return {
+        "doc_id": r.id,
+        "title": r.title,
+        "doc_type": r.document_type,
+        "file_size": r.file_size or 0,
+        "file_size_band": _file_size_band(r.file_size),
+        "text_len": text_len,
+        "text_density": round(density, 3) if density is not None else None,
+        "text_density_band": _text_density_band(density),
+        "handwritten_hint": _handwritten_hint(r.title, r.file_path),
+        "scan_likely": _scan_likely(text_len, r.file_size, density),
+        "script_mix": _script_mix(r.extracted_text),
+        "page_count_estimate": _page_count_estimate(r.md_extraction_quality),
+        "_file_path": r.file_path,
+        "_ai_domain": r.ai_domain,
+    }
+
+
+def _allocate_controlled_backfill(candidates: list[dict], rng: random.Random) -> list[dict]:
+    """controlled_backfill, 25 documents: 4-axis stratified with deliberate over-sampling.
+
+    plan §"Sample budget":
+      handwritten 3 / scan_likely 2~3 / mixed 5 / born_digital 12 / large 2
+    """
+    selected: list[dict] = []
+    used: set[int] = set()
+
+    def take(pool: list[dict], n: int, label: str) -> None:
+        avail = [c for c in pool if c["doc_id"] not in used]
+        rng.shuffle(avail)
+        for c in avail[:n]:
+            c["bucket_label"] = label
+            selected.append(c)
+            used.add(c["doc_id"])
+
+    # 1. handwritten_hint=hi (take all 3; 1.1% of the population → 10% of the sample)
+    take([c for c in candidates if c["handwritten_hint"] == "hi"], 3, "handwritten")
+
+    # 2. scan-likely (2~3 documents after dedupe against handwritten)
+    take([c for c in candidates if c["text_density_band"] == "scan-likely"], 3, "scan_likely")
+
+    # 3. mixed (5 documents)
+    take([c for c in candidates if c["text_density_band"] == "mixed"], 5, "mixed")
+
+    # 4. born-digital × varied doc_type (12 documents). doc_type distribution guide:
+    #    study_note 3 / Academic_Paper 3 / Reference 2 / Note 1 / (Manual+Standard+Specification) 2 / NULL 1
+    born_digital = [c for c in candidates if c["text_density_band"] == "born-digital" and c["doc_id"] not in used]
+    by_type: dict[str | None, list[dict]] = defaultdict(list)
+    for c in born_digital:
+        by_type[c["doc_type"]].append(c)
+    for pool in by_type.values():
+        rng.shuffle(pool)
+
+    target_quota = [
+        ("study_note", 3),
+        ("Academic_Paper", 3),
+        ("Reference", 2),
+        ("Note", 1),
+        ("Manual", 1),
+        ("Standard", 1),
+        (None, 1),  # NULL
+    ]
+    born_added = 0
+    for dt, n in target_quota:
+        for c in by_type.get(dt, [])[:n]:
+            if c["doc_id"] in used:
+                continue
+            c["bucket_label"] = "born_digital"
+            selected.append(c)
+            used.add(c["doc_id"])
+            born_added += 1
+            if born_added >= 12:
+                break
+        if born_added >= 12:
+            break
+    # If short of 12, fill with the remaining born-digital documents (any doc_type)
+    if born_added < 12:
+        leftover = [c for c in born_digital if c["doc_id"] not in used]
+        rng.shuffle(leftover)
+        for c in leftover[: 12 - born_added]:
+            c["bucket_label"] = "born_digital"
+            selected.append(c)
+            used.add(c["doc_id"])
+
+    # 5. file_size band=L (≥10MB): top up with documents not already in the 4 buckets above (target 2)
+    large_pool = [c for c in candidates if c["file_size_band"] == "L" and c["doc_id"] not in used]
+    take(large_pool, 2, "large")
+
+    # Target is 30 - existing_success(5) = 25. If short, top up from the general pending pool.
+    if len(selected) < CONTROLLED_BACKFILL_TARGET:
+        leftover = [c for c in candidates if c["doc_id"] not in used]
+        rng.shuffle(leftover)
+        for c in leftover[: CONTROLLED_BACKFILL_TARGET - len(selected)]:
+            c["bucket_label"] = "filler"
+            selected.append(c)
+            used.add(c["doc_id"])
+    return selected[:CONTROLLED_BACKFILL_TARGET]
+
+
+def _allocate_existing_success(rows: list, rng: random.Random) -> list[dict]:
+    """existing_success, 5 documents = 1 forced anchor + 4 calibration (balanced across text_density bands)."""
+    enriched = [_enrich_row(r) for r in rows]
+    by_id = {c["doc_id"]: c for c in enriched}
+    selected: list[dict] = []
+    used: set[int] = set()
+
+    # forced_include
+    for fid, reason in FORCED_INCLUDES.items():
+        if fid in by_id:
+            c = by_id[fid]
+            c["bucket_label"] = "existing_anchor"
+            c["forced_include_reason"] = reason
+            selected.append(c)
+            used.add(fid)
+        else:
+            print(f"[warn] forced_include doc_id={fid} 가 existing_success 후보에 없음 — skip")
+
+    # 4 calibration documents, balanced across text_density bands
+    remaining = [c for c in enriched if c["doc_id"] not in used]
+    quotas: list[tuple[str, int]] = [("scan-likely", 1), ("mixed", 1), ("born-digital", 2)]
+    for band, n in quotas:
+        pool = [c for c in remaining if c["text_density_band"] == band and c["doc_id"] not in used]
+        rng.shuffle(pool)
+        for c in pool[:n]:
+            c["bucket_label"] = "existing_calibration"
+            selected.append(c)
+            used.add(c["doc_id"])
+    if len(selected) < EXISTING_SUCCESS_TARGET:
+        leftover = [c for c in remaining if c["doc_id"] not in used]
+        rng.shuffle(leftover)
+        for c in leftover[: EXISTING_SUCCESS_TARGET - len(selected)]:
+            c["bucket_label"] = "existing_calibration"
+            selected.append(c)
+            used.add(c["doc_id"])
+    return selected[:EXISTING_SUCCESS_TARGET]
+
+
+def _print_distribution(samples: list[dict]) -> None:
+    print(f"\n선정 {len(samples)}건 (목표 {PILOT_TARGET}):\n")
+    print(f"{'ID':>6} {'KB':>8}  {'src':<22} {'bucket':<22} {'doctype':<22} {'density':>8}  title")
+    print("-" * 160)
+    for s in sorted(samples, key=lambda x: (x["sample_source"], x["bucket_label"], x["doc_id"])):
+        d = s["text_density"]
+        density_s = f"{d:.2f}" if d is not None else "-"
+        print(
+            f"{s['doc_id']:>6} {(s['file_size'] // 1024):>8}  "
+            f"{s['sample_source'][:22]:<22} "
+            f"{s['bucket_label'][:22]:<22} "
+            f"{(s['doc_type'] or '-')[:22]:<22} "
+            f"{density_s:>8}  "
+            f"{(s['title'] or '-')[:60]}"
+        )
+
+    by_axis = lambda key: Counter(s[key] for s in samples)
+    print("\n분포:")
+    for axis in ("sample_source", "bucket_label", "doc_type", "file_size_band", "text_density_band", "handwritten_hint", "script_mix"):
+        c = by_axis(axis)
+        line = ", ".join(f"{k}={v}" for k, v in sorted(c.items(), key=lambda x: -x[1]))
+        print(f"  {axis}: {line}")
+
+
+async def cmd_select(out_path: Path, csv_path: Path | None) -> None:
    engine = _build_engine()
    Session = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)

    from models.document import Document  # type: ignore

    async with Session() as session:
-        rows = (
+        # A. existing_success: PDFs that marker_worker has already converted successfully (anchor + calibration).
+        es_rows = (
            await session.execute(
                select(
-                    Document.id,
-                    Document.title,
-                    Document.ai_domain,
-                    Document.document_type,
-                    Document.file_size,
-                    Document.file_path,
-                    Document.file_format,
-                    Document.category,
-                    Document.md_status,
+                    Document.id, Document.title, Document.ai_domain, Document.document_type,
+                    Document.file_size, Document.file_path, Document.file_format,
+                    Document.extracted_text, Document.md_extraction_quality,
+                ).where(
+                    Document.deleted_at.is_(None),
+                    Document.file_format == "pdf",
+                    Document.md_status == "success",
+                )
+            )
+        ).all()
+
+        # B. controlled_backfill: pending PDF documents, excluding SKIP_DOC_TYPES.
+        cb_rows = (
+            await session.execute(
+                select(
+                    Document.id, Document.title, Document.ai_domain, Document.document_type,
+                    Document.file_size, Document.file_path, Document.file_format,
+                    Document.extracted_text, Document.md_extraction_quality,
                ).where(
                    Document.deleted_at.is_(None),
                    Document.file_format == "pdf",
@@ -97,79 +418,67 @@ async def cmd_select(out_path: Path) -> None:
            )
        ).all()

-    candidates = [
-        r for r in rows
-        if not (r.document_type and r.document_type in SKIP_DOC_TYPES)
+    cb_candidates_raw = [
+        r for r in cb_rows if not (r.document_type and r.document_type in SKIP_DOC_TYPES)
    ]
-    print(f"후보 (필터 후): {len(candidates)}건  (전체 pending PDF document: {len(rows)}건)")
+    print(
+        f"existing_success PDFs: {len(es_rows)}건  /  "
+        f"controlled_backfill 후보 (SKIP 제외): {len(cb_candidates_raw)}건  "
+        f"(전체 pending PDF document: {len(cb_rows)}건)"
+    )

-    grouped: dict[tuple[str, str], list] = defaultdict(list)
-    for r in candidates:
-        domain = r.ai_domain or "unknown"
-        grouped[(domain, _bucket(r.file_size))].append(r)
+    rng = random.Random(SAMPLE_SEED)

-    cells = sorted(grouped.keys())
-    base_per_cell = max(1, PILOT_TARGET // max(1, len(cells)))
+    existing_samples = _allocate_existing_success(es_rows, rng)
+    for s in existing_samples:
+        s["sample_source"] = "existing_success"
+        s.setdefault("forced_include_reason", "")

-    sample: list = []
-    leftover_cells: list[tuple[tuple[str, str], list]] = []
-    for cell in cells:
-        items = sorted(grouped[cell], key=lambda x: (x.file_size or 0, x.id))
-        take = min(base_per_cell, len(items))
-        if take >= 2:
-            half = take // 2
-            sample.extend(items[:half])             # smaller end
-            sample.extend(items[-(take - half):])   # larger end
-        else:
-            sample.extend(items[:take])
-        if len(items) > take:
-            leftover_cells.append((cell, items[take:]))
+    cb_candidates = [_enrich_row(r) for r in cb_candidates_raw]
+    backfill_samples = _allocate_controlled_backfill(cb_candidates, rng)
+    for s in backfill_samples:
+        s["sample_source"] = "controlled_backfill"
+        s.setdefault("forced_include_reason", "")

-    leftover_cells.sort(key=lambda x: -len(x[1]))
-    li = 0
-    while len(sample) < PILOT_TARGET and li < len(leftover_cells):
-        _, items = leftover_cells[li]
-        if items:
-            sample.append(items.pop(0))
-        else:
-            li += 1
-    sample = sample[:PILOT_TARGET]
+    samples = existing_samples + backfill_samples

-    print(f"\nSelected {len(sample)} (target {PILOT_TARGET}):\n")
-    print(f"{'ID':>6} {'KB':>8}  {'domain':<22} {'doctype':<22} title")
-    print("-" * 130)
-    for r in sample:
-        print(
-            f"{r.id:>6} {((r.file_size or 0) // 1024):>8}  "
-            f"{(r.ai_domain or '-')[:22]:<22} "
-            f"{(r.document_type or '-')[:22]:<22} "
-            f"{(r.title or '-')[:60]}"
-        )
+    _print_distribution(samples)

-    cell_counts: Counter = Counter()
-    for r in sample:
-        cell_counts[((r.ai_domain or "unknown"), _bucket(r.file_size))] += 1
-    print("\nDistribution (ai_domain × file_size_bucket):")
-    for (d, b), c in sorted(cell_counts.items()):
-        print(f"  {d:<22} × {b:<8} : {c}")
+    # Save CSV (for user review and commit)
+    if csv_path is not None:
+        csv_path.parent.mkdir(parents=True, exist_ok=True)
+        with csv_path.open("w", newline="", encoding="utf-8") as f:
+            writer = csv.DictWriter(f, fieldnames=CSV_COLUMNS, extrasaction="ignore")
+            writer.writeheader()
+            for s in samples:
+                row = {col: s.get(col) for col in CSV_COLUMNS}
+                # bool → str (for CSV readability)
+                row["scan_likely"] = "true" if s.get("scan_likely") else "false"
+                writer.writerow(row)
+        print(f"\nCSV saved: {csv_path}")

+    # Save JSON for cmd_enqueue / cmd_report compatibility (existing schema kept)
    payload = {
        "target": PILOT_TARGET,
-        "ids": [r.id for r in sample],
+        "seed": SAMPLE_SEED,
+        "ids": [s["doc_id"] for s in samples],
        "items": [
            {
-                "id": r.id,
-                "title": r.title,
-                "ai_domain": r.ai_domain,
-                "document_type": r.document_type,
-                "file_size": r.file_size,
-                "file_path": r.file_path,
+                "id": s["doc_id"],
+                "title": s["title"],
+                "ai_domain": s.get("_ai_domain"),
+                "document_type": s["doc_type"],
+                "file_size": s["file_size"],
+                "file_path": s.get("_file_path"),
+                "sample_source": s["sample_source"],
+                "bucket_label": s["bucket_label"],
+                "forced_include_reason": s.get("forced_include_reason", ""),
            }
-            for r in sample
+            for s in samples
        ],
    }
    out_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2))
-    print(f"\nSaved: {out_path}")
+    print(f"JSON saved: {out_path}")
    await engine.dispose()


@@ -383,24 +692,71 @@ async def cmd_report(in_path: Path) -> None:
    await engine.dispose()


+EVAL_TEMPLATE_COLUMNS = [
+    "doc_id", "title", "sample_source", "bucket_label",
+    # Five rubric axes, each scored 1-5 (filled in by the user). See plan §"Quality evaluation rubric".
+    "text_accuracy", "structure", "noise_rate", "multi_script", "completeness",
+    "overall_pass",  # boolean (true/false): intuitive judgment of "is it usable for search/reference?"
+    "notes",         # free-form text
+]
+
+
+def cmd_eval_template(in_path: Path, csv_out: Path) -> None:
+    """Read the JSON produced by select and write an empty evaluation CSV skeleton for the user to fill in."""
+    payload = json.loads(in_path.read_text())
+    items = payload.get("items", [])
+    csv_out.parent.mkdir(parents=True, exist_ok=True)
+    with csv_out.open("w", newline="", encoding="utf-8") as f:
+        writer = csv.DictWriter(f, fieldnames=EVAL_TEMPLATE_COLUMNS, extrasaction="ignore")
+        writer.writeheader()
+        for it in items:
+            writer.writerow({
+                "doc_id": it["id"],
+                "title": it.get("title", ""),
+                "sample_source": it.get("sample_source", ""),
+                "bucket_label": it.get("bucket_label", ""),
+                "text_accuracy": "",
+                "structure": "",
+                "noise_rate": "",
+                "multi_script": "",
+                "completeness": "",
+                "overall_pass": "",
+                "notes": "",
+            })
+    print(f"eval template saved: {csv_out}  ({len(items)} rows)")
+    print("rubric: score 1-5 (text_accuracy / structure / noise_rate / multi_script / completeness)")
+    print("        overall_pass = true/false (intuitive judgment: 'is it usable for search/reference?')")
+    print("Evaluation guide: see evals/markdown/README.md")
+
+
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    sub = parser.add_subparsers(dest="cmd", required=True)
-    p_sel = sub.add_parser("select", help="stratified 30-document dry-run")
-    p_sel.add_argument("--out", type=Path, default=DEFAULT_OUT)
+
+    p_sel = sub.add_parser("select", help="stratified 30-document dry-run + save CSV+JSON")
+    p_sel.add_argument("--out", type=Path, default=DEFAULT_OUT, help="JSON (for cmd_enqueue/report compatibility)")
+    p_sel.add_argument("--csv", type=Path, default=DEFAULT_CSV, help="CSV (for user review and commit)")
+
    p_enq = sub.add_parser("enqueue", help="enqueue into the markdown queue")
    p_enq.add_argument("--in", dest="in_path", type=Path, default=DEFAULT_OUT)
    p_enq.add_argument("--yes", action="store_true")
+
    p_rep = sub.add_parser("report", help="aggregate results")
    p_rep.add_argument("--in", dest="in_path", type=Path, default=DEFAULT_OUT)

+    p_evt = sub.add_parser("eval_template", help="write the evaluation CSV skeleton (user fills in rubric scores)")
+    p_evt.add_argument("--in", dest="in_path", type=Path, default=DEFAULT_OUT)
+    p_evt.add_argument("--csv", type=Path, default=DEFAULT_EVAL_CSV)
+
    args = parser.parse_args()
    if args.cmd == "select":
-        asyncio.run(cmd_select(args.out))
+        asyncio.run(cmd_select(args.out, args.csv))
    elif args.cmd == "enqueue":
        asyncio.run(cmd_enqueue(args.in_path, args.yes))
    elif args.cmd == "report":
        asyncio.run(cmd_report(args.in_path))
+    elif args.cmd == "eval_template":
+        cmd_eval_template(args.in_path, args.csv)


 if __name__ == "__main__":