Hyungi Ahn b09687d41d feat(scripts): Phase 1D Round 2 — controlled backfill stratification
Augment the existing phase1d_pilot.py (simple ai_domain × file_size
3-bucket stratification) with the 4 axes, the sample_source split, and
forced_include from the plan ~/.claude/plans/stratified-mingling-otter.md.

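Limitations of Round 1 (ai_domain × file_size 3-bucket):
  It only reflects the natural distribution of pending PDFs, so known
  weak spots (handwriting / scans / mixed Korean-Chinese-Japanese OCR)
  never enter the sample. In the 1C visual check, doc 4809
  (Note_240805_용접교육 필기) showed exactly that pattern; left to
  natural selection, the next round risks missing the same case again.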
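
To put a number on that risk, here is a rough sketch. It assumes the
sample is drawn uniformly at random and uses the 1.1% handwritten share
of the population reported in this commit. Round 1 was actually
stratified by ai_domain × file_size, so treat this as an approximation
only; the function name is illustrative.

```python
def prob_sample_misses(population_share: float, sample_size: int) -> float:
    """Probability that `sample_size` independent uniform draws contain
    no document from a subgroup with the given population share."""
    return (1.0 - population_share) ** sample_size


# 1.1% handwritten share (figure from this commit), 30-document sample.
p_miss = prob_sample_misses(0.011, 30)
print(f"P(no handwritten doc in sample) = {p_miss:.2f}")  # 0.72
```

Under these assumptions, roughly 7 out of 10 natural samples of 30
would contain no handwritten document at all.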

Round 2 design:
  - 4-axis stratification: doc_type × file_size_band × text_density_band
    × handwritten_hint
  - sample_source ∈ {existing_success(5), controlled_backfill(25)}
  - forced_include doc 4809: the known-bad anchor. After the next tuning
    pass or an alternative converter is introduced, the same document
    can be reconverted and compared 1:1.
  - text_density = LENGTH(extracted_text) / (file_size / 1024), in
    chars/KB. The cleanest single proxy. Validated at both extremes:
    0.17 (handwritten 4809) ↔ 94 (born-digital 3759).
  - script_mix proxy: ratio of Hangul/CJK/Hiragana/Katakana/Latin
    Unicode blocks → korean_dominant / mixed_korean_cjk /
    mixed_korean_latin / cjk_dominant / latin_dominant / unknown.
  - page_count_estimate: existing_success uses
    md_extraction_quality.metrics.source_page_count. controlled_backfill
    is NULL (marker re-reads the PDF with PyMuPDF anyway).
  - Seed fixed at SAMPLE_SEED=20260502 to guarantee reproducibility.
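
The selection idea in the list above (forced anchor first, then a
seeded draw spread across strata) could look roughly like the sketch
below. This is not the logic in phase1d_pilot.py: the function name,
the round-robin allocation, and the dict-based row shape are all
assumptions made for illustration. Only the seed value and the axis
names come from this commit.

```python
import random
from collections import defaultdict

SAMPLE_SEED = 20260502  # seed value taken from this commit


def stratified_sample(candidates, strata_keys, quota,
                      forced_ids=(), seed=SAMPLE_SEED):
    """Pick up to `quota` docs: forced ids first, then round-robin
    across strata. (Hypothetical helper, not the script's own code.)

    `candidates` is a list of dicts carrying "doc_id" plus the
    stratification columns named in `strata_keys`.
    """
    rng = random.Random(seed)  # local RNG, global state untouched
    by_id = {c["doc_id"]: c for c in candidates}
    picked = [by_id[i] for i in forced_ids if i in by_id]
    chosen = {c["doc_id"] for c in picked}

    # Group the remaining candidates by their stratum tuple.
    strata = defaultdict(list)
    for c in candidates:
        if c["doc_id"] not in chosen:
            strata[tuple(c[k] for k in strata_keys)].append(c)

    # Sort strata and their members before shuffling so the result
    # depends only on the seed, not on input order.
    pools = []
    for key in sorted(strata):
        pool = sorted(strata[key], key=lambda c: c["doc_id"])
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin: one doc per stratum per pass until the quota is met.
    while len(picked) < quota and any(pools):
        for pool in pools:
            if pool and len(picked) < quota:
                picked.append(pool.pop())
    return picked
```

In this sketch the quota includes the forced documents, and the
function would be called once per sample_source (for example with a
quota of 5 and forced_ids=[4809] for existing_success, and a quota of
25 for controlled_backfill).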

Sample distribution (measured 2026-05-02):
  bucket_label: born_digital=12, mixed=5, existing_calibration=4,
                handwritten=3, scan_likely=3, large=2, existing_anchor=1
  doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4,
            Reference=3, Manual=3, Drawing=1, Report=1
  file_size_band: M=14, S=12, L=4
  text_density_band: born-digital=15, scan-likely=9, mixed=6
  handwritten_hint: lo=26, hi=4 (about 13x over-sampled relative to the
                    1.1% population share)
  forced anchor doc 4809 = density 0.17 (the document from the user's
  visual check)
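
The text_density proxy behind these bands is simple enough to check by
hand. A minimal sketch of the formula stated in this commit, with a
hypothetical function name; the band cut-offs are not given in the
commit message, so the sketch stops at the raw density.

```python
def text_density(text_len: int, file_size_bytes: int) -> float:
    """Extracted characters per KiB of source file."""
    if file_size_bytes <= 0:
        raise ValueError("file_size_bytes must be positive")
    return text_len / (file_size_bytes / 1024)


# (text_len, file_size) pairs copied from pilot_1d_sample.csv
print(round(text_density(177, 1089182), 3))   # doc 4809 -> 0.166
print(round(text_density(23896, 257730), 3))  # doc 5154 -> 94.942
```

Both values match the text_density column in pilot_1d_sample.csv.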

New subcommand:
  eval_template: skeleton for pilot_1d_eval.csv (5 rubric axes scored
  1 to 5 + overall_pass + notes). The user fills in the scores while
  toggling between MarkdownDoc and the PDF.
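
A skeleton of this kind could be produced roughly as follows. The
rubric column names (rubric_1 .. rubric_5), the function name, and the
inclusion of a title column are placeholders: the commit only states
that there are 5 rubric axes plus overall_pass and notes.

```python
import csv
import io

# Hypothetical column names: the commit only says "5 rubric axes".
RUBRIC_COLUMNS = [f"rubric_{i}" for i in range(1, 6)]
EVAL_COLUMNS = ["doc_id", "title", *RUBRIC_COLUMNS, "overall_pass", "notes"]


def write_eval_template(sample_rows, out):
    """Write a header plus one empty evaluation row per sampled doc."""
    writer = csv.DictWriter(out, fieldnames=EVAL_COLUMNS, restval="")
    writer.writeheader()
    for row in sample_rows:
        # Score, pass, and notes cells stay blank for the reviewer.
        writer.writerow({"doc_id": row["doc_id"], "title": row["title"]})


buf = io.StringIO()
write_eval_template(
    [{"doc_id": "4809", "title": "Note_240805_용접교육 필기"}], buf
)
print(buf.getvalue())
```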

The existing cmd_enqueue (snapshot/backup/dedup) and cmd_report (quality
metrics) are kept.

Deliverables:
  scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include +
    eval_template subcommand. CSV+JSON dual output.
  evals/markdown/README.md: rubric + decision matrix + workflow guide.
  evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (seeded result,
    kept for reproducibility).
  evals/markdown/pilot_1d_eval.csv: empty skeleton (filled in after the
    user's evaluation).

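Execution boundary:
  Steps 1-3 (selection / template / dry-run) are completed by this PR.
  Step 4 (--yes enqueue, actually pushing the 30 docs into the markdown
  queue) runs separately, once the user approves the timing and inside
  the one-off nightly sweep window (23:00-03:00 KST).
  marker-service BATCH_SIZE=1; 30 docs at an average of 5 min/doc ≈ 2.5h.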
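
The time budget above can be checked with a few lines. This sketch
assumes strictly serial processing (consistent with BATCH_SIZE=1) and
takes the 5 min/doc average from this commit at face value; the helper
name is made up.

```python
from datetime import datetime, timedelta


def fits_window(n_docs, minutes_per_doc, start="23:00", end="03:00"):
    """Return (estimated duration, window length, fits?) for a serial run."""
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end, "%H:%M")
    if t1 <= t0:  # window crosses midnight
        t1 += timedelta(days=1)
    window = t1 - t0
    needed = timedelta(minutes=n_docs * minutes_per_doc)
    return needed, window, needed <= window


needed, window, ok = fits_window(30, 5)
print(needed, window, ok)  # 2:30:00 4:00:00 True
```

That leaves about 1.5h of slack in the window if the average holds.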

Verify:
  Ran select in the fastapi container on the GPU server → 30-row sample
  CSV generated. Confirmed that the eval_template subcommand works.
  Confirmed that the enqueue dry-run prints the 30 doc_ids + snapshot
  and then takes the user-cancel branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:15:09 +09:00


doc_id,title,sample_source,forced_include_reason,bucket_label,doc_type,file_size,file_size_band,text_len,text_density,text_density_band,handwritten_hint,scan_likely,script_mix,page_count_estimate
4809,Note_240805_용접교육 필기,existing_success,known_bad_handwritten_anchor,existing_anchor,Note,1089182,M,177,0.166,scan-likely,hi,true,latin_dominant,
5248,작업자 재난안전사고 예방을 위한 위험성평가 기법 연구,existing_success,,existing_calibration,Academic_Paper,942262,S,3191,3.468,scan-likely,lo,true,unknown,
4068,공업역학 동역학(제13판)_Chapter 21 3차원 강체 운동역학,existing_success,,existing_calibration,Academic_Paper,5429661,M,45253,8.534,mixed,lo,false,mixed_hangul_latin,
5189,VIII-1_08-UB,existing_success,,existing_calibration,Standard,140322,S,40457,295.235,born-digital,lo,false,unknown,
5141,Structural Analysiss and Design of Process Equipment_00_Contents,existing_success,,existing_calibration,Reference,520220,S,31883,62.758,born-digital,lo,false,latin_dominant,
4815,Note_240830_소음진동교육 필기,controlled_backfill,,handwritten,Drawing,12659094,L,3524,0.285,scan-likely,hi,true,unknown,
4798,Note_240528_다이아프람워크숍,controlled_backfill,,handwritten,Note,236840,S,1030,4.453,scan-likely,hi,true,hangul_dominant,
4813,Note_240827_필기,controlled_backfill,,handwritten,Note,710770,S,43,0.062,scan-likely,hi,true,unknown,
5151,THE PIPE FABRICATORS BLUE BOOK,controlled_backfill,,scan_likely,Manual,40063084,L,136448,3.488,scan-likely,lo,true,unknown,
5268,황현필의 진보를 위한 역사_6장 제주4-3사건의 왜국을 멈추라,controlled_backfill,,scan_likely,Note,6188759,M,8746,1.447,scan-likely,lo,true,hangul_dominant,
5127,표준기계설계(KS)_08_핀,controlled_backfill,,scan_likely,Standard,6703655,M,10423,1.592,scan-likely,lo,true,unknown,
8855,2월 26일,controlled_backfill,,mixed,Report,121611,S,2048,17.245,mixed,lo,false,unknown,
4061,공업역학 동역학(제13판)_Chapter 14 질점의 운동역학_일과 에너지,controlled_backfill,,mixed,Academic_Paper,5850755,M,44811,7.843,mixed,lo,false,mixed_hangul_latin,
3782,"Safety and Health for Engineers_02_5 Local, International, and Voluntary Laws, Regulations, and Standards",controlled_backfill,,mixed,study_note,4822580,M,46808,9.939,mixed,lo,false,latin_dominant,
5179,Hydrogen-Embrittlement,controlled_backfill,,mixed,Reference,430502,S,9400,22.359,mixed,lo,false,mixed_hangul_latin,
5133,압력용기 핸드북_기타,controlled_backfill,,mixed,Reference,1813221,M,51754,29.228,mixed,lo,false,mixed_hangul_latin,
3757,Industrial Safety and Health Management(7-ED)_2 Development of the safety and Health Function,controlled_backfill,,born_digital,study_note,2849250,M,139372,50.089,born-digital,lo,false,latin_dominant,
3758,Industrial Safety and Health Management(7-ED)_3 Concepts of Hazard Avoidance,controlled_backfill,,born_digital,study_note,1506926,M,106008,72.036,born-digital,lo,false,latin_dominant,
5163,국내 지속가능경영보고서의 노동인권 분야에 대한 실태 분석,controlled_backfill,,born_digital,study_note,640161,S,54423,87.055,born-digital,lo,false,unknown,
5167,우리나라 기업의 환경정보 공시 현황과 제도적 개선방안,controlled_backfill,,born_digital,Academic_Paper,718354,S,44395,63.284,born-digital,lo,false,hangul_dominant,
5154,국내 금속가공 중소기업의 스마트팩토리 활용 정도에 대한 실증적 연구,controlled_backfill,,born_digital,Academic_Paper,257730,S,23896,94.942,born-digital,lo,false,mixed_hangul_latin,
5155,스마트 팩토리의 전략적 활용 연구,controlled_backfill,,born_digital,Academic_Paper,693276,S,75466,111.467,born-digital,lo,false,mixed_hangul_latin,
5137,Pressure Vessel Design Manual_01 General Topics,controlled_backfill,,born_digital,Manual,2421078,M,123902,52.405,born-digital,lo,false,latin_dominant,
5211,PTB-4-2013_00_Foreword,controlled_backfill,,born_digital,Standard,416742,S,29162,71.656,born-digital,lo,false,unknown,
5178,Hydrogen_Piping_and_Pipelines_ASME_Code,controlled_backfill,,born_digital,Standard,3131091,M,1162861,380.305,born-digital,lo,false,unknown,
5168,TCoYourPaperlessOffice-4.0,controlled_backfill,,born_digital,Manual,1526520,M,253397,169.98,born-digital,lo,false,unknown,
3765,Industrial Safety and Health Management(7-ED)_10 Environmental Control and Noise,controlled_backfill,,born_digital,study_note,1408586,M,80003,58.16,born-digital,lo,false,latin_dominant,
3769,Industrial Safety and Health Management(7-ED)_14 Materials Handling and Storage,controlled_backfill,,born_digital,study_note,1785536,M,91902,52.706,born-digital,lo,false,latin_dominant,
5274,황현필의 진보를 위한 역사_12장 대한민국의 정신을 훼손하지 말라,controlled_backfill,,large,Academic_Paper,14996152,L,35398,2.417,scan-likely,lo,true,hangul_dominant,
5180,ASME Sec I 2025,controlled_backfill,,large,Standard,14890413,L,1603702,110.285,born-digital,lo,false,unknown,