hyungi_document_server/evals/markdown/pilot_1d_sample.csv
Hyungi Ahn b09687d41d feat(scripts): Phase 1D Round 2 — controlled backfill stratification
Augments the existing phase1d_pilot.py (a simple ai_domain × file_size
3-bucket scheme) with the 4 axes, the sample_source split, and
forced_include from the plan ~/.claude/plans/stratified-mingling-otter.md.

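Limitations of Round 1 (ai_domain × file_size 3-bucket):
  It reflects only the natural distribution of pending PDFs, so known
  weak spots (handwriting / scans / mixed Korean-Chinese-Japanese OCR)
  never enter the sample. In the 1C visual check, doc 4809
  (Note_240805_용접교육 필기) actually showed that pattern; if selection
  is left to the natural distribution, the same case risks being missed
  again in the next round.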

Round 2 design:
  - 4-axis stratification: doc_type × file_size_band × text_density_band
    × handwritten_hint
  - sample_source ∈ {existing_success(5), controlled_backfill(25)}
  - forced_include doc 4809: known bad anchor. Enables a 1:1 comparison
    against the re-conversion of the same document after the next tuning
    pass or once an alternative is introduced.
  - text_density = LENGTH(extracted_text) / (file_size / 1024), in chars/KB.
    The cleanest single proxy. Validated at both extremes: 0.17
    (handwritten, doc 4809) ↔ 94 (born-digital, doc 3759).
  - script_mix proxy: ratio of Hangul/CJK/Hiragana/Katakana/Latin Unicode
    blocks → korean_dominant / mixed_korean_cjk / mixed_korean_latin /
    cjk_dominant / latin_dominant / unknown.
  - page_count_estimate: for existing_success, taken from
    md_extraction_quality.metrics.source_page_count. NULL for
    controlled_backfill (marker re-reads the file with PyMuPDF anyway).
  - Seed fixed at SAMPLE_SEED=20260502 for reproducibility.
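
The two proxies in the list above can be sketched as follows. This is a minimal illustration, not the code in phase1d_pilot.py: the band cut-offs (5 and 30 chars/KB), the dominance threshold (0.8), the minimum character count, and all function names are assumptions. The cut-offs were chosen only because they agree with the rows of pilot_1d_sample.csv. The labels follow the list above (korean_*), whereas the sample CSV itself uses hangul_* spellings.

```python
from collections import Counter
from typing import Optional

# Hypothetical cut-offs (chars/KB); the real script may use other values.
SCAN_MAX = 5.0
BORN_DIGITAL_MIN = 30.0


def text_density(text_len: int, file_size: int) -> float:
    """LENGTH(extracted_text) / (file_size / 1024), in chars per KB."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    return text_len / (file_size / 1024)


def density_band(density: float) -> str:
    """Map a density to a text_density_band label (assumed thresholds)."""
    if density < SCAN_MAX:
        return "scan-likely"
    if density < BORN_DIGITAL_MIN:
        return "mixed"
    return "born-digital"


def _block(ch: str) -> Optional[str]:
    """Return a coarse Unicode block name for one character, or None."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def script_mix(text: str, min_chars: int = 20, dominant: float = 0.8) -> str:
    """Classify text by the share of each script (assumed thresholds)."""
    counts = Counter(b for b in map(_block, text) if b)
    total = sum(counts.values())
    if total < min_chars:
        return "unknown"  # too little text to judge
    korean = counts["hangul"] / total
    latin = counts["latin"] / total
    cjk = (counts["cjk"] + counts["hiragana"] + counts["katakana"]) / total
    if korean >= dominant:
        return "korean_dominant"
    if latin >= dominant:
        return "latin_dominant"
    if cjk >= dominant:
        return "cjk_dominant"
    if korean > 0 and cjk >= latin:
        return "mixed_korean_cjk"
    if korean > 0:
        return "mixed_korean_latin"
    return "unknown"
```

For doc 4809 (text_len 177, file_size 1089182), `text_density` gives about 0.166, which lands in the scan-likely band under these assumed cut-offs.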

Sample distribution (measured 2026-05-02):
  bucket_label: born_digital=12, mixed=5, existing_calibration=4,
                handwritten=3, scan_likely=3, large=2, existing_anchor=1
  doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4,
            Reference=3, Manual=3, Drawing=1, Report=1
  file_size_band: M=14, S=12, L=4
  text_density_band: born-digital=15, scan-likely=9, mixed=6
  handwritten_hint: lo=26, hi=4 (13x over-sample vs. 1.1% of the population)
  forced anchor doc 4809 = density 0.17 (the document from the user's
  visual check)
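
A seeded selection that yields a distribution like the one above could look roughly like this. It is a simplified sketch: it assumes per-bucket quotas keyed by bucket_label rather than the full 4-axis stratification, and `stratified_sample`, `quotas`, and `forced_ids` are hypothetical names. Only SAMPLE_SEED comes from the commit message.

```python
import random
from collections import defaultdict

SAMPLE_SEED = 20260502  # fixed seed from the commit message


def stratified_sample(candidates, quotas, forced_ids=(), seed=SAMPLE_SEED):
    """Pick forced anchors first, then fill each bucket up to its quota.

    candidates: list of dicts with 'doc_id' and 'bucket_label' keys.
    quotas:     {bucket_label: number of docs to draw} (hypothetical shape).
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_id = {c["doc_id"]: c for c in candidates}
    picked = [by_id[d] for d in forced_ids]  # anchors are always included
    taken = set(forced_ids)

    # Sort before bucketing so the result does not depend on input order.
    buckets = defaultdict(list)
    for c in sorted(candidates, key=lambda c: c["doc_id"]):
        if c["doc_id"] not in taken:
            buckets[c["bucket_label"]].append(c)

    for label in sorted(quotas):
        pool = buckets.get(label, [])
        k = min(quotas[label], len(pool))  # never ask for more than exists
        picked.extend(rng.sample(pool, k))
    return picked
```

Sorting the candidates and iterating the quota labels in a fixed order matters as much as the seed: without both, the same seed can still produce different samples when the query returns rows in a different order.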

New subcommand:
  eval_template: skeleton for pilot_1d_eval.csv (5-axis rubric scored
  1~5 + overall_pass + notes). The user fills in the scores while
  toggling between the MarkdownDoc and the PDF to compare them.
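
As an illustration of what such a skeleton might contain, the sketch below writes one empty row per sampled document. The rubric axis names are defined in evals/markdown/README.md and are not given here, so `axis_1` to `axis_5` are placeholders, and `write_eval_template` is a hypothetical helper rather than the subcommand's actual code.

```python
import csv
import io

# Placeholder names: the real five rubric axes live in the README.
RUBRIC_AXES = [f"axis_{i}" for i in range(1, 6)]


def write_eval_template(sample_rows, out):
    """Write a header plus one blank scoring row per sampled document."""
    fields = ["doc_id", "title", *RUBRIC_AXES, "overall_pass", "notes"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    for row in sample_rows:
        # Score columns are left empty for the human evaluator to fill in.
        writer.writerow({"doc_id": row["doc_id"], "title": row["title"]})
    return fields


if __name__ == "__main__":
    buf = io.StringIO()
    write_eval_template([{"doc_id": 4809, "title": "Note_240805_용접교육 필기"}], buf)
    print(buf.getvalue())
```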

The existing cmd_enqueue (snapshot/backup/dedup) and cmd_report (quality
metrics) are kept as they are.

Deliverables:
  scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include +
    eval_template subcommand. CSV+JSON dual output.
  evals/markdown/README.md: rubric + decision matrix + workflow guide.
  evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (output of the
    fixed seed, kept for reproducibility).
  evals/markdown/pilot_1d_eval.csv: empty skeleton (to be filled in after
    the user's evaluation).

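Execution boundary:
  Step 1~3 (selection / template / dry-run) = completed in this PR.
  Step 4 (--yes enqueue, actually pushing the 30 docs into the markdown
  queue) = run separately, after the user approves the timing, inside the
  nightly one-off sweep window (23:00~03:00 KST).
  marker-service BATCH_SIZE=1, 30 docs at an average of 5 min/doc ≈ 2.5h.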
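
The runtime estimate and the window check can be written out as a small calculation. This is a sketch assuming strictly sequential processing (BATCH_SIZE=1) and naive datetimes interpreted as KST. `estimated_runtime` and `fits_window` are hypothetical helpers, not part of the script.

```python
from datetime import datetime, time, timedelta


def estimated_runtime(n_docs: int, minutes_per_doc: float) -> timedelta:
    """Total time when documents are converted one after another."""
    return timedelta(minutes=n_docs * minutes_per_doc)


def fits_window(start: datetime, runtime: timedelta,
                window_start: time = time(23, 0),
                window_end: time = time(3, 0)) -> bool:
    """True if [start, start + runtime] lies inside a window crossing midnight."""
    open_dt = datetime.combine(start.date(), window_start)
    if start.time() < window_end:
        # Started after midnight: the window opened on the previous day.
        open_dt -= timedelta(days=1)
    close_dt = datetime.combine(open_dt.date() + timedelta(days=1), window_end)
    return open_dt <= start and start + runtime <= close_dt


if __name__ == "__main__":
    runtime = estimated_runtime(30, 5)  # 30 docs x 5 min = 150 min
    print(runtime, fits_window(datetime(2026, 5, 2, 23, 0), runtime))
```

At 150 minutes the sweep fits the 4-hour window only if it starts by 00:30, which leaves 1.5 hours of slack when it starts at 23:00.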

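Verify:
  Ran select in the fastapi container on the GPU server → 30-row sample
  CSV was generated. Confirmed that the eval_template subcommand works.
  Confirmed with an enqueue dry-run that the 30 doc_ids + snapshot are
  printed and that the user-cancel branch is then taken.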

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:15:09 +09:00

6.0 KiB

doc_id,title,sample_source,forced_include_reason,bucket_label,doc_type,file_size,file_size_band,text_len,text_density,text_density_band,handwritten_hint,scan_likely,script_mix,page_count_estimate
4809,Note_240805_용접교육 필기,existing_success,known_bad_handwritten_anchor,existing_anchor,Note,1089182,M,177,0.166,scan-likely,hi,true,latin_dominant,
5248,작업자 재난안전사고 예방을 위한 위험성평가 기법 연구,existing_success,,existing_calibration,Academic_Paper,942262,S,3191,3.468,scan-likely,lo,true,unknown,
4068,공업역학 동역학(제13판)_Chapter 21 3차원 강체 운동역학,existing_success,,existing_calibration,Academic_Paper,5429661,M,45253,8.534,mixed,lo,false,mixed_hangul_latin,
5189,VIII-1_08-UB,existing_success,,existing_calibration,Standard,140322,S,40457,295.235,born-digital,lo,false,unknown,
5141,Structural Analysiss and Design of Process Equipment_00_Contents,existing_success,,existing_calibration,Reference,520220,S,31883,62.758,born-digital,lo,false,latin_dominant,
4815,Note_240830_소음진동교육 필기,controlled_backfill,,handwritten,Drawing,12659094,L,3524,0.285,scan-likely,hi,true,unknown,
4798,Note_240528_다이아프람워크숍,controlled_backfill,,handwritten,Note,236840,S,1030,4.453,scan-likely,hi,true,hangul_dominant,
4813,Note_240827_필기,controlled_backfill,,handwritten,Note,710770,S,43,0.062,scan-likely,hi,true,unknown,
5151,THE PIPE FABRICATORS BLUE BOOK,controlled_backfill,,scan_likely,Manual,40063084,L,136448,3.488,scan-likely,lo,true,unknown,
5268,황현필의 진보를 위한 역사_6장 제주4-3사건의 왜국을 멈추라,controlled_backfill,,scan_likely,Note,6188759,M,8746,1.447,scan-likely,lo,true,hangul_dominant,
5127,표준기계설계(KS)_08_핀,controlled_backfill,,scan_likely,Standard,6703655,M,10423,1.592,scan-likely,lo,true,unknown,
8855,2월 26일,controlled_backfill,,mixed,Report,121611,S,2048,17.245,mixed,lo,false,unknown,
4061,공업역학 동역학(제13판)_Chapter 14 질점의 운동역학_일과 에너지,controlled_backfill,,mixed,Academic_Paper,5850755,M,44811,7.843,mixed,lo,false,mixed_hangul_latin,
3782,"Safety and Health for Engineers_02_5 Local, International, and Voluntary Laws, Regulations, and Standards",controlled_backfill,,mixed,study_note,4822580,M,46808,9.939,mixed,lo,false,latin_dominant,
5179,Hydrogen-Embrittlement,controlled_backfill,,mixed,Reference,430502,S,9400,22.359,mixed,lo,false,mixed_hangul_latin,
5133,압력용기 핸드북_기타,controlled_backfill,,mixed,Reference,1813221,M,51754,29.228,mixed,lo,false,mixed_hangul_latin,
3757,Industrial Safety and Health Management(7-ED)_2 Development of the safety and Health Function,controlled_backfill,,born_digital,study_note,2849250,M,139372,50.089,born-digital,lo,false,latin_dominant,
3758,Industrial Safety and Health Management(7-ED)_3 Concepts of Hazard Avoidance,controlled_backfill,,born_digital,study_note,1506926,M,106008,72.036,born-digital,lo,false,latin_dominant,
5163,국내 지속가능경영보고서의 노동인권 분야에 대한 실태 분석,controlled_backfill,,born_digital,study_note,640161,S,54423,87.055,born-digital,lo,false,unknown,
5167,우리나라 기업의 환경정보 공시 현황과 제도적 개선방안,controlled_backfill,,born_digital,Academic_Paper,718354,S,44395,63.284,born-digital,lo,false,hangul_dominant,
5154,국내 금속가공 중소기업의 스마트팩토리 활용 정도에 대한 실증적 연구,controlled_backfill,,born_digital,Academic_Paper,257730,S,23896,94.942,born-digital,lo,false,mixed_hangul_latin,
5155,스마트 팩토리의 전략적 활용 연구,controlled_backfill,,born_digital,Academic_Paper,693276,S,75466,111.467,born-digital,lo,false,mixed_hangul_latin,
5137,Pressure Vessel Design Manual_01 General Topics,controlled_backfill,,born_digital,Manual,2421078,M,123902,52.405,born-digital,lo,false,latin_dominant,
5211,PTB-4-2013_00_Foreword,controlled_backfill,,born_digital,Standard,416742,S,29162,71.656,born-digital,lo,false,unknown,
5178,Hydrogen_Piping_and_Pipelines_ASME_Code,controlled_backfill,,born_digital,Standard,3131091,M,1162861,380.305,born-digital,lo,false,unknown,
5168,TCoYourPaperlessOffice-4.0,controlled_backfill,,born_digital,Manual,1526520,M,253397,169.98,born-digital,lo,false,unknown,
3765,Industrial Safety and Health Management(7-ED)_10 Environmental Control and Noise,controlled_backfill,,born_digital,study_note,1408586,M,80003,58.16,born-digital,lo,false,latin_dominant,
3769,Industrial Safety and Health Management(7-ED)_14 Materials Handling and Storage,controlled_backfill,,born_digital,study_note,1785536,M,91902,52.706,born-digital,lo,false,latin_dominant,
5274,황현필의 진보를 위한 역사_12장 대한민국의 정신을 훼손하지 말라,controlled_backfill,,large,Academic_Paper,14996152,L,35398,2.417,scan-likely,lo,true,hangul_dominant,
5180,ASME Sec I 2025,controlled_backfill,,large,Standard,14890413,L,1603702,110.285,born-digital,lo,false,unknown,
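
One way to sanity-check the sample file is to recompute text_density from file_size and text_len for each row and compare it with the stored value. The sketch below assumes the column names shown in the header of pilot_1d_sample.csv and values stored with three decimal places. `density_mismatches` is a hypothetical helper.

```python
import csv
import io


def density_mismatches(csv_file, tolerance=0.001):
    """Return doc_ids whose stored text_density disagrees with the formula."""
    bad = []
    for row in csv.DictReader(csv_file):
        expected = int(row["text_len"]) / (int(row["file_size"]) / 1024)
        if abs(expected - float(row["text_density"])) > tolerance:
            bad.append(row["doc_id"])
    return bad


if __name__ == "__main__":
    # Two-row excerpt: the second row's density is deliberately wrong.
    sample = io.StringIO(
        "doc_id,file_size,text_len,text_density\n"
        "4809,1089182,177,0.166\n"
        "9999,1024,100,1.0\n"
    )
    print(density_mismatches(sample))
```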