feat(eval): v0.2 28 new cases + 2026-05-23 baseline + analysis #25

Merged
hyungi merged 1 commit from feat/eval-v0-2-baseline-analysis into main 2026-05-23 13:03:23 +09:00

On top of the v0.2 schema + harness from PR-1 (725a4e1), this PR adds 28 new
cases to complete the 51-case set, freezes a baseline with the current models,
and adds an analysis md covering the weak categories.

Distribution of the 28 new cases (planned +28 = standards +6 / english_only +8
/ mixed +5 / exam +7 / failure_expected +2 / ocr_derived 0):

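  • standards 5 → 11 (KGS FP111/FU551 + later parts of the occupational
    safety and health standards rules (산안기준) + the High-Pressure Gas
    Safety Control Act (고압가스법))
  • english_only 1 → 9 (Pressure Vessel Design Manual + ASME VIII/IX +
    Hydrogen ASME + Industrial Safety English textbook + Structural Analysis)
  • mixed 5 → 10 (Korean↔English ASME / KGS-English / bilingual pressure vessel)
  • exam 0 → 7 (gas engineer (가스기사) study_questions → mapped to library
    concept docs)
  • failure_expected 3 → 5 (KGS AC999 / "superconductivity safety management
    act" (초전도 안전 관리법))
  • ocr_derived 0 (TBD-O FAILED: extract_meta is NULL in 21,385 rows and
    chunks.source holds the RSS feed name. No column identifies OCR-derived
    content → the +4 cases were reallocated, as noted in the analysis)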
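The per-category counts can be cross-checked against the stated totals. The snippet below is only a sanity-check sketch using the numbers quoted in this description; the `counts` dictionary is illustrative and is not part of the harness, and categories the list does not mention (for example korean_only) are left out.

```python
# (before, after) case counts per category, copied from the list above.
counts = {
    "standards": (5, 11),
    "english_only": (1, 9),
    "mixed": (5, 10),
    "exam": (0, 7),
    "failure_expected": (3, 5),
    "ocr_derived": (0, 0),
}

# Number of cases added per category.
added = {name: after - before for name, (before, after) in counts.items()}
total_added = sum(added.values())

# The description states 23 existing cases in queries.yaml.
total_cases = 23 + total_added

print(added)
print(total_added, total_cases)
```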

Baseline results (corpus 21,385, hybrid mode, bge-m3 + bge-reranker-v2-m3):

  • v0.1 Recall@10 0.646, MRR 0.724, NDCG 0.606, Top-3 0.891
  • v0.2 graded NDCG 0.659, Recall@10 g≥2 0.695, g≥3 0.761
  • latency p50 528ms / p95 1,664ms
  • failure precision 0/5 (DS confidence threshold not applied)
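For readers unfamiliar with the graded metrics, here is a minimal sketch of how graded NDCG@10 and a grade-thresholded Recall@10 could be computed. It is not the harness code: the 0 to 3 grade scale, the `2**g - 1` gain, the log2 discount, and the function names are all assumptions made for illustration.

```python
import math

def dcg(grades):
    """Discounted cumulative gain with exponential gain (assumed formula)."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))

def graded_ndcg_at_k(ranked_ids, grade_of, k=10):
    """DCG of the returned ranking divided by DCG of the ideal ranking."""
    got = [grade_of.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grade_of.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(got) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_ids, grade_of, min_grade, k=10):
    """Share of docs graded >= min_grade that appear in the top k."""
    relevant = {d for d, g in grade_of.items() if g >= min_grade}
    if not relevant:
        return None  # no doc reaches this grade; the case would be skipped
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

# Hypothetical judgments for one query: doc id -> grade.
grade_of = {"a": 3, "b": 2, "c": 1}
perfect = graded_ndcg_at_k(["a", "b", "c"], grade_of)
partial = graded_ndcg_at_k(["b", "x", "a"], grade_of)
r2 = recall_at_k(["b", "x"], grade_of, min_grade=2)
r3 = recall_at_k(["b", "x"], grade_of, min_grade=3)
```

Under this definition each threshold has its own set of relevant documents, so the g≥3 figure is not bounded by the g≥2 figure and can come out higher.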

Top 3 weaknesses (analysis md):

  • mixed crosslingual 0.39 graded NDCG: TOP weakness, presumably a bge-m3
    multilingual limitation
  • korean_only natural language 0.51: presumably due to the lack of query
    rewrite
  • failure_expected 0/5: no confidence cutoff in place
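To illustrate the failure_expected item, the sketch below shows one way a confidence cutoff could turn low-scoring result lists into empty ones. It assumes that a failure_expected case counts as a hit only when nothing is returned; the function names, the score values, and the 0.4 threshold are hypothetical and are not taken from the harness.

```python
def apply_cutoff(results, min_score):
    """Keep only (doc_id, score) pairs scoring at or above min_score."""
    return [(doc_id, score) for doc_id, score in results if score >= min_score]

def failure_hits(cases, min_score):
    """Count failure_expected cases where nothing survives the cutoff."""
    return sum(1 for results in cases if not apply_cutoff(results, min_score))

# Hypothetical top results for three failure_expected queries.
cases = [
    [("d1", 0.31), ("d2", 0.22)],
    [("d3", 0.48)],
    [("d4", 0.12)],
]

no_cutoff = failure_hits(cases, min_score=0.0)    # every query returns something
with_cutoff = failure_hits(cases, min_score=0.4)  # only the 0.48 result survives
```

With no cutoff every query returns at least one document, which is consistent with a 0/5 score under this reading. Any real threshold would have to be tuned so that answerable queries are not cut off as well.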

Phase 2 dispatch recommendations (analysis md):

  • 2A Embedding bge-m3: start immediately (targets mixed and korean at once)
  • 2B Reranker: M (after 2A)
  • 2C OCR-Marker: needs a preceding chore (add an OCR identification column)
  • 2D STT: outside this eval set (needs a separate eval set)

Query rewrite is split out separately as Phase 2Q/Search-PR.

Affected files:

  • tests/search_eval/queries.yaml: 23 → 51 cases (no changes to the existing 23, append only)
  • tests/search_eval/baselines/v0_2_baseline_2026-05-23.json: new
  • tests/search_eval/baselines/v0_2_baseline_2026-05-23_analysis.md: new
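The append-only claim for queries.yaml can be checked mechanically. The sketch below assumes the file parses to a list of case mappings and compares two already-loaded versions. The `id` and `query` fields and the sample values are hypothetical, and loading is omitted to keep the example self-contained.

```python
def is_append_only(old_cases, new_cases):
    """True if new_cases starts with old_cases, unchanged and in order."""
    return len(new_cases) >= len(old_cases) and new_cases[:len(old_cases)] == old_cases

# Hypothetical case records standing in for the parsed YAML.
old = [{"id": "q01", "query": "KGS FP111"}, {"id": "q02", "query": "ASME VIII"}]
appended = old + [{"id": "q03", "query": "hydrogen piping"}]
edited = [{"id": "q01", "query": "KGS FP112"}] + appended[1:]

ok = is_append_only(old, appended)
broken = is_append_only(old, edited)
```

In practice the two lists would come from parsing the file at the base commit and at the PR head, and the check could run as a test before a baseline is regenerated.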

PR plan: ~/.claude/plans/pr-2-serialized-hummingbird.md
Phase 1 plan: ~/.claude/plans/phase-1-graded-eval-v0-2.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hyungi added 1 commit 2026-05-23 13:03:19 +09:00
hyungi merged commit e4cfd81e15 into main 2026-05-23 13:03:23 +09:00

Reference: hyungi/hyungi_document_server#25