hyungi_document_server/tests/calibrate_fixtures/sample_ask_events.json
Hyungi Ahn 99abd287dc feat(scripts): Phase 3.5 — calibrate_ask.py CLI (Q0~Q8 + render + FP CSV)
scripts/calibrate_ask.py: a permanent tool that aggregates ask_events and renders a markdown report.

Features:
- argparse: --source / --prompt-version / --since / --until / --eval-split
  (tuning|confirm|all; deterministic split based on a hash of the id) / --run-label /
  --output / --format md|json / --compare-against / --sample-limit /
  --fp-artifacts / --inspect-shape / --dry-run
- 9 fetchers (all read-only SELECT):
  - Q0 defense_layers shape inspection
  - Q1 re-gate tier distribution
  - Q2 max_rerank_score histogram (bucket × bin)
  - Q3 classifier confusion matrix
  - Q4 verifier severity distribution (cast + COALESCE, NULL-safe)
  - Q5 hallucination_flags top-K (UNION ALL with outer wrap, strong/weak column retained)
  - Q6 eval golden mismatch (join on eval_case_id + query string fallback)
  - Q7 FP candidates (cases A/B/C separated + candidate_reason column + LIMIT/3 split across cases)
  - Q8 answer_length p25/p50/p75 distribution (E.3 v1↔v2 comparison axis)
- markdown render + json baseline + delta compare (--compare-against)
- FP CSV dump (artifacts/fp_candidates_{run_label}.csv) with is_true_fp left blank
- dry-run: validates output against tests/calibrate_fixtures/sample_ask_events.json
- --threshold-overrides: planned for v2 once the Step 0 feasibility check passes (currently a stub that raises)
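
The --eval-split option above is described as a deterministic split based on a hash of the id. A minimal sketch of how such a split could work follows. The function names, the choice of SHA-256, and the 50/50 ratio are all assumptions made for illustration, not the script's actual implementation.

```python
import hashlib


def eval_split(row_id: int, tuning_fraction: float = 0.5) -> str:
    """Assign a row to 'tuning' or 'confirm' from a stable hash of its id.

    Hypothetical helper: hash function and ratio are assumptions.
    """
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "tuning" if bucket < tuning_fraction else "confirm"


def matches_split(row_id: int, requested: str) -> bool:
    """Return True if the row belongs to the requested split; 'all' keeps every row."""
    return requested == "all" or eval_split(row_id) == requested
```

Because the assignment depends only on the id, the same row lands in the same split on every run, which keeps tuning and confirm sets disjoint across repeated calibrations.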

Read-only enforced: zero INSERT/UPDATE/DELETE/ALTER/DROP/TRUNCATE statements.
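
One way to check the read-only guarantee mechanically is a keyword scan over each query string before it is executed. The sketch below is an assumption about how such a guard might look: `assert_read_only` is a hypothetical name, and the commit does not say whether the script checks this at runtime or by review.

```python
import re

# Keywords listed in the commit message as forbidden.
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "ALTER", "DROP", "TRUNCATE")
_FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def assert_read_only(sql: str) -> None:
    """Raise ValueError if the SQL text contains a write or DDL keyword.

    Hypothetical guard: a plain keyword scan, not a SQL parser.
    """
    match = _FORBIDDEN_RE.search(sql)
    if match:
        raise ValueError(f"write statement not allowed: {match.group(1).upper()}")
```

The word-boundary match lets identifiers such as `updated_at` pass, but a string literal that contains one of the keywords would still be rejected; that is a limitation of scanning text instead of parsing it.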

tests/calibrate_fixtures/sample_ask_events.json: dry-run snapshot fixture.
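
For illustration, a dry run over this fixture could render one section as a markdown table roughly as follows. `render_table` is a hypothetical helper and the column layout is an assumption; the real report format is not shown in this commit.

```python
def render_table(rows: list[dict], columns: list[str]) -> str:
    """Render a list of row dicts as a markdown table with the given columns."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


# First two rows of the fixture's "regate" section.
regate = [
    {"tier": "clean", "n": 5, "pct": 50.0},
    {"tier": "partial(strong_or_negation)", "n": 3, "pct": 30.0},
]
table = render_table(regate, ["tier", "n", "pct"])
```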

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 08:11:06 +09:00


{
  "total_rows": 10,
  "regate": [
    {"tier": "clean", "n": 5, "pct": 50.0},
    {"tier": "partial(strong_or_negation)", "n": 3, "pct": 30.0},
    {"tier": "refuse(grounding_2+strong)", "n": 1, "pct": 10.0},
    {"tier": "conf_low(medium_x3)", "n": 1, "pct": 10.0}
  ],
  "score_hist": [
    {"bucket": "full", "bin": 9, "n": 4, "avg_score": 0.87},
    {"bucket": "full", "bin": 8, "n": 1, "avg_score": 0.78},
    {"bucket": "partial", "bin": 5, "n": 3, "avg_score": 0.51},
    {"bucket": "refused", "bin": 2, "n": 1, "avg_score": 0.18},
    {"bucket": "insufficient", "bin": 1, "n": 1, "avg_score": 0.08}
  ],
  "classifier": [
    {"verdict": "sufficient", "completeness": "full", "refused": false, "n": 5},
    {"verdict": "sufficient", "completeness": "partial", "refused": false, "n": 3},
    {"verdict": "insufficient", "completeness": "insufficient", "refused": true, "n": 2}
  ],
  "verifier": [
    {"status": "ok", "medium_count": 0, "strong_count": 0, "completeness": "full", "n": 5},
    {"status": "ok", "medium_count": 1, "strong_count": 0, "completeness": "partial", "n": 2},
    {"status": "ok", "medium_count": 3, "strong_count": 0, "completeness": "partial", "n": 1},
    {"status": "skipped", "medium_count": 0, "strong_count": 0, "completeness": "insufficient", "n": 2}
  ],
  "flags": [
    {"flag_type": "fabricated_number", "strength": "strong", "n": 2},
    {"flag_type": "uncited_claim", "strength": "weak", "n": 4},
    {"flag_type": "low_overlap", "strength": "weak", "n": 3},
    {"flag_type": "intent_misalignment", "strength": "strong", "n": 1}
  ],
  "fabricated_rate": {
    "total": 10,
    "fabricated_strong_hit": 2,
    "rate": 0.2
  },
  "fp_candidates": [
    {
      "id": 101,
      "candidate_reason": "refused_high_rerank",
      "query": "샘플 질의 1",
      "completeness": "insufficient",
      "refused": true,
      "classifier_verdict": "insufficient",
      "max_rerank_score": 0.42,
      "aggregate_score": 1.05,
      "g_strong": [],
      "v_medium": "0",
      "re_gate": "refuse(score_gate)",
      "answer_length": 0,
      "prompt_version": "search_synthesis.v1-400char",
      "source": "eval",
      "eval_case_id": "ask_def_001",
      "created_at": "2026-04-17T08:00:00+00:00"
    }
  ],
  "answer_length": [
    {"bucket": "full", "p25": 280, "p50": 350, "p75": 395, "avg": 340, "n": 5},
    {"bucket": "partial", "p25": 200, "p50": 260, "p75": 320, "avg": 255, "n": 3},
    {"bucket": "refused", "p25": 0, "p50": 0, "p75": 0, "avg": 0, "n": 2}
  ]
}
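
A snapshot fixture like this one is easy to break by hand-editing, so a small consistency check can help. The sketch below assumes that every per-row section should sum to total_rows and that rate equals fabricated_strong_hit / total. Both hold for the data above, but they are inferred from it, not documented rules. `check_fixture` is a hypothetical helper. The flags and fp_candidates sections are left out on the assumption that they are per-flag counts and a sample, respectively.

```python
import math

# Sections whose row counts are assumed to partition all rows.
COUNTED_SECTIONS = ("regate", "score_hist", "classifier", "verifier", "answer_length")


def check_fixture(data: dict) -> list[str]:
    """Return a list of human-readable consistency problems (empty if none)."""
    problems = []
    total = data["total_rows"]
    for section in COUNTED_SECTIONS:
        section_total = sum(row["n"] for row in data.get(section, []))
        if section_total != total:
            problems.append(f"{section}: n sums to {section_total}, expected {total}")
    rate = data.get("fabricated_rate")
    if rate and not math.isclose(rate["rate"], rate["fabricated_strong_hit"] / rate["total"]):
        problems.append("fabricated_rate: rate != fabricated_strong_hit / total")
    return problems


# Trimmed copy of the fixture above, keeping only the fields the check reads.
sample = {
    "total_rows": 10,
    "regate": [{"n": 5}, {"n": 3}, {"n": 1}, {"n": 1}],
    "score_hist": [{"n": 4}, {"n": 1}, {"n": 3}, {"n": 1}, {"n": 1}],
    "classifier": [{"n": 5}, {"n": 3}, {"n": 2}],
    "verifier": [{"n": 5}, {"n": 2}, {"n": 1}, {"n": 2}],
    "answer_length": [{"n": 5}, {"n": 3}, {"n": 2}],
    "fabricated_rate": {"total": 10, "fabricated_strong_hit": 2, "rate": 0.2},
}
problems = check_fixture(sample)
```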