feat(scripts): Phase 3.5 — calibrate_ask.py CLI (Q0~Q8 + render + FP CSV)
scripts/calibrate_ask.py: a permanent tool that aggregates ask_events and renders a markdown report.
Features:
- argparse: --source / --prompt-version / --since / --until / --eval-split
  (tuning|confirm|all, deterministic split based on an id hash) / --run-label /
  --output / --format md|json / --compare-against / --sample-limit /
  --fp-artifacts / --inspect-shape / --dry-run
- 9 fetchers (all read-only SELECT):
  - Q0 defense_layers shape inspect
  - Q1 re-gate tier distribution
  - Q2 max_rerank_score histogram (bucket × bin)
  - Q3 classifier confusion matrix
  - Q4 verifier severity distribution (cast + COALESCE, NULL-safe)
  - Q5 hallucination_flags top-K (UNION ALL outer wrap, strong/weak columns kept)
  - Q6 eval golden mismatch (join on eval_case_id + query string fallback)
  - Q7 FP candidates (cases A/B/C separated + candidate_reason column + LIMIT/3 split across cases)
  - Q8 answer_length p25/p50/p75 distribution (E.3 v1↔v2 comparison axis)
- markdown render + json baseline + delta compare (compare-against)
- FP CSV dump (artifacts/fp_candidates_{run_label}.csv) with is_true_fp left blank
- dry-run: verifies output using tests/calibrate_fixtures/sample_ask_events.json
- --threshold-overrides: planned for v2, after Step 0 feasibility passes (currently a stub that raises)
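The id-hash-based --eval-split mentioned in the argparse item could look roughly like the sketch below. The helper names (`eval_split`, `filter_by_split`), the use of SHA-256, and the 50/50 ratio are assumptions for illustration; the commit does not state which hash or ratio calibrate_ask.py actually uses.

```python
import hashlib


def eval_split(row_id, confirm_ratio=0.5):
    """Map an id to 'tuning' or 'confirm' using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into the unit interval.
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "confirm" if fraction < confirm_ratio else "tuning"


def filter_by_split(rows, split):
    """Keep rows for the requested split; 'all' disables filtering."""
    if split == "all":
        return list(rows)
    if split not in ("tuning", "confirm"):
        raise ValueError(f"unknown split: {split}")
    return [row for row in rows if eval_split(row["id"]) == split]


rows = [{"id": i} for i in range(1, 101)]
tuning = filter_by_split(rows, "tuning")
confirm = filter_by_split(rows, "confirm")
print(len(tuning), len(confirm))
```

A cryptographic digest is used here instead of the built-in `hash()` because string hashing in Python is randomized per process, which would break reproducibility between runs.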
Read-only enforced: zero INSERT/UPDATE/DELETE/ALTER/DROP/TRUNCATE statements.
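One way to check the read-only property mechanically is a keyword scan over each query string before it is executed. This is only a sketch: `assert_read_only` is a hypothetical name, and the commit does not say whether the script enforces the rule at runtime or only by review.

```python
import re

# Keywords taken from the commit message; the scan itself is an assumption.
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "ALTER", "DROP", "TRUNCATE")
_FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def assert_read_only(sql):
    """Raise ValueError if the SQL text contains a write or DDL keyword."""
    match = _FORBIDDEN_RE.search(sql)
    if match:
        raise ValueError(f"write keyword in query: {match.group(1).upper()}")


assert_read_only("SELECT completeness, COUNT(*) FROM ask_events GROUP BY completeness")
assert_read_only("SELECT updated_at FROM ask_events")  # 'updated_at' is not the keyword UPDATE
```

The scan is deliberately conservative: it would also reject a SELECT whose string literal contains one of the keywords as a whole word, which seems acceptable for a fixed set of reporting queries.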
tests/calibrate_fixtures/sample_ask_events.json: dry-run snapshot fixture.
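Because the fixture is a hand-written snapshot, a small consistency check can catch arithmetic slips before a dry run. The sketch below assumes that each per-row section partitions all rows (so its `n` values sum to `total_rows`) and that `pct` and `rate` are derived from the counts; `check_fixture` is a hypothetical helper, and the sample is an excerpt of the fixture values in this commit.

```python
import math


def check_fixture(data):
    """Return a list of inconsistencies found in a fixture dict (empty when clean)."""
    total = data["total_rows"]
    if total <= 0:
        return ["total_rows must be positive"]
    problems = []
    for section in ("regate", "score_hist", "classifier", "verifier", "answer_length"):
        if section not in data:
            continue  # excerpts may omit sections
        counted = sum(item["n"] for item in data[section])
        if counted != total:
            problems.append(f"{section}: n sums to {counted}, expected {total}")
    for item in data.get("regate", []):
        expected_pct = 100 * item["n"] / total
        if abs(item["pct"] - expected_pct) > 0.05:
            problems.append(f"regate {item['tier']}: pct {item['pct']} != {expected_pct:.1f}")
    rate = data.get("fabricated_rate")
    if rate and not math.isclose(rate["rate"], rate["fabricated_strong_hit"] / rate["total"]):
        problems.append("fabricated_rate: rate != fabricated_strong_hit / total")
    return problems


sample = {
    "total_rows": 10,
    "regate": [
        {"tier": "clean", "n": 5, "pct": 50.0},
        {"tier": "partial(strong_or_negation)", "n": 3, "pct": 30.0},
        {"tier": "refuse(grounding_2+strong)", "n": 1, "pct": 10.0},
        {"tier": "conf_low(medium_x3)", "n": 1, "pct": 10.0},
    ],
    "fabricated_rate": {"total": 10, "fabricated_strong_hit": 2, "rate": 0.2},
}
print(check_fixture(sample))
```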
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,63 @@
{
  "total_rows": 10,
  "regate": [
    {"tier": "clean", "n": 5, "pct": 50.0},
    {"tier": "partial(strong_or_negation)", "n": 3, "pct": 30.0},
    {"tier": "refuse(grounding_2+strong)", "n": 1, "pct": 10.0},
    {"tier": "conf_low(medium_x3)", "n": 1, "pct": 10.0}
  ],
  "score_hist": [
    {"bucket": "full", "bin": 9, "n": 4, "avg_score": 0.87},
    {"bucket": "full", "bin": 8, "n": 1, "avg_score": 0.78},
    {"bucket": "partial", "bin": 5, "n": 3, "avg_score": 0.51},
    {"bucket": "refused", "bin": 2, "n": 1, "avg_score": 0.18},
    {"bucket": "insufficient", "bin": 1, "n": 1, "avg_score": 0.08}
  ],
  "classifier": [
    {"verdict": "sufficient", "completeness": "full", "refused": false, "n": 5},
    {"verdict": "sufficient", "completeness": "partial", "refused": false, "n": 3},
    {"verdict": "insufficient", "completeness": "insufficient", "refused": true, "n": 2}
  ],
  "verifier": [
    {"status": "ok", "medium_count": 0, "strong_count": 0, "completeness": "full", "n": 5},
    {"status": "ok", "medium_count": 1, "strong_count": 0, "completeness": "partial", "n": 2},
    {"status": "ok", "medium_count": 3, "strong_count": 0, "completeness": "partial", "n": 1},
    {"status": "skipped", "medium_count": 0, "strong_count": 0, "completeness": "insufficient", "n": 2}
  ],
  "flags": [
    {"flag_type": "fabricated_number", "strength": "strong", "n": 2},
    {"flag_type": "uncited_claim", "strength": "weak", "n": 4},
    {"flag_type": "low_overlap", "strength": "weak", "n": 3},
    {"flag_type": "intent_misalignment", "strength": "strong", "n": 1}
  ],
  "fabricated_rate": {
    "total": 10,
    "fabricated_strong_hit": 2,
    "rate": 0.2
  },
  "fp_candidates": [
    {
      "id": 101,
      "candidate_reason": "refused_high_rerank",
      "query": "샘플 질의 1",
      "completeness": "insufficient",
      "refused": true,
      "classifier_verdict": "insufficient",
      "max_rerank_score": 0.42,
      "aggregate_score": 1.05,
      "g_strong": [],
      "v_medium": "0",
      "re_gate": "refuse(score_gate)",
      "answer_length": 0,
      "prompt_version": "search_synthesis.v1-400char",
      "source": "eval",
      "eval_case_id": "ask_def_001",
      "created_at": "2026-04-17T08:00:00+00:00"
    }
  ],
  "answer_length": [
    {"bucket": "full", "p25": 280, "p50": 350, "p75": 395, "avg": 340, "n": 5},
    {"bucket": "partial", "p25": 200, "p50": 260, "p75": 320, "avg": 255, "n": 3},
    {"bucket": "refused", "p25": 0, "p50": 0, "p75": 0, "avg": 0, "n": 2}
  ]
}
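For the FP CSV dump described in the commit message, a minimal sketch of writing fp_candidates rows with an empty is_true_fp column for manual labeling might look like this. `dump_fp_candidates` is a hypothetical name, the column order is assumed to follow the first row's keys, and the sample row is a subset of the fixture entry above.

```python
import csv
import io


def dump_fp_candidates(candidates, fh):
    """Write candidates as CSV with a trailing blank is_true_fp column."""
    if not candidates:
        return 0
    # Assumes every row has the same keys as the first one.
    fieldnames = list(candidates[0]) + ["is_true_fp"]
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    for row in candidates:
        writer.writerow({**row, "is_true_fp": ""})
    return len(candidates)


candidates = [{
    "id": 101,
    "candidate_reason": "refused_high_rerank",
    "query": "샘플 질의 1",
    "max_rerank_score": 0.42,
    "re_gate": "refuse(score_gate)",
}]
buffer = io.StringIO()
dump_fp_candidates(candidates, buffer)
print(buffer.getvalue())
```

When writing to a real file such as artifacts/fp_candidates_{run_label}.csv, opening it with `newline=""` and `encoding="utf-8"` keeps line endings and the non-ASCII query text intact.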