99abd287dc
scripts/calibrate_ask.py: permanent tool for ask_events aggregation + markdown report.
Features:
- argparse: --source / --prompt-version / --since / --until / --eval-split
  (tuning|confirm|all, deterministic split based on id hash) / --run-label /
  --output / --format md|json / --compare-against / --sample-limit /
  --fp-artifacts / --inspect-shape / --dry-run
- 9 fetchers (all read-only SELECT):
  - Q0 defense_layers shape inspect
  - Q1 re-gate tier distribution
  - Q2 max_rerank_score histogram (bucket × bin)
  - Q3 classifier confusion matrix
  - Q4 verifier severity distribution (cast + COALESCE, NULL-safe)
  - Q5 hallucination_flags top-K (UNION ALL with outer wrap, strong/weak columns kept)
  - Q6 eval golden mismatch (join on eval_case_id + query string fallback)
  - Q7 FP candidates (cases A/B/C separated + candidate_reason column + LIMIT/3 allocation)
  - Q8 answer_length p25/p50/p75 distribution (E.3 v1↔v2 comparison axis)
- markdown render + json baseline + delta compare (compare-against)
- FP CSV dump (artifacts/fp_candidates_{run_label}.csv) with is_true_fp left blank
- dry-run: verifies output using tests/calibrate_fixtures/sample_ask_events.json
- --threshold-overrides: deferred to v2, after Step 0 feasibility passes (currently a stub that raises)
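
The deterministic --eval-split listed above can be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper names assign_split and in_split, the SHA-256 choice, and the 50/50 ratio are all assumptions.

```python
import hashlib


def assign_split(event_id: str, tuning_percent: int = 50) -> str:
    """Map an id to 'tuning' or 'confirm' (hypothetical helper).

    The same id always lands in the same split, across runs and machines,
    because the bucket depends only on the hash of the id.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer and reduce to 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "tuning" if bucket < tuning_percent else "confirm"


def in_split(event_id: str, eval_split: str) -> bool:
    """'all' keeps every row; otherwise keep only rows in the named split."""
    return eval_split == "all" or assign_split(event_id) == eval_split
```

Python's built-in hash() is avoided here because it is salted per process for strings, which would make the split differ between runs.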
Read-only enforced: 0 occurrences of INSERT/UPDATE/DELETE/ALTER/DROP/TRUNCATE.
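
One way to check the read-only property mechanically is a keyword scan over the query strings. The sketch below is an assumption about how such a guard could look, not what the script does; find_write_keywords and assert_read_only are hypothetical names.

```python
import re

# Keyword list taken from the read-only statement in this commit message.
FORBIDDEN = ("INSERT", "UPDATE", "DELETE", "ALTER", "DROP", "TRUNCATE")
_FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def find_write_keywords(sql: str) -> list[str]:
    """Return forbidden keywords found in sql, uppercased, in order of appearance."""
    return [m.group(1).upper() for m in _FORBIDDEN_RE.finditer(sql)]


def assert_read_only(queries: dict[str, str]) -> None:
    """Raise ValueError if any named query contains a write/DDL keyword."""
    for name, sql in queries.items():
        hits = find_write_keywords(sql)
        if hits:
            raise ValueError(f"{name} contains write keyword(s): {hits}")
```

The match is on whole words, so a column such as updated_at does not trip it. It is a plain text scan, though, so a keyword inside a string literal or comment would still be flagged.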
tests/calibrate_fixtures/sample_ask_events.json: dry-run snapshot fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>