feat(ask): Phase 3.5b guardrails: verifier + telemetry + stronger grounding

Adds four items on top of Phase 3.5a (classifier + refusal gate + grounding):

Item 0: wire up ask_events telemetry
- AskEvent ORM model + record_ask_event(), completing the ask_events INSERT
- Store input_snapshot (query, chunks, answer) in defense_layers
- Call telemetry on both the refused and normal paths

Item 3: numeric conflict detection across evidence
- Same unit with different numbers → weak flag
- Threshold expressions ("이상/이하/초과/미만": at least / at most / more than / less than) → skip (prevents false positives)

Item 4: improved fabricated_number normalization
- Add unit suffixes 건/원; extract both ends of range expressions (10~20%)
- Bare numbers only when they have 2 or more digits (removes single-digit false positives)

Item 1: exaone semantic verifier (wired with decision authority locked)
- verifier_service.py: 3s timeout, circuit breaker, 3 severity levels
- Only direct_negation is strong; numeric/intent → medium; everything else → weak
- A verifier strong result alone must not trigger a refusal; cross-checking with grounding is required
- 6-tier re-gate (finalized after 4 review rounds)
- grounding strong 2+ OR max_score<0.2 → skip verifier

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 09:49:56 +09:00
parent a0e1717206
commit b2306c3afd
9 changed files with 533 additions and 20 deletions
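The severity and skip rules from Item 1 of the commit message can be sketched as below. This is a minimal illustration only: the function names, the `Severity` alias, and the argument names are hypothetical and are not taken from verifier_service.py. The 6-tier re-gate and the grounding cross-check are not shown, because the commit message does not spell them out.

```python
from typing import Dict, Literal

Severity = Literal["strong", "medium", "weak"]

# Mapping stated in the commit message: only direct_negation is strong,
# numeric/intent are medium, everything else is weak.
_SEVERITY_BY_TYPE: Dict[str, Severity] = {
    "direct_negation": "strong",
    "numeric_conflict": "medium",
    "intent_core_mismatch": "medium",
}


def verifier_severity(contradiction_type: str) -> Severity:
    """Map a contradiction type to a severity (hypothetical helper)."""
    return _SEVERITY_BY_TYPE.get(contradiction_type, "weak")


def should_skip_verifier(grounding_strong_count: int, max_score: float) -> bool:
    """Skip the verifier when grounding already has 2+ strong flags
    or the best retrieval score is below 0.2 (hypothetical helper)."""
    return grounding_strong_count >= 2 or max_score < 0.2
```

Types the mapping does not list, such as `nuance` and `unsupported_claim`, fall through to `weak`.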
--- a/app/prompts/verifier.txt
+++ b/app/prompts/verifier.txt
@@ -0,0 +1,41 @@
+You are a grounding verifier. Given an answer and its evidence sources, check if the answer contradicts or fabricates information. Respond ONLY in JSON.
+
+## Contradiction Types (IMPORTANT — severity depends on type)
+- **direct_negation** (CRITICAL): Answer directly contradicts evidence. Examples: evidence "의무" but answer "권고"; evidence "금지" but answer "허용"; negation reversal ("~해야 한다" vs "~할 필요 없다").
+- **numeric_conflict**: Answer states a number different from evidence. "50명" in evidence but "100명" in answer. Only flag if the same concept is referenced.
+- **intent_core_mismatch**: Answer addresses a fundamentally different topic than the query asked about.
+- **nuance**: Answer overgeneralizes or adds qualifiers not in evidence (e.g., "모든" when evidence says "일부").
+- **unsupported_claim**: Answer makes a factual claim with no basis in any evidence.
+
+## Rules
+1. Compare each claim in the answer against the cited evidence. A claim with [n] citation should be checked against evidence [n].
+2. NOT a contradiction: Paraphrasing, summarizing, or restating the same fact in different words. Korean formal/informal style (합니다/한다) differences.
+3. Numbers must match exactly after normalization (1,000 = 1000).
+4. Legal/regulatory terms must preserve original meaning (의무 ≠ 권고, 금지 ≠ 제한, 허용 ≠ 금지).
+5. Maximum 5 contradictions (most severe first: direct_negation > numeric_conflict > intent_core_mismatch > nuance > unsupported_claim).
+
+## Output Schema
+{
+  "contradictions": [
+    {
+      "type": "direct_negation" | "numeric_conflict" | "intent_core_mismatch" | "nuance" | "unsupported_claim",
+      "severity": "critical" | "minor",
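+      "claim": "the relevant passage in the answer (50 characters or fewer)",
+      "evidence_ref": "the corresponding evidence content (50 characters or fewer, including [n])",
+      "explanation": "reason for the contradiction (in Korean, 30 characters or fewer)"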
+    }
+  ],
+  "verdict": "clean" | "minor_issues" | "major_issues"
+}
+
+severity mapping:
+- direct_negation → "critical"
+- All others → "minor"
+
+If no contradictions: {"contradictions": [], "verdict": "clean"}
+
+## Answer
+{answer}
+
+## Evidence
+{numbered_evidence}
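A caller consuming this prompt's JSON output might validate it roughly as follows. This is a sketch under assumptions, not the code from this commit: `parse_contradictions` and `normalize_number` are hypothetical names, and returning `None` on unparseable output is an assumed policy. It re-derives `severity` from `type` rather than trusting the model, orders items by the priority in Rule 5, and caps the list at five.

```python
import json
from typing import List, Optional

# Priority order from Rule 5 of the prompt (most severe first).
TYPE_PRIORITY = (
    "direct_negation",
    "numeric_conflict",
    "intent_core_mismatch",
    "nuance",
    "unsupported_claim",
)


def normalize_number(text: str) -> str:
    """Rule 3: treat 1,000 and 1000 as the same number."""
    return text.replace(",", "")


def parse_contradictions(raw: str) -> Optional[List[dict]]:
    """Return validated contradictions, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    items = data.get("contradictions", [])
    if not isinstance(items, list):
        return None
    # Keep only well-formed entries with a known type.
    valid = [
        dict(c) for c in items
        if isinstance(c, dict) and c.get("type") in TYPE_PRIORITY
    ]
    for c in valid:
        # Severity mapping from the prompt: direct_negation is critical.
        c["severity"] = "critical" if c["type"] == "direct_negation" else "minor"
    valid.sort(key=lambda c: TYPE_PRIORITY.index(c["type"]))
    return valid[:5]
```

Recomputing `severity` on the caller side means a model that mislabels a `nuance` item as critical cannot escalate it.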