feat(ask): Phase 3.5b guardrails (verifier + telemetry + grounding hardening)

Adds four items on top of Phase 3.5a (classifier + refusal gate + grounding):

Item 0: wire up ask_events telemetry
- AskEvent ORM model + record_ask_event(): completes the ask_events INSERT path
- Store input_snapshot (query, chunks, answer) in defense_layers
- Call telemetry on both the refused and normal paths

Item 3: numeric conflict detection across evidence
- Same unit, different numbers → weak flag
- Threshold expressions ("이상/이하/초과/미만": at least / at most / more than / less than) → skip (avoids false positives)
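
A minimal sketch of the Item 3 rule above, not the code in this commit. The unit list, the regex, and the names `extract_values` and `numeric_conflicts` are assumptions for illustration; it treats any two distinct values for the same unit as a conflict and ignores numbers followed by a threshold word.

```python
import re
from collections import defaultdict

# Threshold words from the commit message; the unit list is an assumption.
THRESHOLD_WORDS = ("이상", "이하", "초과", "미만")
NUM_UNIT = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(%|명|건|원)")


def extract_values(text: str) -> dict[str, set[float]]:
    """Collect numbers per unit, skipping values followed by a threshold word."""
    values: dict[str, set[float]] = defaultdict(set)
    for m in NUM_UNIT.finditer(text):
        # Look a few characters past the unit for "이상", "이하", etc.
        tail = text[m.end():m.end() + 4].lstrip()
        if tail.startswith(THRESHOLD_WORDS):
            continue  # thresholds are not point values, so they never conflict
        values[m.group(2)].add(float(m.group(1).replace(",", "")))
    return values


def numeric_conflicts(evidence: list[str]) -> list[dict]:
    """Return one weak flag per unit that appears with more than one value."""
    merged: dict[str, set[float]] = defaultdict(set)
    for chunk in evidence:
        for unit, nums in extract_values(chunk).items():
            merged[unit] |= nums
    return [
        {"flag": "numeric_conflict", "severity": "weak",
         "unit": unit, "values": sorted(nums)}
        for unit, nums in merged.items()
        if len(nums) > 1
    ]
```

Keeping the flag at weak matches the message: a unit match alone does not prove the two numbers describe the same concept.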

Item 4: improved fabricated_number normalization
- Add unit suffixes 건 (cases) and 원 (won); extract both endpoints of range expressions (10~20%)
- Bare numbers only when 2+ digits (drops 1-digit false positives)
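
The Item 4 rules could look roughly like this. This is a sketch under assumptions: `extract_numbers`, the regexes, and the full unit list are hypothetical (only 건/원 and the 10~20% range form come from this message), and only "~" is treated as a range separator.

```python
import re

UNITS = "%|명|건|원"  # 건/원 are the suffixes named above; % and 명 are assumed
NUM = r"\d+(?:\.\d+)?"
THOUSANDS = re.compile(r"(?<=\d),(?=\d{3})")          # 1,000 -> 1000
RANGE = re.compile(rf"({NUM})\s*~\s*({NUM})\s*({UNITS})")
WITH_UNIT = re.compile(rf"({NUM})\s*({UNITS})")
BARE = re.compile(r"(?<![\d.])\d{2,}(?!\.?\d)")        # integers with 2+ digits


def extract_numbers(text: str) -> set[str]:
    """Return normalized number tokens found in text."""
    text = THOUSANDS.sub("", text)
    found: set[str] = set()
    # Ranges first, so both endpoints inherit the trailing unit.
    for lo, hi, unit in RANGE.findall(text):
        found.add(lo + unit)
        found.add(hi + unit)
    rest = RANGE.sub(" ", text)
    for num, unit in WITH_UNIT.findall(rest):
        found.add(num + unit)
    rest = WITH_UNIT.sub(" ", rest)
    # Bare numbers: single digits (e.g. article numbers) are ignored.
    found.update(BARE.findall(rest))
    return found
```

Consuming matched spans before the next pass keeps "20%" from also being counted as a bare "20".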

Item 1: exaone semantic verifier (wired with its decision authority locked)
- verifier_service.py: 3s timeout, circuit breaker, 3 severity levels
- Only direct_negation is strong; numeric/intent → medium; everything else → weak
- A verifier strong alone must not trigger a refusal; it must be cross-checked against grounding
- 6-tier re-gate (finalized after 4 rounds of review)
- grounding strong 2+ OR max_score<0.2 → skip verifier
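
A sketch of the Item 1 gating rules as stated above. The 6-tier re-gate table itself is not spelled out in this message, so it is not reproduced here. All function names are hypothetical, and reading "cross-checked against grounding" as "at least one strong grounding flag" is an assumption.

```python
# Contradiction type -> verifier severity, per the bullets above.
SEVERITY = {
    "direct_negation": "strong",
    "numeric_conflict": "medium",
    "intent_core_mismatch": "medium",
}


def verifier_severity(contradiction_type: str) -> str:
    """Everything not listed explicitly is weak."""
    return SEVERITY.get(contradiction_type, "weak")


def should_skip_verifier(grounding_strong: int, max_score: float) -> bool:
    """Skip the verifier when grounding alone already decides the outcome."""
    return grounding_strong >= 2 or max_score < 0.2


def verifier_can_trigger_refusal(types: list[str], grounding_strong: int) -> bool:
    """A verifier strong never refuses by itself.

    Assumption: the cross-check requires at least one strong grounding flag.
    """
    verifier_strong = any(verifier_severity(t) == "strong" for t in types)
    return verifier_strong and grounding_strong >= 1
```

The skip rule also saves an LLM call on requests where the verifier could not change the decision.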

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyungi Ahn
2026-04-10 09:49:56 +09:00
parent a0e1717206
commit b2306c3afd
9 changed files with 533 additions and 20 deletions

app/prompts/verifier.txt Normal file

@@ -0,0 +1,41 @@
You are a grounding verifier. Given an answer and its evidence sources, check if the answer contradicts or fabricates information. Respond ONLY in JSON.
## Contradiction Types (IMPORTANT — severity depends on type)
- **direct_negation** (CRITICAL): Answer directly contradicts evidence. Examples: evidence "의무" but answer "권고"; evidence "금지" but answer "허용"; negation reversal ("~해야 한다" vs "~할 필요 없다").
- **numeric_conflict**: Answer states a number different from evidence. "50명" in evidence but "100명" in answer. Only flag if the same concept is referenced.
- **intent_core_mismatch**: Answer addresses a fundamentally different topic than the query asked about.
- **nuance**: Answer overgeneralizes or adds qualifiers not in evidence (e.g., "모든" when evidence says "일부").
- **unsupported_claim**: Answer makes a factual claim with no basis in any evidence.
## Rules
1. Compare each claim in the answer against the cited evidence. A claim with [n] citation should be checked against evidence [n].
2. NOT a contradiction: Paraphrasing, summarizing, or restating the same fact in different words. Korean formal/informal style (합니다/한다) differences.
3. Numbers must match exactly after normalization (1,000 = 1000).
4. Legal/regulatory terms must preserve original meaning (의무 ≠ 권고, 금지 ≠ 제한, 허용 ≠ 금지).
5. Maximum 5 contradictions (most severe first: direct_negation > numeric_conflict > intent_core_mismatch > nuance > unsupported_claim).
## Output Schema
{
  "contradictions": [
    {
      "type": "direct_negation" | "numeric_conflict" | "intent_core_mismatch" | "nuance" | "unsupported_claim",
      "severity": "critical" | "minor",
      "claim": "the relevant passage in the answer (50 characters max)",
      "evidence_ref": "the corresponding evidence text (50 characters max, including [n])",
      "explanation": "reason for the contradiction (in Korean, 30 characters max)"
    }
  ],
  "verdict": "clean" | "minor_issues" | "major_issues"
}
severity mapping:
- direct_negation → "critical"
- All others → "minor"
If no contradictions: {"contradictions": [], "verdict": "clean"}
## Answer
{answer}
## Evidence
{numbered_evidence}
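
For reference, one way a caller might fill this template and read the reply. This is a sketch, not the code in verifier_service.py: `build_prompt` and `parse_contradictions` are hypothetical names. Placeholders are substituted with `str.replace` because the Output Schema contains literal braces that `str.format` would try to interpret, and returning an empty list on unparseable output is an assumed fallback.

```python
import json

# Order doubles as the "most severe first" ranking from rule 5.
TYPES = ("direct_negation", "numeric_conflict", "intent_core_mismatch",
         "nuance", "unsupported_claim")


def build_prompt(template: str, answer: str, evidence: list[str]) -> str:
    """Fill {answer} and {numbered_evidence} without touching other braces."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(evidence, start=1))
    return (template.replace("{numbered_evidence}", numbered)
                    .replace("{answer}", answer))


def parse_contradictions(raw: str) -> list[dict]:
    """Validate the model's JSON and re-apply the severity mapping locally."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # assumed fallback: unparseable output yields no findings
    items = data.get("contradictions", []) if isinstance(data, dict) else []
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict) or item.get("type") not in TYPES:
            continue
        # Do not trust the model's severity field; derive it from the type.
        severity = "critical" if item["type"] == "direct_negation" else "minor"
        kept.append({**item, "severity": severity})
    kept.sort(key=lambda c: TYPES.index(c["type"]))
    return kept[:5]  # rule 5: at most five contradictions
```

Recomputing severity from the type keeps the mapping under the caller's control even if the model mislabels an item.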