hyungi_document_server/app/services/search/evidence_service.py
Hyungi Ahn 34f79f84f2 feat(search): B-2 switch evidence LLM to 4B triage + answerability columns
Original intent of the plan: evidence selection runs on 4B, synthesis on 26B.

- evidence_service: switch the LLM call from primary (26B MLX) to triage (4B Ollama).
  Ollama can serve concurrent requests, so get_mlx_gate() is removed. synthesis still
  goes through the llm_gate Semaphore(1) to protect MLX.
- prompt_version bumped to v3-evidence-triage (the synthesis prompt itself stays at
  v2-600char; the evidence LLM path change is tracked separately).
- migrations 161/162: add answerability / partial_basis / suggested_query_count
  columns plus a partial index to analyze_events. /ask already records
  completeness (full/partial/insufficient) in ask_events, so the analyze_events
  columns are reserved for when document analysis adopts the answerability concept.
- telemetry: record_analyze_event gains answerability / partial_basis /
  suggested_query_count parameters.

The existing /ask 3-state completeness logic (classifier_service + 7-tier gate) is
unchanged; it was already completed in Phase 3.5a. B-2 focuses on redistributing
LLM load and extending observability.

MLX load reduction: previously each query made two MLX calls, evidence (26B) +
synthesis (26B). Now it is evidence (4B Ollama) + synthesis (26B MLX), which halves
the MLX calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:33:32 +09:00
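
The load split described in the commit message can be sketched roughly as below. This is only an illustration under stated assumptions, not the repository's code: `run_queries`, `fake_call`, and the local `mlx_gate` semaphore are hypothetical stand-ins for the real `/ask` flow, the model calls, and `get_mlx_gate()`.

```python
import asyncio


async def run_queries(n_queries: int) -> dict[str, int]:
    """Simulate n concurrent queries and report peak in-flight calls per backend."""
    mlx_gate = asyncio.Semaphore(1)  # hypothetical stand-in for get_mlx_gate()
    active = {"triage": 0, "mlx": 0}
    peak = {"triage": 0, "mlx": 0}

    async def fake_call(backend: str) -> None:
        # Track how many calls are in flight on this backend.
        active[backend] += 1
        peak[backend] = max(peak[backend], active[backend])
        await asyncio.sleep(0.01)  # pretend inference time
        active[backend] -= 1

    async def ask(_: int) -> None:
        # Evidence extraction: 4B triage, no gate, so calls may overlap.
        await fake_call("triage")
        # Synthesis: 26B MLX, serialized through the gate.
        async with mlx_gate:
            await fake_call("mlx")

    await asyncio.gather(*(ask(i) for i in range(n_queries)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_queries(4)))
```

In this toy run the triage calls overlap freely while the gated calls never exceed one in flight, which is the property the B-2 change relies on.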

"""Evidence extraction 서비스 (Phase 3.2).
reranker 결과 chunks 에서 query-relevant span 을 구조적으로 추출한다.
## 설계 (EV-A: Rule + LLM span select)
```
reranked results
[rule filter] score >= 0.25, max_per_doc=2, top MAX_EVIDENCE_CANDIDATES
[snippet 재윈도우] _extract_window(full, query, 800) — LLM 입력용
[1 batched LLM call] gemma-4 via get_mlx_gate() (single inference)
[post-process]
- relevance >= 0.5 필터
- span too-short (< 80자) → _extract_window(full, query, 120) 로 재확장
- span too-long (> 300자) → cut
- doc-group ordering (검색 결과 doc 순서 유지, doc 내부만 relevance desc)
- n 재부여 (1..N)
EvidenceItem 리스트
```
## 영구 룰
- **LLM 호출은 1번만** (batched). 순차 호출 절대 금지.
- **B-2 변경**: evidence 추출은 triage(4B Ollama) 로 전환 — Ollama 는 concurrent
OK 라 `get_mlx_gate()` 불필요. primary(26B MLX) 는 synthesis 전용 보호.
- 기존 analyzer / synthesis 의 `get_mlx_gate()` 공유는 유지 — 26B 경로에만 적용.
- **fallback span 도 query 중심 window**. `full_snippet[:200]` 같은 "앞에서부터
자르기" 절대 금지. 조용한 품질 붕괴 (citation 은 멀쩡한데 실제 span 이 query
와 무관) 대표 사례.
- **Span too-short 보정 필수**: `len(span) < 80` 이면 자동 확장. "짧을수록
정확" 이 아니라 **짧으면 위험** — synthesis LLM 이 문맥 부족으로 이어 만들기
(soft hallucination) 를 한다.
- **Evidence ordering 은 doc-group 유지**. 전역 relevance desc 정렬 금지.
answer 는 [1][2][3] 순서로 생성되고 그 순서가 문맥 흐름을 결정한다.
## 확장 여지 (지금은 비활성)
`EVIDENCE_FAST_PATH_THRESHOLD` 가 `None` 이 아니고 `results[0].rerank_score >=
THRESHOLD` 이면 LLM 호출 스킵 후 rule-only 경로로 즉시 반환. Activation 조건:
(1) evidence LLM 호출 비율 > 80%, (2) /ask 평균 latency > 15s, (3) rerank
top1 p50 > 0.75. 셋 다 충족해야 켠다.
"""
from __future__ import annotations

import asyncio
import re
import time
from dataclasses import dataclass, field
from typing import TYPE_CHECKING

from ai.client import AIClient, _load_prompt, parse_json_response
from core.utils import setup_logger

from .rerank_service import _extract_window

if TYPE_CHECKING:
    from api.search import SearchResult

logger = setup_logger("evidence")

# ─── Constants (permanent plan rules) ────────────────────
EVIDENCE_MIN_RERANK = 0.25  # first-pass rule cut: drop results whose rerank score is below this
MAX_EVIDENCE_CANDIDATES = 6  # upper bound on LLM input candidates
MAX_PER_DOC = 2
CANDIDATE_SNIPPET_CHARS = 1200  # source window the LLM sees (800→1200; 6×1200 chars ≈ 2880 tok < gemma-4 4096 budget)
MIN_EVIDENCE_FOR_SYNTHESIS = 3  # sparse evidence supplement threshold
MIN_RELEVANCE_KEEP = 0.5  # filter applied to LLM output
SPAN_MIN_CHARS = 80  # below this, enlarge the window
SPAN_ENLARGE_TARGET = 120  # target_chars used when re-windowing on enlarge
SPAN_MAX_CHARS = 300  # above this, cut (protects the synthesis token budget)
LLM_TIMEOUT_MS = 15000
PROMPT_VERSION = "v2-triage"  # B-2: switched from primary (26B MLX) to triage (4B Ollama)
# Room for extension: None means disabled (baseline). Enable with e.g. 0.8 after measurement.
EVIDENCE_FAST_PATH_THRESHOLD: float | None = None

# ─── Return type ─────────────────────────────────────────
@dataclass(slots=True)
class EvidenceItem:
    """A single evidence span extracted by the LLM or by the rule fallback.

    n is reassigned from 1 after doc-group ordering + relevance sorting.
    `full_snippet` must **never be included in the synthesis prompt**; it is for
    debugging / viewing the citation source only.
    """

    n: int  # 1-based, maps to [n] in the synthesis prompt
    chunk_id: int | None
    doc_id: int
    title: str | None
    section_title: str | None
    span_text: str  # span from the LLM (or rule fallback), 80~300 chars
    relevance: float  # LLM 0~1 (copied from rerank_score on fallback)
    rerank_score: float  # raw reranker score
    full_snippet: str  # source window, up to CANDIDATE_SNIPPET_CHARS chars (debug/citation only, never in synthesis)
    source: str = "llm"  # "llm" | "supplement" | "rule_fallback"

# ─── Prompt loading (once, at module init) ───────────────
try:
    EVIDENCE_PROMPT = _load_prompt("evidence_extract.txt")
except FileNotFoundError:
    EVIDENCE_PROMPT = ""
    logger.warning(
        "evidence_extract.txt not found — evidence_service will always use rule-only fallback"
    )

# ─── Helper: candidates → LLM input block ────────────────
def _build_numbered_candidates(
    candidates: list["SearchResult"], query: str
) -> tuple[str, list[str]]:
    """Build the {numbered_candidates} prompt block and the re-windowed full_snippet list.

    Returns:
        (block_str, full_snippets): full_snippets[i] is the source text for 1-based n=i+1
    """
    lines: list[str] = []
    full_snippets: list[str] = []
    for i, c in enumerate(candidates, 1):
        title = (c.title or "").strip()
        raw_text = c.snippet or ""
        full = _extract_window(raw_text, query, target_chars=CANDIDATE_SNIPPET_CHARS)
        full_snippets.append(full)
        lines.append(f"[{i}] title: {title} / text: {full}")
    return "\n".join(lines), full_snippets

# ─── Helper: span length correction ─────────────────────
def _normalize_span(span: str, full: str, query: str) -> tuple[str, bool]:
    """Clamp a span into the SPAN_MIN_CHARS ~ SPAN_MAX_CHARS range.

    Returns:
        (normalized_span, was_expanded)
        - was_expanded=True means the span counts toward the "short_span_expanded" log
    """
    s = (span or "").strip()
    expanded = False
    if len(s) < SPAN_MIN_CHARS:
        # Guard against soft hallucination: re-expand the window around the query
        s = _extract_window(full, query, target_chars=SPAN_ENLARGE_TARGET)
        expanded = True
    if len(s) > SPAN_MAX_CHARS:
        s = s[:SPAN_MAX_CHARS]
    return s, expanded

# ─── Helper: doc-group ordering ─────────────────────────
def _apply_doc_group_ordering(
    items: list[EvidenceItem],
    results: list["SearchResult"],
) -> list[EvidenceItem]:
    """Keep the search-result doc order, sort by relevance desc within a doc, renumber n.

    The answer is generated in [1][2][3] order, and that order determines the flow
    of context. A global relevance-desc sort jumps around, e.g.
    "doc A span1 → doc B span1 → doc A span2", and produces answers that read oddly.
    """
    if not items:
        return []
    doc_order: dict[int, int] = {}
    for idx, r in enumerate(results):
        if r.id not in doc_order:
            doc_order[r.id] = idx
    # Sort key: (doc order, -relevance)
    items.sort(
        key=lambda it: (doc_order.get(it.doc_id, 9999), -it.relevance)
    )
    # Renumber n
    for new_n, it in enumerate(items, 1):
        it.n = new_n
    return items

# ─── Helper: rule-only fallback ─────────────────────────
def _build_rule_only_evidence(
    candidates: list["SearchResult"],
    full_snippets: list[str],
    query: str,
) -> list[EvidenceItem]:
    """Rule-only path used when the LLM fails or times out.

    ⚠ Front-truncation such as `full_snippet[:200]` is forbidden. Always build a
    query-centered window with `_extract_window`. relevance is copied from rerank_score.
    """
    items: list[EvidenceItem] = []
    for i, (c, full) in enumerate(zip(candidates, full_snippets), 1):
        span = _extract_window(full, query, target_chars=200)
        # Normalize (usually already >= SPAN_MIN_CHARS here, but kept as a safety net)
        span, _expanded = _normalize_span(span, full, query)
        items.append(
            EvidenceItem(
                n=i,
                chunk_id=c.chunk_id,
                doc_id=c.id,
                title=c.title,
                section_title=c.section_title,
                span_text=span,
                relevance=float(c.rerank_score or c.score or 0.0),
                rerank_score=float(c.rerank_score or c.score or 0.0),
                full_snippet=full,
                source="rule_fallback",
            )
        )
    return items

# ─── Core: extract_evidence ─────────────────────────────
async def extract_evidence(
    query: str,
    results: list["SearchResult"],
    ai_client: AIClient | None = None,
) -> tuple[list[EvidenceItem], str | None]:
    """reranked results → list of EvidenceItem.

    Returns:
        (items, skip_reason)
        skip_reason ∈ {None, "empty_retrieval", "all_low_rerank", "fast_path",
                       "llm_timeout_fallback_rule", "llm_error_fallback_rule",
                       "parse_failed_fallback_rule", "all_llm_rejected",
                       "sparse_evidence_supplemented"}
        - items may be non-empty even when skip_reason is not None
          (fallback / fast_path / supplement paths).
    """
    if not results:
        return [], "empty_retrieval"
    # ── First-pass rule filter: rerank_score >= EVIDENCE_MIN_RERANK + max_per_doc ──
    candidates: list["SearchResult"] = []
    per_doc: dict[int, int] = {}
    for r in results:
        raw_score = r.rerank_score if r.rerank_score is not None else r.score
        if raw_score is None or raw_score < EVIDENCE_MIN_RERANK:
            continue
        if per_doc.get(r.id, 0) >= MAX_PER_DOC:
            continue
        candidates.append(r)
        per_doc[r.id] = per_doc.get(r.id, 0) + 1
        if len(candidates) >= MAX_EVIDENCE_CANDIDATES:
            break
    if not candidates:
        return [], "all_low_rerank"
    # ── Fast-path (currently disabled) ──────────────────
    if EVIDENCE_FAST_PATH_THRESHOLD is not None:
        # ⚠ Decide on the raw rerank_score, not the display score.
        # r.score after normalize_display_scores is a value rescaled for the
        # frontend, so its distribution can drift. The raw reranker signal is
        # the safe input for the fast-path.
        top_rerank = (
            results[0].rerank_score if results[0].rerank_score is not None else 0.0
        )
        if top_rerank is not None and top_rerank >= EVIDENCE_FAST_PATH_THRESHOLD:
            _block, full_snippets = _build_numbered_candidates(candidates, query)
            items = _build_rule_only_evidence(candidates, full_snippets, query)
            items = _apply_doc_group_ordering(items, results)
            logger.info(
                "evidence fast_path query=%r candidates=%d kept=%d top_rerank=%.2f",
                query[:80], len(candidates), len(items), top_rerank,
            )
            return items, "fast_path"
    # ── Prepare the LLM call ────────────────────────────
    if not EVIDENCE_PROMPT:
        # Prompt not loaded → rule-only
        _block, full_snippets = _build_numbered_candidates(candidates, query)
        items = _build_rule_only_evidence(candidates, full_snippets, query)
        items = _apply_doc_group_ordering(items, results)
        logger.warning(
            "evidence prompt_not_loaded → rule fallback query=%r kept=%d",
            query[:80], len(items),
        )
        return items, "llm_error_fallback_rule"
    block, full_snippets = _build_numbered_candidates(candidates, query)
    prompt = EVIDENCE_PROMPT.replace("{query}", query).replace(
        "{numbered_candidates}", block
    )
    client_owned = False
    if ai_client is None:
        ai_client = AIClient()
        client_owned = True
    t_start = time.perf_counter()
    raw: str | None = None
    llm_error: str | None = None
    try:
        # B-2: evidence extraction uses 4B triage (Ollama handles concurrency), so no MLX gate.
        # primary (26B) is reserved for synthesis and stays protected by the MLX gate.
        async with asyncio.timeout(LLM_TIMEOUT_MS / 1000):
            raw = await ai_client.call_triage(prompt)
    except asyncio.TimeoutError:
        llm_error = "timeout"
    except Exception as exc:
        llm_error = f"llm_error:{type(exc).__name__}"
    finally:
        # Close the client only if this function created it.
        if client_owned:
            try:
                await ai_client.close()
            except Exception:
                pass
    elapsed_ms = (time.perf_counter() - t_start) * 1000
    # ── LLM failure → rule fallback ─────────────────────
    if llm_error is not None:
        items = _build_rule_only_evidence(candidates, full_snippets, query)
        items = _apply_doc_group_ordering(items, results)
        logger.warning(
            "evidence LLM %s → rule fallback query=%r candidates=%d kept=%d elapsed_ms=%.0f",
            llm_error, query[:80], len(candidates), len(items), elapsed_ms,
        )
        skip_reason = (
            "llm_timeout_fallback_rule"
            if llm_error == "timeout"
            else "llm_error_fallback_rule"
        )
        return items, skip_reason
    parsed = parse_json_response(raw or "")
    if not isinstance(parsed, dict) or not isinstance(parsed.get("items"), list):
        items = _build_rule_only_evidence(candidates, full_snippets, query)
        items = _apply_doc_group_ordering(items, results)
        logger.warning(
            "evidence parse_failed → rule fallback query=%r raw=%r elapsed_ms=%.0f",
            query[:80], (raw or "")[:200], elapsed_ms,
        )
        return items, "parse_failed_fallback_rule"
    # ── Parse the LLM output ────────────────────────────
    short_span_expanded = 0
    llm_items: list[EvidenceItem] = []
    for entry in parsed["items"]:
        if not isinstance(entry, dict):
            continue
        try:
            n_raw = int(entry.get("n", 0))
        except (TypeError, ValueError):
            continue
        # n must point at one of the numbered candidates (1-based)
        if n_raw < 1 or n_raw > len(candidates):
            continue
        try:
            relevance = float(entry.get("relevance", 0.0) or 0.0)
        except (TypeError, ValueError):
            relevance = 0.0
        if relevance < MIN_RELEVANCE_KEEP:
            continue
        span_raw = entry.get("span")
        if not isinstance(span_raw, str) or not span_raw.strip():
            continue
        candidate = candidates[n_raw - 1]
        full = full_snippets[n_raw - 1]
        span, expanded = _normalize_span(span_raw, full, query)
        if expanded:
            short_span_expanded += 1
        llm_items.append(
            EvidenceItem(
                n=n_raw,  # reassigned later by doc-group ordering
                chunk_id=candidate.chunk_id,
                doc_id=candidate.id,
                title=candidate.title,
                section_title=candidate.section_title,
                span_text=span,
                relevance=relevance,
                rerank_score=float(
                    candidate.rerank_score
                    if candidate.rerank_score is not None
                    else (candidate.score or 0.0)
                ),
                full_snippet=full,
            )
        )
    # ── LLM rejected everything → rule fallback ─────────
    if not llm_items:
        items = _build_rule_only_evidence(candidates, full_snippets, query)
        items = _apply_doc_group_ordering(items, results)
        logger.warning(
            "evidence all_llm_rejected → rule fallback query=%r elapsed_ms=%.0f",
            query[:80], elapsed_ms,
        )
        return items, "all_llm_rejected"
    # ── Sparse evidence supplement (Phase 3.5b) ────────
    # Closes the dead zone: the LLM kept 1~2 items and there are enough candidates,
    # so top up with rule-only spans.
    supplement_skip = None
    if 0 < len(llm_items) < MIN_EVIDENCE_FOR_SYNTHESIS and len(candidates) >= MIN_EVIDENCE_FOR_SYNTHESIS:
        llm_n_set = {it.n for it in llm_items}
        supplement_count = MIN_EVIDENCE_FOR_SYNTHESIS - len(llm_items)
        supplemented = 0
        # Prepare the substring + critical token filters
        query_tokens = re.findall(r'[가-힣]{2,}|[a-zA-Z]{3,}', query)
        # Korean keywords: condition, criterion, requirement, punishment, penalty,
        # definition, difference, procedure, method, calculation
        _IMPORTANT_SUFFIXES = {"조건", "기준", "요건", "처벌", "벌칙",
                               "정의", "차이", "절차", "방법", "계산"}
        critical_tokens = [
            t for t in query_tokens
            if len(t) >= 3 or any(s in t for s in _IMPORTANT_SUFFIXES)
        ]
        for idx, (c, full) in enumerate(zip(candidates, full_snippets), 1):
            # Skip candidates the LLM already kept, and stop adding once the quota is filled
            if idx in llm_n_set or supplement_count <= 0:
                continue
            span = _extract_window(full, query, target_chars=200)
            span, _ = _normalize_span(span, full, query)
            # substring match (recall)
            has_match = any(qt in span for qt in query_tokens)
            # critical token check (precision)
            has_critical = (
                any(ct in span for ct in critical_tokens)
                if critical_tokens else has_match
            )
            if not (has_match and has_critical):
                continue
            llm_items.append(
                EvidenceItem(
                    n=idx,
                    chunk_id=c.chunk_id,
                    doc_id=c.id,
                    title=c.title,
                    section_title=c.section_title,
                    span_text=span,
                    relevance=float(c.rerank_score or c.score or 0.0) * 0.8,
                    rerank_score=float(c.rerank_score or c.score or 0.0),
                    full_snippet=full,
                    source="supplement",
                )
            )
            supplemented += 1
            supplement_count -= 1
        if supplemented > 0:
            supplement_skip = "sparse_evidence_supplemented"
            logger.info(
                "evidence sparse_supplement query=%r llm_kept=%d supplemented=%d total=%d",
                query[:80], len(llm_items) - supplemented, supplemented, len(llm_items),
            )
    # ── doc-group ordering + renumber n ─────────────────
    llm_items = _apply_doc_group_ordering(llm_items, results)
    logger.info(
        "evidence ok query=%r candidates=%d kept=%d short_span_expanded=%d elapsed_ms=%.0f",
        query[:80], len(candidates), len(llm_items), short_span_expanded, elapsed_ms,
    )
    return llm_items, supplement_skip
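
To make the doc-group ordering rule concrete, here is a small self-contained sketch. `Item` is a hypothetical, trimmed-down stand-in for `EvidenceItem`, and `doc_order_ids` is an assumed stand-in for the doc ids of the reranked results; only the sort key is taken from `_apply_doc_group_ordering`.

```python
from dataclasses import dataclass


@dataclass
class Item:  # hypothetical, minimal stand-in for EvidenceItem
    n: int
    doc_id: int
    relevance: float


def order_by_doc_group(items: list[Item], doc_order_ids: list[int]) -> list[Item]:
    """Keep the search-result doc order, sort by relevance desc within a doc, renumber n."""
    doc_order: dict[int, int] = {}
    for idx, doc_id in enumerate(doc_order_ids):
        doc_order.setdefault(doc_id, idx)  # first occurrence of a doc wins
    ordered = sorted(
        items, key=lambda it: (doc_order.get(it.doc_id, 9999), -it.relevance)
    )
    for new_n, it in enumerate(ordered, 1):
        it.n = new_n
    return ordered


items = [
    Item(n=1, doc_id=20, relevance=0.9),
    Item(n=2, doc_id=10, relevance=0.6),
    Item(n=3, doc_id=10, relevance=0.8),
]
# The search results ranked doc 10 first, then doc 20.
ordered = order_by_doc_group(items, doc_order_ids=[10, 10, 20])
print([(it.n, it.doc_id, it.relevance) for it in ordered])
```

A global relevance sort would put the doc 20 span first; the doc-group sort keeps the two doc 10 spans adjacent and ahead of doc 20, matching the search-result order.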