feat(search): Phase 3 Ask pipeline (evidence + synthesis + /api/search/ask)

- llm_gate.py: global MLX single-inference semaphore (shared by analyzer/evidence/synthesis)
- search_pipeline.py: extract run_search(); single source of truth for /search and /ask
- evidence_service.py: rule + LLM span selection (EV-A), doc-group ordering,
  auto-expand too-short spans (<80 chars → 120 chars); fallback forces a query-centered window
- synthesis_service.py: grounded answer + citation validation + LRU cache (1 h TTL / 300 entries),
  refusal handling, span_text-ONLY rule (full_snippet banned from prompts)
- /api/search/ask: 15 s timeout, 9 failure modes + Korean-language no_results_reason
- rerank_service: preserve raw rerank_score (prevents display drift)
- query_analyzer: delegate _get_llm_semaphore to llm_gate.get_mlx_gate
- prompts: evidence_extract.txt, search_synthesis.txt (JSON-only, with examples)

No changes to config.yaml / docker / ollama / infra_inventory.
plan: ~/.claude/plans/quiet-meandering-nova.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyungi Ahn
2026-04-09 07:34:08 +09:00
parent 120db86d74
commit 64322e4f6f
9 changed files with 1698 additions and 258 deletions


@@ -36,6 +36,8 @@ from ai.client import AIClient, _load_prompt, parse_json_response
 from core.config import settings
 from core.utils import setup_logger
+from .llm_gate import get_mlx_gate
+
 logger = setup_logger("query_analyzer")
 # ─── Constants (permanent plan rules) ────────────────────────────────
@@ -67,17 +69,16 @@ _CACHE: dict[str, dict[str, Any]] = {}
 _PENDING: set[asyncio.Task[Any]] = set()
 # Prevent duplicate runs of the same query (set of in-flight queries)
 _INFLIGHT: set[str] = set()
-# MLX concurrency limit (single-inference → 1)
-# Lazy init on first call (after the event loop is ready)
-_LLM_SEMAPHORE: asyncio.Semaphore | None = None
 def _get_llm_semaphore() -> asyncio.Semaphore:
-    """Create a semaphore bound to the current event loop on first call."""
-    global _LLM_SEMAPHORE
-    if _LLM_SEMAPHORE is None:
-        _LLM_SEMAPHORE = asyncio.Semaphore(LLM_CONCURRENCY)
-    return _LLM_SEMAPHORE
+    """Return the MLX single-inference gate. As of Phase 3.1 this delegates to
+    llm_gate.get_mlx_gate(), so analyzer / evidence / synthesis share the same
+    semaphore. The `LLM_CONCURRENCY` constant is kept for backward compatibility
+    and documentation; the actual bound is owned by `llm_gate.MLX_CONCURRENCY`.
+    """
+    return get_mlx_gate()
 def _cache_key(query: str) -> str: