Files
hyungi_document_server/app/services/search
hyungi a41adb63a0 fix(search): Phase 2Q variants bug fix + Phase 3 3 measurement 박제
Phase 3 cold 측정 1차에서 NDCG 0.033 catastrophic 발견 — 모든 query 에 동일 variants
반환. root cause = _call_llm 이 user 메시지 1개에 prompt template 전체 박음. LLM 이
actual query 인식 못 함. fixture request_body 형식 (system=prompt / user=query) 과
mismatch. fixture-first invariant 위반.

fix:
- app/services/search/query_rewriter.py _call_llm — system/user 메시지 분리.
  fixture request_body 와 단일 source-of-truth. _render_prompt 는 [deprecated] 유지.
- tests/test_query_rewriter.py — Phase 3 regression test 2:
  · _call_llm 가 system + user 분리 호출 verify (httpx.AsyncClient monkeypatch)
  · qwen backend = response_format 미사용 verify
- 32/32 unit test PASS.

Phase 3 측정 (fix 후 재측정, 51 case × 3 candidate × cold/warm = 5 run):
- baseline_rebaseline (rewrite_backend=null): NDCG 0.659 = Phase 2A 0.659, diff 0.000 PASS
- cand_multi_query_macmini cold: NDCG 0.927 (Δ +0.268), p50 2757ms / p95 9684ms
- cand_multi_query_macmini warm: NDCG 0.927 동일, p50 998ms (cache hit -64%)
- cand_multi_query_macbook cold: NDCG 0.919 (Δ +0.260), p50 3647ms / p95 5202ms
- cand_multi_query_macbook warm: NDCG 0.919 동일, p50 873ms (cache hit -76%)

핵심 약점 회복 (gemma / qwen):
- mixed 0.39 → 0.57 / 0.65
- korean_only 0.51 → 0.71 / 0.67
- standards 0.87 → 1.44 / 1.31
- exam 0.74 → 1.11 / 1.04

decision = H1 (both backends 유의미 net 개선). LLM 선택 = Phase 4 decision md 별 step.

산출물:
- reports/v0_2_phase2q_*.csv (5 raw run_eval output)
- tests/search_eval/baselines/v0_2_phase2q_results_2026-05-24.json (요약 + incident 박제)

follow-up:
- rerank 413 Payload Too Large 다수 관찰 (RRF fallback 작동, NDCG 영향 없음). Apply PR 전 별 chore — chunk dedup 또는 reranker batch cap 검토.
- p95 cold 9684ms 매우 큼. production rollout 시 cache prewarm 정책 필수.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 00:51:56 +00:00
..