docs(search): Phase 2Q closed as evaluated experiment (deprecated, not recommended for production)

User decision (2026-05-24, after the 4-layer measurement chain correction was completed):

> Phase 2Q Query Rewrite is closed as an evaluated experiment.
> After result-level dedup correction, true net gain was marginal
> (NDCG +0.019, Recall t≥2 +0.030) while latency cost was high
> (cold +876%, warm +320%). Therefore, multi-query rewrite is not
> recommended for default production rollout. Keep opt-in path as
> experimental/deprecated reference only; do not proceed to
> Cache-Prewarm unless future real-query evidence shows a stronger gain.
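
As a sanity check on the latency figures, the sketch below backs out the single-query baseline implied by the p50 numbers quoted in the doc diff further down (3692ms cold, 1588ms warm). It assumes those millisecond values are absolute multi-query latencies rather than deltas; the helper name is hypothetical and not part of the repo.

```python
def implied_baseline_ms(observed_ms, pct_increase):
    """Baseline latency implied by an observed latency and a relative increase in percent."""
    return observed_ms / (1.0 + pct_increase / 100.0)

# p50 figures taken from the doc diff; interpretation as absolute values is an assumption.
cold_baseline = implied_baseline_ms(3692, 876)
warm_baseline = implied_baseline_ms(1588, 320)

print(f"cold implies baseline ~{cold_baseline:.0f} ms")
print(f"warm implies baseline ~{warm_baseline:.0f} ms")
```

Both cases imply roughly the same baseline, which is what one would expect if the two percentages are measured against a single single-query p50.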

Changes:
- docs/phase_2q_apply_opt_in.md: record 🛑 DEPRECATED / EXPERIMENTAL status. Preserve the
  measurement chain correction history (4-layer), the true effect, and the Phase 2Q outcomes.
- app/api/search.py: update the rewrite_backend query param description (⚠️ EXPERIMENTAL/DEPRECATED,
  remove the production recommendation wording, state that only the opt-in experimental reference is kept).

5 actions recorded (user decision):
  1. Keep the opt-in code (recommended=false / experimental)
  2. Mark docs/ as deprecated
  3. Remove the production recommendation from the search.py description
  4. Drop PR-2Q-Cache-Prewarm + PR-2Q-Apply-Default-ON-1
  5. Of the 4 Extended items, keep only SynonymDict (deterministic, bypasses the LLM) as a separate candidate

New feedback memory: [[feedback_measurement_chain_audit]]. When a Diagnose measurement is the basis
for an Apply/rollout decision, every layer (retrieval/fusion/rerank/eval) must be audited. It originates
from the Phase 2Q 4-iteration correction chain (0.927→0.876→0.641→0.663).
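
To make the inflation mechanism concrete, here is a minimal sketch (not the project's eval code). It assumes graded relevance keyed by doc_id and a ranking already mapped from chunks to doc_ids; the function names and toy data are hypothetical. It shows how several chunks of one relevant document can push NDCG up, even above 1.0, unless the ranking is deduplicated by doc_id before scoring.

```python
import math

def dcg(gains):
    """Discounted cumulative gain using the 2^rel - 1 gain form."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, grades, k):
    """NDCG@k where the ideal ranking lists each relevant doc exactly once."""
    gains = [grades.get(d, 0) for d in ranked_doc_ids[:k]]
    ideal_dcg = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def dedup_by_doc(ranked_doc_ids):
    """Keep only the first (highest-ranked) hit per doc_id."""
    seen = set()
    unique = []
    for doc_id in ranked_doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc_id)
    return unique

grades = {"doc_a": 3, "doc_b": 2, "doc_c": 1}
# Three chunks of doc_a occupy the top three slots.
raw = ["doc_a", "doc_a", "doc_a", "doc_x", "doc_b"]

inflated = ndcg_at_k(raw, grades, k=5)
corrected = ndcg_at_k(dedup_by_doc(raw), grades, k=5)
print(f"without dedup: {inflated:.3f}, with doc_id dedup: {corrected:.3f}")
```

In this toy case the undeduplicated score exceeds 1.0, which is an obvious red flag; in real data the inflation is usually smaller and easy to miss, which is why auditing each layer matters.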

Phase 2Q outcomes (a good experiment, not a failure):
- Found chunk_id/doc_id duplicate inflation + established the measurement chain audit pattern
- Reached the conclusion that LLM rewrite has low ROI as the current default for DS search
- Preserved the search_pipeline multi-query composition + 3-layer dedup infrastructure (reusable for
  Extended SynonymDict or a future cloud LLM scaffold)
- 4 new feedback memories: fixture-first-call-shape / apply-prereq-structural-fix /
  graded-ndcg-dedup-invariant / measurement-chain-audit
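
For readers unfamiliar with the multi-query composition mentioned above, the following is a rough sketch of Reciprocal Rank Fusion over per-variant rankings with chunk_id dedup and a candidate cap. It is not the actual search_pipeline code: the function name, the smoothing constant `k=60` (a common RRF default) and the toy rankings are assumptions; only the "chunk_id dedup + cap 60" idea comes from the old parameter description.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, cap=60):
    """Fuse ranked lists of chunk_ids (one per query variant) with RRF.

    k   : RRF smoothing constant (assumed value).
    cap : maximum number of fused candidates passed on to the reranker.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        seen = set()
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in seen:  # count each chunk at most once per variant
                continue
            seen.add(chunk_id)
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk_id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [chunk_id for chunk_id, _ in fused[:cap]]

variants = [
    ["c1", "c2", "c3"],  # original query
    ["c2", "c1", "c4"],  # rewrite 1
    ["c2", "c5", "c1"],  # rewrite 2
]
candidates = rrf_fuse(variants)
print(candidates)
```

Because scores are accumulated per chunk_id, the fused list contains each chunk once, so the reranker runs a single pass over unique candidates. Dedup by doc_id at the result level would be a separate step after reranking.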

Committed directly on main (read-only docs / API description, zero impact on the retrieval path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hyungi
2026-05-24 04:57:11 +00:00
parent 5e480d6d6e
commit c4a40ab18a
2 changed files with 59 additions and 12 deletions
+6 -5
@@ -179,11 +179,12 @@ async def search(
None,
pattern=r"^(baseline|cand_[a-z0-9_]+)$",
description=(
-"Phase 2Q Apply (entered 2026-05-24, opt-in, 1-week observation). slug-based, no silent fallback. "
-"baseline|cand_multi_query_macmini (recommended, gemma-4)|cand_multi_query_macbook (qwen3.6). "
-"unspecified/baseline = single-query path (zero-regression invariant). "
-"After the change: per-variant (N) retrieval+fusion → unified RRF → single reranker pass (chunk_id dedup + cap 60). "
-"docs: docs/phase_2q_apply_opt_in.md"
+"⚠️ EXPERIMENTAL / DEPRECATED (Phase 2Q closed 2026-05-24 as evaluated experiment). "
+"After the result-level dedup correction, net gain is marginal (NDCG +0.019, Recall t≥2 +0.030) "
+"vs a high latency cost (cold +876%, warm +320%). Not recommended for default production rollout. "
+"slug-based, no silent fallback. baseline|cand_multi_query_macmini|cand_multi_query_macbook. "
+"unspecified/baseline = single-query path (zero-regression invariant, recommended default). "
+"Kept only as an opt-in experimental reference; see the closed status in docs/phase_2q_apply_opt_in.md."
),
),
):
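
The "slug-based, no silent fallback" contract in the description can be illustrated with a small standalone sketch. This is not the handler code from app/api/search.py: the `KNOWN_BACKENDS` registry and `resolve_rewrite_backend` helper are hypothetical, and only the regex pattern and the three slugs are taken from the diff above.

```python
import re

# Same pattern as the `rewrite_backend` query parameter in the diff.
SLUG_PATTERN = re.compile(r"^(baseline|cand_[a-z0-9_]+)$")

# Hypothetical registry; the slugs are the ones listed in the description.
KNOWN_BACKENDS = {
    "baseline",
    "cand_multi_query_macmini",
    "cand_multi_query_macbook",
}

def resolve_rewrite_backend(slug):
    """Return the backend slug to use, raising instead of silently falling back."""
    if slug is None:
        return "baseline"  # unspecified means the single-query path
    if not SLUG_PATTERN.fullmatch(slug):
        raise ValueError(f"malformed rewrite_backend slug: {slug!r}")
    if slug not in KNOWN_BACKENDS:
        # A well-formed but unknown slug is an error, not a fallback to baseline.
        raise ValueError(f"unknown rewrite_backend slug: {slug!r}")
    return slug

print(resolve_rewrite_backend(None))
print(resolve_rewrite_backend("cand_multi_query_macmini"))
```

Raising on an unknown slug keeps experiment runs honest: a typo in the flag cannot quietly produce baseline numbers that are then attributed to a candidate backend.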
+53 -7
@@ -1,13 +1,59 @@
-# Phase 2Q Apply: Multi-Query Rewrite (opt-in, entered 2026-05-24)
+# Phase 2Q Multi-Query Rewrite: ⚠️ DEPRECATED / EXPERIMENTAL (closed 2026-05-24)
-## Overview
+## 🛑 Status: closed as evaluated experiment
-Phase 2Q Diagnose results (decision md `tests/search_eval/baselines/v0_2_phase2q_decision_2026-05-24.md`):
-H1 (significant net improvement on both backends) confirmed + Rerank-Payload-Fix (commit `b734fc5`) completed,
-then entered the Apply rollout.
+> Phase 2Q Query Rewrite is closed as an evaluated experiment.
+> After result-level dedup correction, true net gain was marginal
+> (NDCG +0.019, Recall t≥2 +0.030) while latency cost was high
+> (cold +876%, warm +320%). Therefore, multi-query rewrite is not
+> recommended for default production rollout. Keep opt-in path as
+> experimental/deprecated reference only; do not proceed to
+> Cache-Prewarm unless future real-query evidence shows a stronger gain.
-**Rollout policy = opt-in 1-week observation** (2026-05-24 ~ 2026-05-31). If metrics look normal after 1 week,
-decide on switching to default ON (separate PR `PR-2Q-Apply-Default-ON-1`).
+**The opt-in flag `?rewrite_backend=cand_multi_query_macmini` is kept in code (experimental reference)**.
+**Not recommended for production default rollout**. PR-2Q-Cache-Prewarm / PR-2Q-Apply-Default-ON-1
+are dropped. Of the Extended track, only SynonymDict (deterministic, bypasses the LLM) is kept as a separate candidate.
+## Overview (historical record)
+Phase 2Q Diagnose results confirmed H1 (significant net improvement on both backends), and after Rerank-Payload-Fix
+completed, Apply opt-in was entered (commit `fef5ddc`). **However, after multi-layer inflation was found in the
+measurement chain, the decision based on the corrected values = closed as experiment.**
+## Measurement correction history (all inflation corrected)
+| Layer | commit | NDCG | inflation cause |
+|---|---|---:|---|
+| Phase 3 | `a41adb6` | 0.927 | accumulated chunk_id duplicates |
+| Rerank-Fix | `b734fc5` | 0.876 | residual doc_id duplicates (chunk dedup only) |
+| Eval-Dedup | `3553573` | 0.641 | dedup at the eval layer only |
+| **Result-Dedup (final)** | **`5e480d6`** | **0.663** | ✅ dedup audit clean (0/51) |
+**True multi-query effect** (vs baseline 0.644):
+- NDCG cold +0.019 / warm +0.015 ← sub-noise
+- Recall t≥2 cold +0.030 / warm +0.022 ← small improvement
+- Recall t≥3 0.000 (cold) / -0.022 (warm) ← on par to slight regression
+- **latency p50 cold +876% (3692ms) / warm +320% (1588ms)** ← clear cost
+- Categories: english/standards/mixed slightly ahead / exam/korean slightly behind
+**The marginal quality gain from multi-query does not justify the latency cost + system complexity + LLM dependency**.
+## Recommendation (user decision 2026-05-24)
+**Phase 2Q itself is a good experiment, not a failure**. Outcomes:
+- Found chunk_id duplicate inflation (Phase 3 → Rerank-Fix)
+- Sorted out the doc_id / result dedup issues (Eval-Dedup → Result-Dedup)
+- Quantified the actual effect of multi-query (NDCG +0.019)
+- Reached the conclusion that "LLM rewrite has low ROI as the current default for DS search"
+- 3 new feedback memories (fixture-first call shape / apply prereq structural fix /
+graded NDCG dedup invariant)
+**The feature itself is deprecated; the lessons and infrastructure are preserved**.
+## ~~Rollout policy~~ (historical record)
+Previous decision: opt-in 1-week observation until 2026-05-31 → consider default ON.
+**Corrected decision (2026-05-24)**: closed as evaluated experiment, default ON will not proceed.
**Recommended LLM = `cand_multi_query_macmini` (gemma-4-26b-a4b-it-8bit, Mac mini)**.
4-factor weighted rationale (decision md §4):