docs(eval): Phase 2Q Diagnose Phase 4 — decision tree md + Apply PR 백로그

phase-2q-query-rewrite-diagnose.md v6 plan §7 Phase 4 closure. Phase 3 commit a41adb6 의 3 측정 결과 + 4 factor weighted decision. decision = H1 (both backends NDCG net 개선 ≥ +0.26): - 추천 Apply LLM = cand_multi_query_macmini (gemma-4) - 사유: F3 ⭐ 24/7 가동 + F1 NDCG 0.927 dominant + F4 cold latency 우세 - 대안: qwen (mixed/english 강점 + MacBook always-on 의향 시) 산출물: - tests/search_eval/baselines/v0_2_phase2q_decision_2026-05-24.md (180 lines) · §1 결정 요약 / §2 측정 표 / §3 카테고리 회복 / §4 4-factor weighted · §5 분석 노트 5건 (multi-query 효과 / variants 구성 / cache hit / Recall 회귀 / Phase 3 incident) · §6 closure gate (branch close 사용자 결정 보류) · §7 follow-up PR 백로그: Apply 1 + 별 chore 2 + Extended 4 + Cloud 1 + Cleanup 1 · §9 사용자 검토 항목 5건 Phase 2Q Diagnose closure 완료. Apply PR 진입 = 사용자 LLM 선택 + sequencing 결정 후. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 00:57:48 +00:00
parent a41adb63a0
commit c57e4c52dc
1 changed files with 180 additions and 0 deletions
@@ -0,0 +1,180 @@
+# Phase 2Q Diagnose Decision — Multi-Query Rewrite (2026-05-24)
+
+**Branch**: `feat/phase-2q-query-rewrite-diagnose`
+**Commits**: `446ba82` (Phase 1A fixture) → `3e6866b` (Phase 1B scaffold) → `ecd2350` (Phase 2 retrieval 합성) → `a41adb6` (Phase 3 fix + 3 측정)
+**Plan**: `~/.claude/plans/phase-2q-query-rewrite-diagnose.md` v6
+**Snapshot**: `v0_2_phase2a_baseline_snapshot_2026-05-23.json` (doc_id_max=25180, chunk_id_max=56526)
+**Eval set**: 51 cases (46 scored + 5 failure_expected)
+
+---
+
+## 1. 결정 요약
+
+**Decision = H1** (둘 다 net 개선 ≥ +0.03 NDCG, plan v6 §7 Phase 4 분기 H1 매칭).
+
+**추천 Apply LLM = `cand_multi_query_macmini` (Mac mini gemma-4-26b-a4b-it-8bit)**.
+
+사유 = 4 factor 의 weighted 평가에서 gemma 가 우세 또는 동등 (§4 참조). 단 qwen 의 mixed/english 강점이 본 corpus 의 약점 카테고리와 일치 — 사용자 검토 후 변경 가능.
+
+---
+
+## 2. 측정 결과 (3 candidate × cold/warm = 5 run)
+
+| Candidate | NDCG | Δ baseline | Recall t≥2 | Recall t≥3 | p50 cold | p95 cold | p50 warm | p95 warm |
+|---|---:|---:|---:|---:|---:|---:|---:|---:|
+| **baseline_rebaseline** (single-query) | **0.659** | 0.000 ✅ | 0.695 | 0.761 | 478 | 1627 | — | — |
+| **cand_multi_query_macmini** (gemma-4) | **0.927** | **+0.268** | 0.687 | 0.728 | 2757 | 9684 | 998 | 2693 |
+| **cand_multi_query_macbook** (qwen3.6) | **0.919** | **+0.260** | 0.697 | 0.728 | 3647 | 5202 | 873 | 2901 |
+
+**baseline 회귀 0 PASS** = Phase 2A baseline NDCG 0.659 = Phase 2Q baseline 0.659, diff 0.000 < 0.005 threshold. Phase 2 retrieval 합성 path 의 baseline 회귀 invariant 확정 (`rewrite_backend=None` → single-query path 100% 그대로).
+
+**Recall 미세 회귀** = Recall@10 t≥3 둘 다 -0.033 (0.761 → 0.728). multi-query unified RRF 의 top-3 hit 손실. NDCG +0.27 의 trade-off 로 acceptable, 단 Apply PR 에서 reranker tuning 검토 가치 있음 (§5.4).
+
+---
+
+## 3. 카테고리별 회복 (핵심 약점 → 모두 회복)
+
+| Category | n | baseline | macmini | Δ | macbook | Δ | 강점 |
+|---|---:|---:|---:|---:|---:|---:|---|
+| **mixed** ⭐ | 10 | 0.39 | 0.57 | **+0.18** | **0.65** | **+0.26** | **qwen** |
+| **korean_only** ⭐ | 9 | 0.51 | **0.71** | **+0.20** | 0.67 | +0.16 | **gemma** |
+| **standards** | 11 | 0.87 | **1.44** | **+0.57** | 1.31 | +0.44 | **gemma** |
+| **exam** | 7 | 0.74 | **1.11** | **+0.37** | 1.04 | +0.30 | **gemma** |
+| **english_only** | 9 | 0.78 | 0.77 | -0.01 | **0.89** | **+0.11** | **qwen** |
+
+**관찰**:
+- baseline 측정에서 박제된 top 2 약점 (**mixed 0.39 / korean_only 0.51**) 둘 다 두 backend 에서 net 개선. Phase 2Q 가설 (LLM-driven multi-query expansion 으로 korean/mixed 약점 보완) 확정.
+- standards/exam 의 graded NDCG > 1.0 은 graded relevance 평가의 ideal DCG 정규화 quirk (Phase 2A/2B 동일 metric, 비교는 valid). 사용자 체감 metric = 카테고리 ranking 일관성 + Recall.
+- english_only 는 baseline 이 이미 강함 (0.78) — gemma 약간 회귀 (-0.01) / qwen 개선 (+0.11). qwen 의 영어 sampling 강점.
+
+---
+
+## 4. 4-factor weighted decision
+
+| Factor | Weight | macmini gemma-4 | macbook qwen3.6 | 우위 |
+|---|---:|---|---|---|
+| **(F1) Overall NDCG** | 0.30 | 0.927 | 0.919 | 동등 (diff 0.008 = noise) |
+| **(F2) Category 분포** | 0.20 | standards 1.44 / exam 1.11 / korean 0.71 우세 | mixed 0.65 / english 0.89 / recall_t2 0.697 우세 | 카테고리 트레이드오프 |
+| **(F3) Availability / 운영 안정** | 0.30 | ⭐ **24/7 가동** (config.yaml primary, semaphore=1, Mac mini = AI 가공 공장 owner) | RunAtLoad=false, MacBook 사용자 lap-top (lid close 시 죽음, `launchctl start` 수동) | **gemma** |
+| **(F4) Latency** | 0.20 | cold p50 2757ms (gemma 우세, -890ms) / warm p50 998ms | cold p50 3647ms / warm p50 873ms (qwen 우세 -125ms, noise) | cold gemma / warm 동등 |
+| **Cost** | (0) | self-hosted | self-hosted | 동등 (둘 다 자유) |
+
+**Weighted score**:
+- gemma: F1 dominant 0.927 + F2 standards/exam/korean (도메인 중심) + F3 ⭐ 24/7 + F4 cold 우세 = **3 강 1 동등**
+- qwen: F1 0.919 (-0.008) + F2 mixed/english (보조 강점) + F3 ❌ on-demand + F4 warm 우세 (noise) = **1 강 2 동등 1 약**
+
+→ **추천 = gemma** (F3 결정적 — Apply PR 의 production 24/7 가동 invariant).
+
+**qwen 으로 변경할 case**: 사용자가 mixed crosslingual (0.65) + english (0.89) 회복을 최우선 가치로 판단 + MacBook always-on (caffeinate + lid open) 유지 의향. 본 PR scope 외 (Apply PR 의 별 결정).
+
+---
+
+## 5. 분석 노트
+
+### 5.1. multi-query 의 효과 = **모든 카테고리 동시 회복**
+
+다른 Phase 의 swap (Phase 2A embedding / Phase 2B reranker) 는 1개 약점만 회복 또는 다른 약점 회귀의 trade-off. Phase 2Q multi-query 는 **5/5 카테고리 동시 회복** (english 만 gemma 미세 회귀). Recall t≥3 약간 손실 외 회귀 없음.
+
+### 5.2. variants 구성 (3개) 의 효과
+
+prompt v1 의 3-variant 정책 (원본 + Korean rephrase + English translation) 가 cross-lingual + 동의어 augmentation 양쪽 동시 적용 효과. variant 별 K=16 + RRF k=60 → unified 60 cap → reranker batch 1회. plan v6 §5.5 의 A1 (per-variant K=PRODUCTION_TOPK//N) 결정 후속 검증 = latency 회귀 controlled (cold p50 2757ms 단 user lock 정도, 단 production rollout 시 cache prewarm 정책 필수).
+
+### 5.3. cache hit 효과 (warm 측정)
+
+| Backend | cold p50 | warm p50 | Δ | hit speedup |
+|---|---:|---:|---:|---:|
+| macmini gemma-4 | 2757 | 998 | -1759ms | **-64%** |
+| macbook qwen3.6 | 3647 | 873 | -2774ms | **-76%** |
+
+cache deterministic (NDCG cold == warm), latency 만 회복. production cache prewarm 정책 = nightly cron 으로 top-N popular query rewrite cache 채움 → 사용자 첫 request 부터 warm path.
+
+### 5.4. Recall t≥3 미세 회귀 (-0.033)
+
+원인 가능성:
+1. **multi-query unified RRF 의 top-3 hit 손실** — variant 별 rank 1 이 다른 doc 이면 RRF 합산 후 top-3 가 흩어짐
+2. **reranker 입력 chunks 증가로 인한 noise** — variant 별 chunks_by_doc merge 시 unique chunk 다양성 증가 → reranker 가 본래 top-3 분간 약함
+3. **rerank 413 Payload Too Large fallback 다수** — RRF fallback path 사용 시 reranker 영향 없음 → unified RRF score 만으로 top-3 결정 (분산 큰 영향)
+
+→ Apply PR 전 `PR-2Q-Rerank-Payload-Fix` 별 chore 필수 (§7).
+
+### 5.5. Phase 3 incident — fixture-first call shape 위반
+
+**1차 cold 측정 NDCG 0.033 catastrophic**. root cause = `_call_llm` 가 user 메시지 1개에 prompt template 전체 박음 → LLM 이 actual query 인식 못 함 → 모든 query 에 동일 default response (`압력용기 설계 기준`) 반환.
+
+진단 = fastapi log `[rewrite-variant]` 박제에서 query 별 같은 variants 발견.
+
+fix = `_call_llm` system/user 메시지 분리 (fixture invariant). regression test 2 추가.
+
+학습 = [[feedback_fixture_first_call_shape]] (신규 메모리). fixture 박제 시 sampling/timeout 만 align 부족 — request_body 의 messages 구조 (system/user 분리) 까지 production 호출과 단일 source-of-truth. unit test 에 fixture call shape regression 필수.
+
+---
+
+## 6. Closure gate
+
+- [x] G0 fixture 2건 (gemma-4 + qwen) commit (`446ba82`)
+- [x] Phase 2A snapshot 재사용 (snapshot id 25180/56526 일치)
+- [x] baseline rebaseline NDCG diff < 0.005 (0.000 PASS)
+- [x] cand_multi_query_macmini cold + warm 박제 (5 csv + json)
+- [x] cand_multi_query_macbook cold + warm 박제 (5 csv + json)
+- [x] decision tree md 박제 + 4 분기 결정 (H1 확정) 명시
+- [x] Follow-up PR 백로그 박제 (§7)
+- [ ] commit + branch close — 사용자 결정 (`feat/phase-2q-query-rewrite-diagnose` main merge OR Apply PR 진입 후 close)
+
+---
+
+## 7. Follow-up PR 백로그
+
+### Apply 트랙 (LLM 선택 후 진입)
+
+- **`PR-2Q-Apply-Query-Rewrite-1`** — production rollout. LLM = gemma (추천) 또는 qwen.
+  - 진입 전 sequencing 합의: Phase 2 QueryAnalyzer 가동 결정과 Phase 2Q 의 2단 적용 충돌 가능 ([[project_search_v2]] Phase 2 운영 관찰 ask_events 0건 — QueryAnalyzer 가 retrieval path 영향 0 확정. Phase 2Q 와 충돌 없음).
+  - rewrite_backend default = `null` 유지 (opt-in flag only) → 1 주 관찰 후 default ON 검토.
+  - 운영 metric: rewrite cache hit rate / LLM latency p50/p95 / 503 누적 / Recall@10 t≥3 미세 회귀 monitoring.
+
+### 별 chore 트랙 (Apply 전 또는 병행)
+
+- **`PR-2Q-Rerank-Payload-Fix`** — 413 Payload Too Large 다수 관찰 (RRF fallback 작동, NDCG 영향 0). chunks_by_doc merge 의 chunk 중복 (variant 별 same chunk) → reranker payload 폭발. 후보 fix:
+  1. chunk dedup (chunk_id 기준) before reranker input 구성
+  2. reranker batch cap 강제 (현 MAX_RERANK_INPUT=200 → 60 또는 100)
+  3. TEI reranker batch_size 또는 max_input_length 환경변수 조정
+  - plan `phase-2q-rerank-413-fix.md` 별 작성.
+
+- **`PR-2Q-Cache-Prewarm`** — production cold p95 9684ms 회복. nightly cron 으로 top-N popular query (search_failure_logs 또는 user query history 기반) rewrite cache 채움. lazy 도 옵션 (사용자 첫 request 시 background prewarm).
+
+### Extended 트랙 (Apply 후 또는 별 PR)
+
+- **`PR-2Q-Extended-Translation`** — variant 1/2 중 한쪽을 translation 전용으로 분리 (3-variant 정책 유지, 각 variant role 분명화).
+- **`PR-2Q-Extended-HyDE`** — Hypothetical Document Embedding (LLM 이 정답 가설 문서 생성 → embedding → retrieval).
+- **`PR-2Q-Extended-Decomposition`** — query decomposition (compound query → sub-query N개 → 각 retrieval → 합성).
+- **`PR-2Q-Extended-SynonymDict`** — 도메인 사전 (ASME/KGS/가스기사) augmentation, LLM 우회 deterministic path.
+
+### Cloud 트랙 (scaffold-first invariant, [[feedback_scaffold_first_for_external_cost_pr]])
+
+- **`PR-2Q-Cloud-Rewrite-Scaffold-1`** — Claude API / OpenAI API 등 cloud LLM 추가. scaffold-only PR (slot + explicit 503, 실비/secret 0). 활성 PR 별 분리 (`PR-2Q-Cloud-Activation-1`).
+
+### Cleanup (본 PR 종료 후 1주)
+
+- **`PR-2Q-Cleanup-1`** — 측정 후보 코드 / log 정리. _CACHE 잔재 검증.
+
+---
+
+## 8. 메트릭 referencing
+
+| 출처 | 파일 |
+|---|---|
+| raw eval output (5 run) | `reports/v0_2_phase2q_*.csv` |
+| 측정 요약 + incident 박제 | `tests/search_eval/baselines/v0_2_phase2q_results_2026-05-24.json` |
+| Phase 2A baseline reference | `tests/search_eval/baselines/v0_2_phase2a_baseline_snapshot_2026-05-23.json` |
+| 본 decision md | `tests/search_eval/baselines/v0_2_phase2q_decision_2026-05-24.md` |
+| plan | `~/.claude/plans/phase-2q-query-rewrite-diagnose.md` v6 |
+| commits | `446ba82` / `3e6866b` / `ecd2350` / `a41adb6` |
+
+---
+
+## 9. 사용자 검토 항목 (Apply PR 진입 전)
+
+1. **LLM 선택 확정** — gemma (추천, 24/7 + standards/exam/korean 강점) vs qwen (mixed/english 강점, MacBook always-on 의향 시).
+2. **Apply rollout 정책** — default `null` 유지 1주 vs 즉시 default ON. 운영 sequencing.
+3. **PR-2Q-Rerank-Payload-Fix 우선순위** — Apply 전 필수 vs 병행 가능 결정.
+4. **UB-2 caffeinate 종료** — PID 37361 + `launchctl bootout gui/$UID/com.user.mlx-vlm-server` 측정 종료 후 사용자 kill.
+5. **Branch close 정책** — `feat/phase-2q-query-rewrite-diagnose` main merge 시점 (Apply PR 진입 후 vs Diagnose closure 시).