docs(search): passage-RAG go/no-go = NO-GO (hier evidence equivalent, diagnose c4+c5)

PR-DocSrv-Hier-PassageRAG-Diagnose-1 c4+c5. Blind pairwise judging on the conditional subset, N=12 (retrieval controlled), using a hypothesis-blind subagent and an anonymized 3-file split. Result: 4-way convergence on equivalence: pairwise prehier 4 / hier 3 / tie 5 (no edge) + axis scores within ±0.08 + same objective signals (halluc 36/36) + variance ~0 (byte-identical regeneration). No verbosity artifact (prehier answers were longer but won only one more pair). => NO-GO: hier-leaf evidence yields no gain. hier leaf is now fully confirmed as section-outline UI only (UI yes / doc-search NO-GO / passage-RAG NO-GO; all three areas closed). Input to the 2026-06-21 freeze only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 07:02:46 +00:00
parent 9c039139ef
commit cd33ded7a8
6 changed files with 246 additions and 0 deletions
@@ -0,0 +1,37 @@
+# Passage-RAG go/no-go — Decision (PR-DocSrv-Hier-PassageRAG-Diagnose-1 c5)
+
+**Measured** 2026-05-25 | branch `feat/hier-passagerag-diagnose` | Decision = 🔴 **NO-GO** (hier evidence: no gain, equivalent)
+
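+## Question
+Follow-up left open by the Replace-Diagnose (doc-search) NO-GO: in `/ask` grounded synthesis, does hier-leaf evidence (short, precise sections) produce **better answers** than legacy window evidence? (Hypothesis: the "shortness" of sections was poison for doc search but might be medicine for passage-RAG.)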
+
+## Decision rule (N-adaptive) applied
+Conditional **N=12 → ≥12 tier**: the pairwise directional signal is valid and is reinforced by the objective signals. (Not INCONCLUSIVE: N is sufficient.)
+
+## Results (details in passage_rag_judge_2026-05-25.md)
+| Dimension | prehier | hier_sim_clean | Verdict |
+|---|---|---|---|
+| pairwise (N=12) | 4 wins | 3 wins (5 ties) | no edge (noise) |
+| axis faith/correct/complete | 3.00/2.92/2.75 | 2.92/3.00/2.67 | equivalent (±0.08, i.e. 1 point over 12 pairs) |
+| objective (halluc/grounding/complete) | 36 / 8·12 / 0full | 36 / 8·11 / 1full | same (differences of at most 1) |
+| run-to-run variance | ~0 (byte-identical regeneration) | ~0 | single sample is reliable |
+
+**4-way convergence = equivalent.** No verbosity artifact (prehier answers were longer but won only one more pair). The hier hypothesis did not materialize in answer quality.
+
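+## Decision: 🔴 NO-GO
+**Do not adopt hier-leaf as the `/ask` evidence unit, and do not augment evidence with it.** With retrieval controlled, answer quality shows no difference, so section-level evidence adds no value. In addition, end to end, hier's retrieval deficit (missed targets on exam_005/006 and cl_007, consistent with the doc-search NO-GO) counts against it, so there is no basis for adopting hier in passage-RAG either. No `PR-...-PassageRAG-Apply`-style follow-up will be opened.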
+
+## Fate of hier leaf = section-outline UI only, **fully confirmed**
+All three evaluation areas are closed:
+- **UI (section-outline)**: ✅ valid (shipped, `PR-DocSrv-Hier-Section-UI-1`)
+- **doc-level search corpus replacement**: 🔴 NO-GO (Replace-Diagnose, −0.074)
+- **passage-RAG evidence unit**: 🔴 NO-GO (this PR; equivalent, no gain)
+→ the 12,697 hier leaves **stay permanently at in_corpus=false, limited to UI material**. They serve neither search nor RAG.
+
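+## Limitations (on record)
+- judge = Claude (Opus) subagent, mitigated by a hypothesis-blind separate session, label blinding, shuffling, verbosity blocking, and cross-checking against objective signals. This is not a fully objective metric like NDCG (but the objective signals reach the same conclusion as the pairwise judging, which adds confidence).
+- Conditional N=12 (small spike). generator=gemma fixed, temp ~0.3 (variance ~0 measured). Single cached generation per Q/variant.
+- The measurement covers passage quality only, after controlling for doc search. (Small differences with another generator or a larger N cannot be ruled out, but this spike's 4-way convergence is sufficient for "no significant gain".)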
+
+## Product input
+Bounded diagnostic signal for the 2026-06-21 Ask pipeline freeze decision: **"hier does not improve Ask quality"**. hier is not a card to play when justifying Ask investment. This is not a product rollout decision ([[document-server-2026-05-21]]).
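The "no edge (noise)" reading of the pairwise row in the results table can be sanity-checked with an exact two-sided sign test on the non-tied pairs. This is a minimal sketch and not part of the commit: `sign_test_p` is a hypothetical helper name, dropping ties is an assumed convention, and only the 4 vs 3 win counts come from the table.

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value under H0: each non-tied pair is a fair coin flip."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # One-sided tail P(X >= k) for X ~ Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Win counts from the results table: prehier 4, hier_sim_clean 3 (5 ties dropped).
p_value = sign_test_p(4, 3)
print(f"p = {p_value:.2f}")
```

With seven non-tied pairs, a 4 vs 3 split is the most balanced outcome possible, so the test gives no evidence of a directional preference.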
@@ -0,0 +1,72 @@
+#!/usr/bin/env python3
+"""c4: conditional subset + objective signals + anonymized 3-file split."""
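+import json
+import os
+import random
+
+os.chdir(os.path.expanduser("~/Documents/code/hyungi_Document_Server"))
+# Load one capture record per JSONL line, skipping blank lines.
+with open("reports/passage_rag_capture_2026-05-25.jsonl", encoding="utf-8") as f:
+    recs = [json.loads(line) for line in f if line.strip()]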
+by = {}
+for r in recs:
+    by.setdefault(r["q_id"], {})[r["variant"]] = r
+
+def nonempty(r):
+    return (r.get("answer_len_chars") or 0) > 0 and r.get("synthesis_status") == "completed"
+
+# conditional subset: both variants retrieved a target-g2 doc AND both produced an answer
+cond, excluded = [], []
+for qid, vs in by.items():
+    p, h = vs.get("prehier"), vs.get("hier_sim_clean")
+    if not p or not h:
+        excluded.append((qid, "missing variant")); continue
+    if not (p["target_doc_present"] and h["target_doc_present"]):
+        excluded.append((qid, f"tgt_present p={p['target_doc_present']} h={h['target_doc_present']}")); continue
+    if not (nonempty(p) and nonempty(h)):
+        excluded.append((qid, f"empty/skip p={p.get('answer_len_chars')}/{p.get('synthesis_status')} h={h.get('answer_len_chars')}/{h.get('synthesis_status')}")); continue
+    cond.append(qid)
+
+print(f"=== CONDITIONAL SUBSET (both tgt_present + non-empty) N={len(cond)} ===")
+print("  ", sorted(cond))
+print(f"=== EXCLUDED {len(excluded)} ===")
+for qid, why in sorted(excluded): print(f"   {qid}: {why}")
+
+def halluc(r): return len((r.get("debug") or {}).get("hallucination_flags") or [])
+def grounding_weak(r):
+    g = ((r.get("debug") or {}).get("defense_layers") or {}).get("grounding") or {}
+    return len(g.get("weak") or [])
+def grounding_strong(r):
+    g = ((r.get("debug") or {}).get("defense_layers") or {}).get("grounding") or {}
+    return len(g.get("strong") or [])
+
+print(f"\n=== OBJECTIVE SIGNALS on conditional subset (N={len(cond)}) ===")
+for v in ["prehier", "hier_sim_clean"]:
+    rs = [by[q][v] for q in cond]
+    print(f"  {v}: halluc_flags={sum(halluc(r) for r in rs)} "
+          f"grounding_weak={sum(grounding_weak(r) for r in rs)} "
+          f"grounding_strong={sum(grounding_strong(r) for r in rs)} "
+          f"avg_answer_len={sum(r['answer_len_chars'] for r in rs)//len(rs)} "
+          f"completeness={[r.get('completeness') for r in rs].count('full')}full/"
+          f"{[r.get('completeness') for r in rs].count('partial')}part/"
+          f"{[r.get('completeness') for r in rs].count('insufficient')}insuf "
+          f"refused={sum(1 for r in rs if r.get('refused'))}")
+
+# anonymized 3-file split (conditional only)
+def spans(r):
+    """Return the non-empty evidence span texts of a record."""
+    return [e.get("span_text") for e in (r.get("evidence") or []) if e.get("span_text")]
+
+rng = random.Random(42)  # fixed seed so the A/B assignment is reproducible
+pairs, key = [], {}
+for i, qid in enumerate(sorted(cond)):
+    p, h = by[qid]["prehier"], by[qid]["hier_sim_clean"]
+    swap = rng.random() < 0.5
+    a, b = (p, h) if not swap else (h, p)
+    pid = f"pair_{i+1:02d}"
+    pairs.append({
+        "pair_id": pid,
+        "question": p["query"],
+        "answer_A": a["ai_answer"], "evidence_A": spans(a),
+        "answer_B": b["ai_answer"], "evidence_B": spans(b),
+    })
+    key[pid] = {"q_id": qid, "A": a["variant"], "B": b["variant"]}
+
+with open("reports/passage_rag_judge_pairs_2026-05-25.jsonl", "w", encoding="utf-8") as f:
+    for pr in pairs: f.write(json.dumps(pr, ensure_ascii=False) + "\n")
+with open("reports/passage_rag_judge_key_2026-05-25.json", "w", encoding="utf-8") as f:
+    json.dump(key, f, ensure_ascii=False, indent=2)
+print(f"\nwrote {len(pairs)} anonymized pairs → passage_rag_judge_pairs_2026-05-25.jsonl")
+print("wrote key → passage_rag_judge_key_2026-05-25.json (not given to the judge)")
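Since the judge only sees the pairs file, it can be useful to check that no record leaks a variant label before handing the file over. The sketch below is an illustration under stated assumptions and is not part of the commit: `find_leaks` and the `FORBIDDEN` token list are hypothetical, and only the six field names and the two variant names are taken from the script above.

```python
import json

# Field names written by the c4 script; any other key is treated as a potential leak.
ALLOWED_KEYS = {"pair_id", "question", "answer_A", "evidence_A", "answer_B", "evidence_B"}
# Hypothetical list of identifying tokens: the two variant names used in the capture.
FORBIDDEN = ("prehier", "hier_sim_clean")

def find_leaks(pair: dict) -> list[str]:
    """Return human-readable problems that could unblind the judge for one pair record."""
    problems = [f"unexpected key: {k}" for k in sorted(set(pair) - ALLOWED_KEYS)]
    # Serialize the whole record so keys, answers, and evidence spans are all scanned.
    blob = json.dumps(pair, ensure_ascii=False)
    problems += [f"variant name in text: {tok}" for tok in FORBIDDEN if tok in blob]
    return problems

# Toy records for illustration only (not real capture data).
clean = {"pair_id": "pair_01", "question": "q", "answer_A": "a", "evidence_A": [],
         "answer_B": "b", "evidence_B": []}
leaky = dict(clean, variant="prehier")
print(find_leaks(clean))
print(find_leaks(leaky))
```

A substring scan like this would also flag an answer that legitimately mentions a variant name, so any hit needs a manual look rather than automatic rejection.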
@@ -0,0 +1,32 @@
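+import json
+import os
+
+os.chdir(os.path.expanduser("~/Documents/code/hyungi_Document_Server"))
+# De-anonymization key: pair_id -> {"q_id", "A": variant, "B": variant}
+with open("reports/passage_rag_judge_key_2026-05-25.json", encoding="utf-8") as f:
+    key = json.load(f)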
+verdicts = {
+ "pair_01":"B","pair_02":"B","pair_03":"B","pair_04":"tie","pair_05":"B","pair_06":"tie",
+ "pair_07":"tie","pair_08":"tie","pair_09":"B","pair_10":"tie","pair_11":"B","pair_12":"B"}
+# per-axis judge scores per pair, ordered (faith_A, correct_A, complete_A, faith_B, correct_B, complete_B); averaged by variant below
+scores = {
+ "pair_01":(3,3,3,3,3,3),"pair_02":(2,3,2,3,3,3),"pair_03":(3,3,2,3,3,3),"pair_04":(3,3,3,3,3,3),
+ "pair_05":(3,2,2,3,3,3),"pair_06":(3,3,3,3,3,3),"pair_07":(3,3,2,3,3,2),"pair_08":(3,3,3,3,3,3),
+ "pair_09":(3,3,3,3,3,3),"pair_10":(3,3,3,3,3,3),"pair_11":(3,3,2,3,3,3),"pair_12":(3,3,2,3,3,3)}
+win = {"prehier":0,"hier_sim_clean":0,"tie":0}
+axis = {"prehier":[0,0,0],"hier_sim_clean":[0,0,0]}; n=0
+print("pair  q_id        winner_variant")
+for pid,k in key.items():
+    v=verdicts[pid]
+    wv = "tie" if v=="tie" else k[v]
+    win[wv if wv in win else "tie"]+=1
+    print(f"{pid} {k['q_id']:10}  {('TIE' if v=='tie' else wv)}  (A={k['A']},B={k['B']})")
+    fa,ca,coa,fb,cb,cob = scores[pid]  # A-slot scores first, then B-slot
+    for slot,sc in ((k['A'],(fa,ca,coa)),(k['B'],(fb,cb,cob))):
+        axis[slot][0]+=sc[0]; axis[slot][1]+=sc[1]; axis[slot][2]+=sc[2]
+    n+=1
+print(f"\n=== PAIRWISE (N={n}) ===")
+print(f"  hier_sim_clean wins: {win['hier_sim_clean']}")
+print(f"  prehier wins:        {win['prehier']}")
+print(f"  tie:                 {win['tie']}")
+print(f"\n=== AXIS AVG (faithfulness/correctness/completeness, N={n}) ===")
+for v in ["prehier","hier_sim_clean"]:
+    f,c,co = axis[v]
+    print(f"  {v}: faith={f/n:.2f} correct={c/n:.2f} complete={co/n:.2f}")
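Because the A/B slots were shuffled, the same verdicts can also be tallied by slot rather than by variant, which is a cheap way to see how the judge's picks are distributed across positions. This is a minimal sketch and not part of the commit; it assumes nothing beyond the `verdicts` values hard-coded in the script above, copied here as a list so the block runs on its own.

```python
from collections import Counter

# Verdicts copied from the tally script above (pair_01 .. pair_12, in order).
verdicts = ["B", "B", "B", "tie", "B", "tie", "tie", "tie", "B", "tie", "B", "B"]

# Count picks per slot label, independent of which variant sat in that slot.
slot_counts = Counter(verdicts)
for slot in ("A", "B", "tie"):
    print(f"{slot}: {slot_counts[slot]}")
```

The slot tally is independent of the key file, so it can be inspected without de-anonymizing; mapping slots back to variants is what the key-based loop above does.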