docs(search): passage-RAG go/no-go = NO-GO (hier evidence equivalent, diagnose c4+c5)
PR-DocSrv-Hier-PassageRAG-Diagnose-1 c4+c5. Blind pairwise comparison on the conditional subset, N=12 (retrieval controlled), judged by a hypothesis-blind subagent over an anonymized 3-file split. Four signals converge on equivalence: pairwise prehier 4 / hier 3 / tie 5 (no edge), axis means within ±0.08, identical objective signals (hallucination flags 36/36), and run-to-run variance ~0 (byte-identical regeneration). No verbosity artifact (prehier answers were longer, yet prehier leads by only one win). => NO-GO: hier-leaf evidence yields no gain. hier leaf is now fully confirmed as section-outline UI only (UI yes / doc-search NO-GO / passage-RAG NO-GO; all three areas closed). Input to the 2026-06-21 freeze only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,37 @@
# Passage-RAG go/no-go: Decision (PR-DocSrv-Hier-PassageRAG-Diagnose-1 c5)

**Measured** 2026-05-25 | branch `feat/hier-passagerag-diagnose` | Decision = 🔴 **NO-GO** (hier evidence yields no gain; equivalent)

## Question
Follow-up left open by the Replace-Diagnose (doc-search) NO-GO: in `/ask` grounded synthesis, does hier-leaf evidence (short, precise sections) produce **better answers** than legacy window evidence? (Hypothesis: the "shortness" of sections was poison for doc-level search but might be medicine for passage-RAG.)

## Decision rule (N-adaptive) applied
Conditional **N=12 → ≥12 tier**: the pairwise directional signal is valid and is reinforced by the objective signals. (Not INCONCLUSIVE: N is sufficient.)

## Results (details in passage_rag_judge_2026-05-25.md)
| Dimension | prehier | hier_sim_clean | Verdict |
|---|---|---|---|
| pairwise (N=12) | 4 wins | 3 wins (5 ties) | no edge (noise) |
| axis faith/correct/complete | 3.00/2.92/2.75 | 2.92/3.00/2.67 | equivalent (±0.08) |
| objective (halluc/grounding/complete) | 36 / 8·12 / 0full | 36 / 8·11 / 1full | same |
| run-to-run variance | ~0 (byte-identical regeneration) | ~0 | single sample is trustworthy |

**4-way convergence = equivalent.** No verbosity artifact (prehier answers were longer, yet prehier leads by only one win). The hier hypothesis did not materialize in answer quality.
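
One way to sanity-check the "no edge (noise)" reading of the pairwise row is an exact two-sided sign test on the decisive pairs, with ties dropped. This sketch is not part of the original analysis: the choice of a sign test and the name `sign_test_p` are assumptions made for illustration. Only the win counts (4 vs 3) come from the table.

```python
from math import comb

def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value under H0: each decisive pair is a fair coin."""
    n = wins_a + wins_b  # ties are excluded before calling
    if n == 0:
        return 1.0
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = pmf[wins_a]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Win counts from the results table: prehier 4, hier_sim_clean 3 (5 ties dropped).
p_value = sign_test_p(4, 3)
print(f"p = {p_value:.3f}")
```

With 4 vs 3 over 7 decisive pairs the split is as balanced as an odd count allows, so the test returns p = 1.0, which is consistent with treating the one-win gap as noise.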
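
## Decision: 🔴 NO-GO
**hier-leaf will not be adopted as, or used to augment, the `/ask` evidence unit.** With retrieval controlled, answer quality shows no difference, so section-level evidence adds zero value. On top of that, end to end, hier's retrieval disadvantage (missed targets on exam_005/006 and cl_007, consistent with the doc-search NO-GO) compounds, leaving no basis for adopting hier in passage-RAG either. No follow-up of the `PR-...-PassageRAG-Apply` kind will be opened.
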
## Fate of hier leaf = section-outline UI only, **fully confirmed**
All three evaluation areas are closed:
- **UI (section-outline)**: ✅ valid (deployed, `PR-DocSrv-Hier-Section-UI-1`)
- **doc-level search corpus replacement**: 🔴 NO-GO (Replace-Diagnose, −0.074)
- **passage-RAG evidence unit**: 🔴 NO-GO (this PR; equivalent, no gain)
→ The 12,697 hier leaves **remain permanently at in_corpus=false, restricted to UI material**. They serve neither search nor RAG.

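## Limitations on record
- judge = Claude (Opus) subagent. Mitigated by hypothesis-blind session separation, label blinding, shuffling, verbosity blocking, and cross-checking against objective signals. This is not a fully objective metric like NDCG (though the objective signals reach the same conclusion as the pairwise verdicts, which strengthens confidence).
- Conditional N=12 (a small spike). generator=gemma fixed, temp~0.3 (variance ~0 as measured). Single cached generation per question/variant.
- Scope = passage quality only, with doc-search controlled. (Small differences under another generator or a larger N cannot be ruled out, but this spike's 4-way convergence is sufficient to conclude "no significant gain".)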
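
The small-N caveat also bounds how fine an axis difference can be. Assuming each axis is scored on an integer scale (the score tuples in the aggregation script contain only whole numbers), a mean over N=12 moves in steps of 1/12 ≈ 0.083, so a ±0.08 gap corresponds to roughly one rating point in total. The sketch below back-computes that from the rounded means in the results table; the variable names are illustrative and not taken from the repository.

```python
N = 12  # conditional subset size

# Axis means (prehier, hier_sim_clean) as reported, rounded to two decimals.
reported = {
    "faith":    (3.00, 2.92),
    "correct":  (2.92, 3.00),
    "complete": (2.75, 2.67),
}

# Convert each mean gap back into total rating points across the N pairs.
point_gap = {axis: round((pre - hier) * N) for axis, (pre, hier) in reported.items()}
print(f"step = {1 / N:.3f}", point_gap)
```

A positive value means prehier scored higher on that axis, a negative value means hier_sim_clean did. Each gap comes out to a single point, and the signs do not all agree, which is what "equivalent" means at this sample size.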

## Product input
Bounded diagnostic signal for the 2026-06-21 Ask pipeline freeze decision: **"hier does not improve Ask quality"**. hier is not a card to play when justifying Ask investment. This is not a product rollout decision ([[document-server-2026-05-21]]).
@@ -0,0 +1,72 @@
#!/usr/bin/env python3
"""c4: conditional subset + objective signals + anonymized 3-file split."""
import json, random, os
os.chdir(os.path.expanduser("~/Documents/code/hyungi_Document_Server"))
recs = [json.loads(l) for l in open("reports/passage_rag_capture_2026-05-25.jsonl")]
by = {}
for r in recs:
    by.setdefault(r["q_id"], {})[r["variant"]] = r

def nonempty(r):
    return (r.get("answer_len_chars") or 0) > 0 and r.get("synthesis_status") == "completed"

# conditional subset: both variants retrieved a target-g2 doc AND both produced an answer
cond, excluded = [], []
for qid, vs in by.items():
    p, h = vs.get("prehier"), vs.get("hier_sim_clean")
    if not p or not h:
        excluded.append((qid, "missing variant")); continue
    if not (p["target_doc_present"] and h["target_doc_present"]):
        excluded.append((qid, f"tgt_present p={p['target_doc_present']} h={h['target_doc_present']}")); continue
    if not (nonempty(p) and nonempty(h)):
        excluded.append((qid, f"empty/skip p={p.get('answer_len_chars')}/{p.get('synthesis_status')} h={h.get('answer_len_chars')}/{h.get('synthesis_status')}")); continue
    cond.append(qid)

print(f"=== CONDITIONAL SUBSET (both tgt_present + non-empty) N={len(cond)} ===")
print(" ", sorted(cond))
print(f"=== EXCLUDED {len(excluded)} ===")
for qid, why in sorted(excluded): print(f" {qid}: {why}")

def halluc(r): return len((r.get("debug") or {}).get("hallucination_flags") or [])
def grounding_weak(r):
    g = ((r.get("debug") or {}).get("defense_layers") or {}).get("grounding") or {}
    return len(g.get("weak") or [])
def grounding_strong(r):
    g = ((r.get("debug") or {}).get("defense_layers") or {}).get("grounding") or {}
    return len(g.get("strong") or [])

print(f"\n=== OBJECTIVE SIGNALS on conditional subset (N={len(cond)}) ===")
for v in ["prehier", "hier_sim_clean"]:
    rs = [by[q][v] for q in cond]
    print(f" {v}: halluc_flags={sum(halluc(r) for r in rs)} "
          f"grounding_weak={sum(grounding_weak(r) for r in rs)} "
          f"grounding_strong={sum(grounding_strong(r) for r in rs)} "
          f"avg_answer_len={sum(r['answer_len_chars'] for r in rs)//len(rs)} "
          f"completeness={[r.get('completeness') for r in rs].count('full')}full/"
          f"{[r.get('completeness') for r in rs].count('partial')}part/"
          f"{[r.get('completeness') for r in rs].count('insufficient')}insuf "
          f"refused={sum(1 for r in rs if r.get('refused'))}")

# anonymized 3-file split (conditional only)
rng = random.Random(42)
pairs, key = [], {}
for i, qid in enumerate(sorted(cond)):
    p, h = by[qid]["prehier"], by[qid]["hier_sim_clean"]
    swap = rng.random() < 0.5
    a, b = (p, h) if not swap else (h, p)
    pid = f"pair_{i+1:02d}"
    def spans(r): return [e.get("span_text") for e in (r.get("evidence") or []) if e.get("span_text")]
    pairs.append({
        "pair_id": pid,
        "question": p["query"],
        "answer_A": a["ai_answer"], "evidence_A": spans(a),
        "answer_B": b["ai_answer"], "evidence_B": spans(b),
    })
    key[pid] = {"q_id": qid, "A": a["variant"], "B": b["variant"]}

with open("reports/passage_rag_judge_pairs_2026-05-25.jsonl", "w") as f:
    for pr in pairs: f.write(json.dumps(pr, ensure_ascii=False) + "\n")
with open("reports/passage_rag_judge_key_2026-05-25.json", "w") as f:
    json.dump(key, f, ensure_ascii=False, indent=2)
print(f"\nwrote {len(pairs)} anonymized pairs → passage_rag_judge_pairs_2026-05-25.jsonl")
print("wrote key → passage_rag_judge_key_2026-05-25.json (not given to the judge)")
@@ -0,0 +1,32 @@
import json, os
os.chdir(os.path.expanduser("~/Documents/code/hyungi_Document_Server"))
key = json.load(open("reports/passage_rag_judge_key_2026-05-25.json"))
verdicts = {
    "pair_01":"B","pair_02":"B","pair_03":"B","pair_04":"tie","pair_05":"B","pair_06":"tie",
    "pair_07":"tie","pair_08":"tie","pair_09":"B","pair_10":"tie","pair_11":"B","pair_12":"B"}
# also per-axis scores for faithfulness/correctness/completeness avg by variant
scores = {
    "pair_01":(3,3,3,3,3,3),"pair_02":(2,3,2,3,3,3),"pair_03":(3,3,2,3,3,3),"pair_04":(3,3,3,3,3,3),
    "pair_05":(3,2,2,3,3,3),"pair_06":(3,3,3,3,3,3),"pair_07":(3,3,2,3,3,2),"pair_08":(3,3,3,3,3,3),
    "pair_09":(3,3,3,3,3,3),"pair_10":(3,3,3,3,3,3),"pair_11":(3,3,2,3,3,3),"pair_12":(3,3,2,3,3,3)}
win = {"prehier":0,"hier_sim_clean":0,"tie":0}
axis = {"prehier":[0,0,0],"hier_sim_clean":[0,0,0]}; n=0
print("pair q_id winner_variant")
for pid,k in key.items():
    v=verdicts[pid]
    wv = "tie" if v=="tie" else k[v]
    win[wv if wv in win else "tie"]+=1
    print(f"{pid} {k['q_id']:10} {('TIE' if v=='tie' else wv)} (A={k['A']},B={k['B']})")
    fa,ca,coa,fb,cb,cob = scores[pid]
    sA = {"f":fa,"c":ca,"co":coa}; sB={"f":fb,"c":cb,"co":cob}
    for slot,sc in ((k['A'],(fa,ca,coa)),(k['B'],(fb,cb,cob))):
        axis[slot][0]+=sc[0]; axis[slot][1]+=sc[1]; axis[slot][2]+=sc[2]
    n+=1
print(f"\n=== PAIRWISE (N={n}) ===")
print(f" hier_sim_clean wins: {win['hier_sim_clean']}")
print(f" prehier wins: {win['prehier']}")
print(f" tie: {win['tie']}")
print(f"\n=== AXIS AVG (faithfulness/correctness/completeness, N={n}) ===")
for v in ["prehier","hier_sim_clean"]:
    f,c,co = axis[v]
    print(f" {v}: faith={f/n:.2f} correct={c/n:.2f} complete={co/n:.2f}")