hyungi_document_server

Author	SHA1	Message	Date
hyungi	cd33ded7a8	docs(search): passage-RAG go/no-go = NO-GO (hier evidence equivalent, diagnose c4+c5) PR-DocSrv-Hier-PassageRAG-Diagnose-1 c4+c5. Conditional N=12 (retrieval controlled) blind pairwise (hypothesis-blind subagent, anonymized 3-file split). Result: 4-way convergence = equivalent: pairwise prehier 4 / hier 3 / tie 5 (no edge) + axis ±0.08 + objective identical (halluc 36/36) + variance ~0 (byte-identical regeneration). No verbosity artifact (prehier was longer but only +1 win). => NO-GO: hier-leaf evidence yields no gain. hier leaf fully confirmed as section-outline UI only (UI yes / doc-search NO-GO / passage-RAG NO-GO; all 3 areas closed). 2026-06-21 freeze input only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 07:02:46 +00:00
hyungi	9c039139ef	feat(search): passage-RAG capture runner + raw JSONL (diagnose c3) PR-DocSrv-Hier-PassageRAG-Diagnose-1 c3. 22Q x {prehier,hier_sim_clean} /ask?debug=true exact_knn capture (44 records). Recorded ai_answer, evidence, target_doc_present, target_span_used, and objective signals (hallucination/grounding/completeness/refused). Observations: hier fails to retrieve some targets (exam_005/006, cl_007; consistent with the doc-search NO-GO) + some gains (cl_001/002). Empty-answer cases exist (cl_005/cl_007 prehier, cl_006/exam_004 skipped). First run partially failed due to 15-minute JWT expiry → completed on a cache-warm rerun. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 06:53:11 +00:00
hyungi	6a9142a2e5	docs(search): hier vs legacy go/no-go = NO-GO (replace-diagnose c6) PR-DocSrv-Hier-Replace-Diagnose-1 c6 measurement + decision. prehier exact vs hier_sim exact, dedup 0/51. Decision basis (decomposed subset, n=41): prehier 0.748 -> hier_sim_clean 0.675 (-0.074 regression). raw 0.673 (robust). By category: standards (statutes, the hypothesized best case for hier) flat -0.002 / exam -0.183 / korean -0.109 / english -0.088. Even statute "Article N" (제N조) lookups show no improvement + mostly regression → short section leaves lose context. dedup clean = actual value. => NO-GO: the search corpus will not be replaced with hier. Apply PR not entered. hier leaves remain with in_corpus=false (material for the section-outline UI, unrelated to doc-level search). Measurement limited to doc-level NDCG. Artifacts: decision md + 4 eval csv (sanity/prehier/clean/raw exact) + subset analysis script. in_corpus 634 unchanged throughout. 0 regressions on the default search path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 05:46:14 +00:00
hyungi	b00d9f5e15	docs(eval): Phase 2Q Category-Analysis: standards/exam regression diagnosis (inflation correction) Read-only follow-up diagnosis after the Apply rollout. The Phase 3 measurement (commit `a41adb6`) of NDCG 0.927 + standards 1.441 + exam 1.109 = measurement artifact (duplicate docs recorded in the top-N → graded NDCG inflation). Diagnosis path: - script category_analysis_phase2q.py (csv parse + queries.yaml graded lookup + 3-way top-5 record for 18 standards/exam cases) - largest-regression cases: kw_004/kw_009/kw_010 = Phase 3 inflated 1.631 → Rerank-Fix normal 1.000 (same as baseline, 0 regression) - kw_001/exam_004 = Rerank-Fix regresses even versus baseline (reranker prioritizes chunk-level relevance → the grade-3 doc is pushed to rank 5) Corrected values recorded: - Phase 3 NDCG 0.927 → Rerank-Fix 0.876 (accurate value) - Δ vs baseline: +0.268 (inflated) → +0.217 (actual multi-query effect) - standards 1.441 → 1.157 (vs baseline 0.873, +0.284) - exam 1.109 → 0.918 (vs baseline 0.738, +0.180) Conclusion: - Apply rollout decision = unchanged under the corrected values; +0.217 vs baseline = meaningful net improvement - standards -0.28 / exam -0.19 regressions = false alarm (inflation correction) - actual regression cases (kw_001/exam_004) = items to record in telemetry after Apply Artifacts: - tests/search_eval/baselines/v0_2_phase2q_category_analysis_2026-05-24.md (180+ lines, §1~8) - tests/search_eval/scripts/category_analysis_phase2q.py (read-only csv parse script, reproducibility) New feedback memory: graded-ndcg-dedup-invariant (invariant: NDCG > 1.0 = suspected inflation; dedup audit required) Follow-up separate chore candidates: - PR-Eval-GradedNDCG-Dedup: dedup in run_eval.py's graded NDCG computation + NDCG > 1.0 warning - PR-2Q-Search-Result-Dedup: audit duplicate representative doc_id in _rrf_fuse_variants Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 04:23:58 +00:00
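
The inflation described in commit b00d9f5e15 (duplicate docs in the top-N pushing graded NDCG above 1.0) can be illustrated with a minimal sketch. This is not the repository's `run_eval.py`; the function names `dcg`, `dedup_keep_first`, and `graded_ndcg`, the `(2^g - 1) / log2(rank + 1)` gain form, and the sample doc ids are all assumptions chosen for illustration. It shows that when the ideal DCG is built from the unique graded docs but the ranked list repeats a high-grade doc, the ratio exceeds 1.0, and that deduplicating the ranked list first restores a valid score.

```python
import math
import warnings


def dcg(gains):
    """Discounted cumulative gain: sum of (2^g - 1) / log2(rank + 1), ranks from 1."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def dedup_keep_first(doc_ids):
    """Drop repeated doc ids, keeping the first (best-ranked) occurrence."""
    seen = set()
    unique = []
    for doc_id in doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(doc_id)
    return unique


def graded_ndcg(ranked_doc_ids, grades, k=5, dedup=True):
    """Graded NDCG@k. `grades` maps doc id -> relevance grade (hypothetical schema)."""
    ranked = dedup_keep_first(ranked_doc_ids) if dedup else list(ranked_doc_ids)
    gains = [grades.get(doc_id, 0) for doc_id in ranked[:k]]
    # The ideal ranking uses each graded doc exactly once.
    ideal_gains = sorted(grades.values(), reverse=True)[:k]
    idcg = dcg(ideal_gains)
    if idcg == 0:
        return 0.0
    score = dcg(gains) / idcg
    if score > 1.0 + 1e-9:
        # NDCG is bounded by 1.0 for a duplicate-free ranking, so this signals inflation.
        warnings.warn(f"NDCG {score:.3f} > 1.0: suspected duplicate docs in ranking")
    return score


if __name__ == "__main__":
    grades = {"doc_a": 3, "doc_b": 1}          # illustrative grades only
    ranked = ["doc_a", "doc_a", "doc_b"]       # doc_a appears twice in the top-N
    print(graded_ndcg(ranked, grades, dedup=False))  # inflated, above 1.0
    print(graded_ndcg(ranked, grades, dedup=True))   # 1.0
```

Keeping the first occurrence on dedup preserves the best rank each doc achieved, so the corrected score never penalizes a doc for also appearing lower in the list. The warning mirrors the "NDCG > 1.0 = suspected inflation" invariant named in the commit, but where such a check would live in the real evaluation code is not stated in the log.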

4 Commits