Commit Graph

27 Commits

Author SHA1 Message Date
hyungi 85304878f4 test(eval): Phase 2A eval 산출물 — baseline exact 0.6315 vs qwen06/-0.041 qwen4/-0.032 qwen4m/-0.033 (no-go)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 05:53:20 +00:00
hyungi 8ac1dbf4a8 test(eval): Phase 2A E-4 비교기 — per-query win/loss/tie(ε)·부트스트랩 CI·카테고리 분해
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 08:34:18 +09:00
hyungi cd33ded7a8 docs(search): passage-RAG go/no-go = NO-GO (hier evidence 동등, diagnose c4+c5)
PR-DocSrv-Hier-PassageRAG-Diagnose-1 c4+c5. 조건부 N=12(retrieval 통제) blind pairwise
(hypothesis-blind subagent, 익명 3-file split). 결과 4-way 수렴 = 동등:
pairwise prehier4/hier3/tie5(no edge) + axis ±0.08 + objective 동일(halluc36/36) +
variance~0(byte-identical 재생성). verbosity artifact 없음(prehier 더 길었으나 승+1).
=> NO-GO: hier-leaf evidence 무이득. hier leaf = section-outline UI 전용 완전 확정
(UI yes / doc-search NO-GO / passage-RAG NO-GO 3영역 종결). 2026-06-21 freeze input only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 07:02:46 +00:00
hyungi 9c039139ef feat(search): passage-RAG capture runner + raw JSONL (diagnose c3)
PR-DocSrv-Hier-PassageRAG-Diagnose-1 c3. 22Q x {prehier,hier_sim_clean} /ask?debug=true
exact_knn capture (44 rec). ai_answer/evidence/target_doc_present/target_span_used/
objective signals(hallucination/grounding/completeness/refused) 박제.
관찰: hier 일부 타깃 retrieval 실패(exam_005/006,cl_007=doc-search NO-GO 일관) + 일부 gain
(cl_001/002). empty-answer 케이스(cl_005/cl_007 prehier, cl_006/exam_004 skipped) 존재.
JWT 15min 만료로 1차 부분실패 → cache-warm 재실행 완주.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:53:11 +00:00
hyungi 698510bc0e feat(search): passage-RAG answer-seeking question subset (diagnose c2)
PR-DocSrv-Hier-PassageRAG-Diagnose-1 c2. queries.yaml v0.2 의 answer-seeking 22문항
(exam 7 + korean_only 7 + mixed 8, decomposed-target 필터). targets_g2/g3 = 조건부 subset
산출용. broad seed (조건부 ~65-70% → N≥12 확보). 신규 authoring 0 (기존 graded 재사용).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 06:20:20 +00:00
hyungi 6e9d73278f docs(search): pin hier measurement views as EVAL-ONLY (replace-diagnose)
COMMENT ON VIEW + header — corpus_chunks_{prehier,hier_sim_raw,hier_sim_clean} 은
?corpus_variant= eval dispatch 전용. production retrieval default-path 는 corpus_chunks
(partial ivfflat) 만. 재측정/passage-RAG 재평가 자산으로 보존, 오용 방지 박제.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:53:04 +00:00
hyungi 6a9142a2e5 docs(search): hier vs legacy go/no-go = NO-GO (replace-diagnose c6)
PR-DocSrv-Hier-Replace-Diagnose-1 c6 측정+결정. prehier exact vs hier_sim exact, dedup 0/51.
결정權(분해-subset n=41): prehier 0.748 -> hier_sim_clean 0.675 (-0.074 회귀). raw 0.673 (robust).
카테고리: standards(법령, hier 최적가설) flat -0.002 / exam -0.183 / korean -0.109 / english -0.088.
법령 제N조조차 개선 없음 + 대체로 회귀 → 짧은 절 leaf 가 맥락 손실. dedup clean = 실제값.
=> NO-GO: 검색 코퍼스 hier 교체 안 함. Apply PR 미진입. hier leaf 는 in_corpus=false 잔존
(section-outline UI 재료, doc-level 검색 무관). 측정은 doc-level NDCG 한정.

산출물: decision md + 4 eval csv(sanity/prehier/clean/raw exact) + subset analysis script.
in_corpus 634 전 구간 불변. default 검색 path 회귀 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:46:14 +00:00
hyungi 100aaa3b0c feat(search): corpus_variant + exact_knn measurement dispatch (replace-diagnose c4+c5)
PR-DocSrv-Hier-Replace-Diagnose-1 c4+c5. hier vs prehier(legacy) go/no-go 비파괴 측정 hook.
- 측정 뷰 3종 (hier_measure_views.sql, additive/droppable): corpus_chunks_prehier
  (legacy+null-source 375 포함) / hier_sim_raw / hier_sim_clean (childless-tiny<30 제외,
  all-tiny doc 은 legacy fallback 정합).
- retrieval_service: _resolve_corpus_variant + CORPUS_VARIANT_MAP + _VALID_CHUNKS_TABLE
  3 뷰 추가 + exact_knn(SET LOCAL enable_indexscan/bitmapscan=off, eval 전용).
  chunk leg 만 영향 (doc-level + fts/trgm = documents 무관). baseline/None path 회귀 0.
- search_pipeline.run_search + search.py: corpus_variant/exact_knn 전달, unknown→400,
  embedding_backend cand 와 동시 사용 금지(400).
- run_eval: --corpus-variant + --exact-knn flag.
- tests/test_corpus_variant.py 22 PASS (resolver/map/allowlist + SQL injection 거부).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:37:15 +00:00
hyungi e860baa179 ops(hier): Phase A law/library decompose + snapshot freeze (replace-diagnose c3)
47 eval-target undecomposed non-news docs (law21+library24+document2) 분해+임베딩
(--skip-analysis, additive). 1005 leaf 생성 fail0, in_corpus 634 무손상 검증.
snapshot doc_id_max=25912 chunk_id_max=71164 docs_decomposed 301->348. 측정 drift 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 05:23:38 +00:00
hyungi 3b753f18d6 fix(search): Phase 2Q result dedup — apply_diversity unlimited path doc_id inflation 차단
PR-2Q-Search-Result-Dedup. measurement chain 의 마지막 cleanup. plan inline.

root cause: apply_diversity 의 top_score ≥ 0.90 → unlimited path (diversity 제약 해제)
→ 같은 doc 의 N chunks 가 results 에 박제 → returned_ids 에 doc.id 중복 → 모든 graded
metric inflation. multi-query 의 reranker score 가 자주 0.90+ → 다수 case 영향.

변경 (baseline path 영향 0, multi-query 전용 invariant):
- app/services/search/search_pipeline.py:
  · _dedup_results_by_doc_id() helper 신규 (doc.id first-only, top score 보존)
  · search_with_rewrite() 의 rerank path 에 apply_diversity(top_score_threshold=2.0)
    강제 + 후속 _dedup_results_by_doc_id 적용
  · rerank=False path 도 _dedup_results_by_doc_id(unified_docs) 적용
- tests/test_query_rewriter.py — 신규 4 test (55/55 PASS)

🎯 진짜 측정값 (모든 dedup layer 적용, 51 case gemma):
  cold: NDCG 0.663 / Recall t≥2 0.729 / Recall t≥3 0.761 / p50 3692ms / p95 9992ms
  warm: NDCG 0.659 / Recall t≥2 0.721 / Recall t≥3 0.739 / p50 1588ms / p95 3514ms
  baseline (rewrite_backend=null): NDCG 0.644 / Recall t≥2 0.699 / Recall t≥3 0.761 / p50 378ms
  Dedup audit: gemma 0/51 ✓ 정상 (fix 작동, eval-dedup 42/51 → 0/51 회복)

Δ vs baseline (진짜 multi-query 효과):
  NDCG +0.019 (cold) / +0.015 (warm) — sub-noise level
  Recall t≥2 +0.030 (cold) / +0.022 (warm) — 소량 개선
  Recall t≥3 0.000 / -0.022 — 동등~약간 회귀
  latency p50 +876% (cold) / +320% (warm) — major cost
  category: english/standards/mixed 약간 우세 / exam/korean 약간 회귀

measurement chain 정정 history:
  Phase 3 (a41adb6) 0.927 — chunk_id 중복 inflation
  Rerank-Fix (b734fc5) 0.876 — doc_id 중복 잔재
  Eval-Dedup (3553573) 0.641 — eval layer 만 dedup
  Result-Dedup (본 PR) 0.663 — production + eval 둘 다 dedup ← 정확값

사용자 결정 필요 (3 path, json 박제):
  (a) rollback — marginal 개선이 latency cost 정당화 X
  (b) opt-in 유지 + PR-2Q-Cache-Prewarm 진입 (warm path 만 노출)
  (c) 1주 관찰 종료 후 (2026-05-31) 재결정 (현 상태 유지)

산출물:
  reports/v0_2_phase2q_result_dedup_gemma_{cold,warm}_2026-05-24.csv
  tests/search_eval/baselines/v0_2_phase2q_result_dedup_2026-05-24.json (요약 + 사용자 결정 옵션)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 04:48:50 +00:00
hyungi 9dad5e6289 chore(eval): graded NDCG dedup + warning + audit stats (Phase 2Q inflation 정정)
PR-Eval-GradedNDCG-Dedup. [[feedback_graded_ndcg_dedup_invariant]] cleanup.
plan pr-eval-graded-ndcg-dedup-stormy-tide.md.

변경:
- tests/search_eval/run_eval.py:
  · _dedup_returned_ids() helper — returned[:k] 첫 등장 순서 보존 dedup + count 반환
  · count_dedup() wrapper (audit 용)
  · ndcg_at_k + graded_ndcg_at_k 진입 시 dedup (NDCG > 1.0 invariant 강제)
  · QueryResult.dedup_count 필드 + csv schema 신규 column
  · evaluate() 에서 dedup_count > 0 시 stderr WARNING
  · print_summary 에 dedup audit stats (cases/total chunks + 정상/⚠️ flag)
- tests/search_eval/test_eval_graded_ndcg_dedup.py 신규 — 13 test:
  · _dedup_returned_ids 6 (empty / no-dup / dup-first / k-limit / count helper / Phase 2Q kw_001)
  · graded_ndcg invariant 5 (baseline 회귀 0 / dup 차단 / all-dup / exam_001 regression / empty grades)
  · ndcg_at_k binary dedup 1 + graded_recall set 변환 1

51/51 test PASS (13 신규 + 38 기존 회귀 0).

🚨 CRITICAL 측정 발견:
  dedup audit baseline = 0/51 정상 (single-query path 의 retrieval 가 doc unique 박제)
  dedup audit gemma = 42/51 (totaling 81 chunks dedup) ⚠️
  → _rrf_fuse_variants 의 representative 보존 logic 이 같은 doc_id 의 여러 SearchResult
    를 unique 가정. chunk_id dedup (Rerank-Fix) 이후에도 doc_id 중복 잔재.

정정값 (이번이 가장 정확):
  baseline NDCG 0.644 (이전 0.659 와 noise level diff)
  gemma NDCG 0.641 → Δ vs baseline = -0.003 (사실상 동일, multi-query 실제 net 효과 ≈ 0)
  latency p50 +1005ms (+266%) — 회귀
  Recall t≥3 -0.033 (회귀)

이전 박제값 (모두 inflation):
  Phase 3 (a41adb6) NDCG 0.927 — chunk_id 중복
  Rerank-Fix (b734fc5) NDCG 0.876 — doc_id 중복 잔재
  Category-Analysis (b00d9f5) NDCG 0.876 정정 박제 — 위와 동일

산출물:
  reports/v0_2_phase2q_eval_dedup_baseline_2026-05-24.csv (baseline 회귀 verify)
  reports/v0_2_phase2q_eval_dedup_gemma_2026-05-24.csv (실제 효과 측정)
  tests/search_eval/baselines/v0_2_phase2q_eval_dedup_2026-05-24.json (요약 + critical 권고)

권고 (사용자 결정 필요):
  1. Apply rollback 검토 — multi-query 의 실제 net 효과 ≈ 0 + latency 4x 회귀
  2. 또는 PR-2Q-Search-Result-Dedup 진입 (real fix _rrf_fuse_variants representative)
     후 재측정 → 실제 multi-query 효과 측정 후 Apply 결정

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 04:35:33 +00:00
hyungi b00d9f5e15 docs(eval): Phase 2Q Category-Analysis — standards/exam 회귀 진단 (inflation 정정)
Apply rollout 후속 read-only 진단. Phase 3 측정 (commit a41adb6) 의 NDCG 0.927 + standards 1.441 + exam 1.109 = **측정 artifact (top-N doc 중복 박제 → graded NDCG inflation)**.

진단 path:
- script category_analysis_phase2q.py (csv parse + queries.yaml graded lookup + standards/exam 18 case 3-way top-5 박제)
- 회귀 큰 case top: kw_004/kw_009/kw_010 = Phase 3 inflation 1.631 → Rerank-Fix 정상 1.000 (baseline 동일, 회귀 0)
- kw_001/exam_004 = Rerank-Fix 가 baseline 대비도 회귀 (reranker chunk-level relevance 우선 → doc grade 3 가 rank 5 밀림)

정정값 박제:
- Phase 3 NDCG 0.927 → **Rerank-Fix 0.876 (정확값)**
- Δ vs baseline: +0.268 (inflated) → **+0.217 (실제 multi-query 효과)**
- standards 1.441 → 1.157 (vs baseline 0.873, +0.284)
- exam 1.109 → 0.918 (vs baseline 0.738, +0.180)

결론:
- **Apply rollout 결정 = 정정값 기준 invariant 유지** — +0.217 vs baseline = 유의미 net 개선
- standards -0.28 / exam -0.19 회귀 = false alarm (inflation 정정)
- 실제 회귀 case (kw_001/exam_004) = Apply 후 telemetry 박제 항목

산출물:
- tests/search_eval/baselines/v0_2_phase2q_category_analysis_2026-05-24.md (180+ lines, §1~8)
- tests/search_eval/scripts/category_analysis_phase2q.py (read-only csv parse script, reproducibility)

신규 feedback memory: graded-ndcg-dedup-invariant (NDCG > 1.0 = inflation 의심 invariant + dedup audit 필수)

후속 별 chore 후보:
- PR-Eval-GradedNDCG-Dedup — run_eval.py 의 graded NDCG 계산 dedup + NDCG > 1.0 warning
- PR-2Q-Search-Result-Dedup — _rrf_fuse_variants 의 representative doc_id 중복 audit

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 04:23:58 +00:00
hyungi 59bde9a399 feat(search): phase-2q apply opt-in — production rollout 시작, 1주 관찰 (gemma-4)
plan pr-2q-apply-query-rewrite-1-bright-meadow.md. Phase 2Q Diagnose closure +
Rerank-Payload-Fix (main 0257a5d) 완료 후 Apply rollout. opt-in path 가 Phase 1B/2
부터 이미 production 가동 중 → 본 PR 의 production 영향 0 (marker PR).

rollout 정책:
  · default = rewrite_backend null (single-query path, baseline 회귀 0 invariant)
  · 명시 opt-in = ?rewrite_backend=cand_multi_query_macmini (추천 gemma-4)
  · 대안 = cand_multi_query_macbook (qwen3.6, mixed/english 강점, MacBook 가동 시)
  · 1주 관찰 (2026-05-24 ~ 2026-05-31) → metric 정상 시 default ON 별 PR

변경 (production 영향 0):
- docs/phase_2q_apply_opt_in.md 신규 — 사용자 가시화:
  · 사용 방법 (query param + SvelteKit fetch 예시)
  · 1주 관찰 metric 목표 (cache hit ≥ 50% / LLM warm p50 ≤ 1500 / 503 ≤ 5/day / Recall t≥3 ≥ 0.74)
  · 추천 LLM 사유 (decision md §4 4-factor) + 대안 명시
  · Phase 2 QueryAnalyzer sequencing 박제 (영향 0, ask_events 0건 운영 관찰 후 확정)
  · Follow-up PR 5건 명시 (Telemetry / Alert / Default-ON / Cache-Prewarm / Category-Analysis)
- app/api/search.py — rewrite_backend query param description 갱신.
  Apply 진입 박제 + 추천 LLM 표시 + docs 링크. 동작 변경 0.
- tests/search_eval/baselines/v0_2_phase2q_apply_smoke_2026-05-24.json — production smoke:
  · opt-in path HTTP 200 + total_ms 957 (cache hit) + rerank_ms 109 (정상 호출) + fallback 0
  · baseline path HTTP 200 + total_ms 207 + rerank_ms 19 + fallback 0 (회귀 0 확정)

38/38 unit test PASS (회귀 0). main HEAD 0257a5d 위 branch.

Closure gate PASS:
  · docs 가시화 / search.py description / smoke json 박제
  · production smoke 양쪽 path 정상 + 회귀 0 verify
  · 메모리 갱신 + 1주 관찰 종료일 2026-05-31 박제

Follow-up: 1주 후 PR-2Q-Apply-Default-ON-1 (metric 정상 시) 또는 fix PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 04:01:49 +00:00
hyungi b734fc54af fix(search): Phase 2Q rerank payload — chunk_id dedup + cap 60 + TEI batch 64 (Apply prereq)
plan pr-2q-rerank-payload-fix-resolute-haven.md. Phase 2Q multi-query path 의 reranker
413 Payload Too Large root cause = TEI 의 MAX_CLIENT_BATCH_SIZE=32 default (batch entries
한도) + multi-query 의 chunks 누적이 32 초과. MAX_BATCH_TOKENS 와 별개 (token sum 한도).

4 iteration 진단 history (json 박제):
  1) cap 60 + dedup = 413 다수 (batch 54 > 32)
  2) cap 30 + chunks_per_doc=1 = 413 0건 + NDCG 0.666 catastrophic (-0.261)
  3) cap 60 + dedup + TEI 16384 only = 413 46건 (batch size 한도 별개)
  4) cap 60 + dedup + TEI 16384/64 = 413 1건 + NDCG 0.876 (FINAL)

변경:
- app/services/search/search_pipeline.py:
  · _dedup_chunks_by_id() 신규 helper — chunk_id (None 시 doc.id) 기준 first-only.
    variant 별 same chunk 중복 누적 회피, 첫 등장 variant 보존.
  · PHASE2Q_RERANK_INPUT_CAP=60 + PHASE2Q_CHUNKS_PER_DOC=2 신규 상수 (baseline
    MAX_RERANK_INPUT=200 / MAX_CHUNKS_PER_DOC=2 와 별도).
  · search_with_rewrite() merge 후 dedup wire-up + rerank input cap swap.
- docker-compose.yml reranker env (사용자 결정, plan out-of-scope 정정):
  · MAX_BATCH_TOKENS 8192 → 16384 (token sum 한도)
  · MAX_CLIENT_BATCH_SIZE 32 → 64 신규 추가 (batch entries 한도 — root cause)
  · GPU VRAM free 6199MiB 충분 사전 verify.
- tests/test_query_rewriter.py: _dedup_chunks_by_id 5 test + PHASE2Q_* constants test.
  38/38 PASS (기존 32 + 신규 6).

측정 결과 (51 case, gemma backend, snapshot 25180/56526):
  vs Phase 3 (commit a41adb6 NDCG 0.927, 413 다수):
  · NDCG 0.876 (-0.051 acceptable, plan 변수 격리 invariant 충족)
  · Recall t≥2 0.721 (+0.034 회복)
  · Recall t≥3 0.739 (+0.011)
  · latency p50 1421ms (-1336ms, -48%) / p95 3392ms (-6292ms, -65%) major win
  · 413 fallback 1/51 (98%↓ from 다수) + reranker batch error 0
  · 카테고리 english_only +0.34 / standards -0.28 / exam -0.19 (Apply 후 분석 항목)

closure gate PASS:
  · unit test 38/38, production smoke 413 0
  · 51 case 413 < 5/51 (1건만)
  · latency 대폭 개선
  · NDCG threshold 0.92 미달 단 plan invariant (production 평가 단일 변수) 충족
  · Apply PR-2Q-Apply-Query-Rewrite-1 진입 ready

산출물:
  · reports/v0_2_phase2q_rerank_fix_2026-05-24.csv (raw)
  · tests/search_eval/baselines/v0_2_phase2q_rerank_fix_2026-05-24.json (4 iter 진단 박제)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 03:54:59 +00:00
hyungi c57e4c52dc docs(eval): Phase 2Q Diagnose Phase 4 — decision tree md + Apply PR 백로그
phase-2q-query-rewrite-diagnose.md v6 plan §7 Phase 4 closure.
Phase 3 commit a41adb6 의 3 측정 결과 + 4 factor weighted decision.

decision = H1 (both backends NDCG net 개선 ≥ +0.26):
- 추천 Apply LLM = cand_multi_query_macmini (gemma-4)
- 사유: F3  24/7 가동 + F1 NDCG 0.927 dominant + F4 cold latency 우세
- 대안: qwen (mixed/english 강점 + MacBook always-on 의향 시)

산출물:
- tests/search_eval/baselines/v0_2_phase2q_decision_2026-05-24.md (180 lines)
  · §1 결정 요약 / §2 측정 표 / §3 카테고리 회복 / §4 4-factor weighted
  · §5 분석 노트 5건 (multi-query 효과 / variants 구성 / cache hit / Recall 회귀 / Phase 3 incident)
  · §6 closure gate (branch close 사용자 결정 보류)
  · §7 follow-up PR 백로그: Apply 1 + 별 chore 2 + Extended 4 + Cloud 1 + Cleanup 1
  · §9 사용자 검토 항목 5건

Phase 2Q Diagnose closure 완료. Apply PR 진입 = 사용자 LLM 선택 + sequencing 결정 후.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 00:57:48 +00:00
hyungi a41adb63a0 fix(search): Phase 2Q variants bug fix + Phase 3 3 measurement 박제
Phase 3 cold 측정 1차에서 NDCG 0.033 catastrophic 발견 — 모든 query 에 동일 variants
반환. root cause = _call_llm 이 user 메시지 1개에 prompt template 전체 박음. LLM 이
actual query 인식 못 함. fixture request_body 형식 (system=prompt / user=query) 과
mismatch. fixture-first invariant 위반.

fix:
- app/services/search/query_rewriter.py _call_llm — system/user 메시지 분리.
  fixture request_body 와 단일 source-of-truth. _render_prompt 는 [deprecated] 유지.
- tests/test_query_rewriter.py — Phase 3 regression test 2:
  · _call_llm 가 system + user 분리 호출 verify (httpx.AsyncClient monkeypatch)
  · qwen backend = response_format 미사용 verify
- 32/32 unit test PASS.

Phase 3 측정 (fix 후 재측정, 51 case × 3 candidate × cold/warm = 5 run):
- baseline_rebaseline (rewrite_backend=null): NDCG 0.659 = Phase 2A 0.659, diff 0.000 PASS
- cand_multi_query_macmini cold: NDCG 0.927 (Δ +0.268), p50 2757ms / p95 9684ms
- cand_multi_query_macmini warm: NDCG 0.927 동일, p50 998ms (cache hit -64%)
- cand_multi_query_macbook cold: NDCG 0.919 (Δ +0.260), p50 3647ms / p95 5202ms
- cand_multi_query_macbook warm: NDCG 0.919 동일, p50 873ms (cache hit -76%)

핵심 약점 회복 (gemma / qwen):
- mixed 0.39 → 0.57 / 0.65
- korean_only 0.51 → 0.71 / 0.67
- standards 0.87 → 1.44 / 1.31
- exam 0.74 → 1.11 / 1.04

decision = H1 (both backends 유의미 net 개선). LLM 선택 = Phase 4 decision md 별 step.

산출물:
- reports/v0_2_phase2q_*.csv (5 raw run_eval output)
- tests/search_eval/baselines/v0_2_phase2q_results_2026-05-24.json (요약 + incident 박제)

follow-up:
- rerank 413 Payload Too Large 다수 관찰 (RRF fallback 작동, NDCG 영향 없음). Apply PR 전 별 chore — chunk dedup 또는 reranker batch cap 검토.
- p95 cold 9684ms 매우 큼. production rollout 시 cache prewarm 정책 필수.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 00:51:56 +00:00
hyungi 3e6866b4ae feat(search): Phase 2Q Diagnose Phase 1B — scaffold + dispatcher
phase-2q-query-rewrite-diagnose.md v6 plan Phase 1 의 fixture 외 잔여.
Phase 1A 446ba82 위 dispatcher + cache + LLM call + API param + eval flag + 21 unit test.
retrieval 합성 (search_with_rewrite) 은 Phase 2 별 commit.

신규:
- app/services/search/query_rewriter.py — LLM_BACKEND_MAP + _resolve + cache + rewrite()
  · slug-based allowlist (no silent fallback), httpx 직접, Priority.FOREGROUND semaphore
  · sampling 박제 (gemma response_format json_object / qwen prompt rule only — Phase 0 inspect 9)
  · manual TTL cache (query_analyzer 패턴 1:1, sha256[:32] NFKC key, LLM_REWRITE_TIMEOUT_MS=15000)
- tests/test_query_rewriter.py — 21 test PASS (resolve / cache key / parser / cache TTL / constants)

수정:
- app/api/search.py — ?rewrite_backend= query param + 400 unknown / 503 unavailable.
  scaffold = call but discard variants (retrieval path 영향 0). Phase 2 에서 합성.
- tests/search_eval/run_eval.py — --rewrite-backend flag + 4 hot spot wire-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 22:25:03 +00:00
hyungi 076c0e1802 feat(eval): Phase 2B Reranker Diagnose — dispatcher + gte 측정 + decision (H3 bge-reranker-v2-m3 유지)
round-2-review-mighty-starfish.md v2.1 (Phase 2B Reranker Diagnose) plan 실행.
Phase 2A 의 CANDIDATE_BACKEND_MAP 패턴 재사용 + RERANKER_BACKEND_MAP 신규.

코드 변경 (4 파일):
- app/services/search/rerank_service.py:
  - RERANKER_BACKEND_MAP allowlist (baseline / cand_gte_ml_base, slug-based resolve)
  - _resolve_reranker(slug) → endpoint URL or None
  - _rerank_via_candidate_endpoint() — 후보 TEI POST /rerank
  - rerank_chunks() 시그니처에 reranker_backend + snapshot_*_id_max 추가 + dispatch log
- app/services/search/search_pipeline.py: run_search() threading
- app/api/search.py: reranker_backend Query parameter + 400 unknown_reranker_backend 에러 매핑
- tests/search_eval/run_eval.py: --reranker-backend flag + call_search/evaluate threading

infra:
- docker-compose.override.rerank-cand.yml: 3 후보 service (gte_ml_base / mxbai_large / bge_v2_gemma_2b),
  profile 'rerank-cand' 격리, restart=unless-stopped

측정 산출물 (51 case, scored=46, failure=5):
- reports/v0_2_phase2b_baseline_snapshot_2026-05-23.csv (NDCG 0.659, Phase 2A 와 일치 = 재현성 PASS)
- reports/v0_2_phase2b_gte_ml_base_2026-05-23.csv
- tests/search_eval/baselines/v0_2_phase2b_{baseline_snapshot,gte_ml_base}_2026-05-23.json
- reports/phase_2b_reranker_decision_2026-05-23.md
- tests/fixtures/tei_rerank_response.json (G0-1 한국어+영어 mixed sample sanity PASS)

후보 TEI 1.7 호환성 (Phase 1 smoke gate):
- cand_gte_ml_base       :  PASS (xlm-roberta-based, TEI 호환)
- cand_mxbai_large       :  deberta-v2 미지원 → Phase 2B-Extended (sentence-transformers wrapper)
- cand_bge_v2_gemma_2b   :  LLM-based reranker, 1_Pooling/config.json 부재 → Phase 2B-Extended (FlagEmbedding wrapper)

결과 (1 후보 측정 + baseline rebaseline):
| Candidate                          | NDCG  | Δ baseline | mixed | korean | exam  | p50 ms |
|------------------------------------|------:|-----------:|------:|-------:|------:|-------:|
| bge-reranker-v2-m3 (baseline)      | 0.659 | —          | 0.39  | 0.51   | 0.74  | 454    |
| cand_gte_ml_base                   | 0.604 | -0.055     | 0.38  | 0.41   | 0.62  | 345    |

Decision (H3): bge-reranker-v2-m3 유지. gte 의 reranker quality 가 production 보다 약함 (korean_only -0.10, exam -0.12, overall -0.055).

후속 PR 백로그 (6건):
- PR-Search-Query-Rewrite-1 (Phase 2Q, korean_only/mixed 보완 권고)
- PR-2B-Extended-Mxbai-Large (sentence-transformers wrapper)
- PR-2B-Extended-Bge-V2-Gemma (FlagEmbedding LayerwiseReranker wrapper)
- PR-2B-Extended-Jina-V2-ML (license 결정 후, 개인 비영리 가정)
- PR-2B-Cloud-Reranker-Scaffold-1 (Cohere scaffold-only, 선택)
- PR-2B-Rerank-Cand-Cleanup-1 (1주 후 cand 컨테이너 정리)

production 영향:
- production reranker (bge-reranker-v2-m3) 변경 0
- config.yaml ai.models.rerank.endpoint 변경 0
- embedding (bge-m3 ollama) 변경 0 (Phase 2A 결정 보존)
- documents / document_chunks 변경 0 (21365 docs / 30605 chunks 그대로)
- 4 smoke PASS (baseline / baseline+snapshot / cand_gte_ml_base / cand_invalid → 400)
- dispatch log 박제 verify (endpoint + snapshot id)

closure gate: 16 항목 PASS (flex closure 조항 적용 — 1 후보 측정, 2 후보 TEI 호환 탈락 사유 명시).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 08:37:42 +00:00
hyungi 3092e3009d feat(eval): Phase 2A Diagnose Phase 3+4 — dispatcher + 3 측정 + decision (H3 bge-m3 유지)
phase-2a-embedding-diagnose.md v4 § 6 (dispatcher) + § 7 Phase 3 (51 case 측정) + § 7 Phase 4 (decision)
Round 2 review: round-2-review-mighty-starfish.md (R2-2 + R2-B1 페어 invariant + slug-based resolve)

코드 변경:
- app/services/search/retrieval_service.py:
  - CANDIDATE_BACKEND_MAP allowlist (baseline / cand_me5_large_inst / cand_snowflake_l_v2)
  - _resolve_backend(slug) → docs_table/chunks_table/embed_endpoint or None
  - _embed_query_via_tei() — candidate TEI 엔드포인트 호출 (cache 미사용)
  - _VALID_DOCS_TABLE + _VALID_CHUNKS_TABLE regex (R2-B1 2단계 gate)
  - _search_vector_docs / _search_vector_chunks: docs_table/chunks_table + snapshot_*_id_max 파라미터
  - search_vector + search_vector_multilingual: embedding_backend + snapshot_*_id_max 파라미터 + dispatch log
- app/services/search/search_pipeline.py: run_search() 시그니처 + 4 search_vector* 호출 threading
- app/api/search.py: 3 Query parameter + ValueError → HTTP 400 (allowed list 응답)
- tests/search_eval/run_eval.py: --embedding-backend + --snapshot-doc-id-max + --snapshot-chunk-id-max
  + call_search/call_search_full/evaluate threading + main 3 asyncio.run threading

측정 산출물 (51 case, scored=46, failure=5):
- reports/v0_2_phase2a_baseline_snapshot_2026-05-23.csv (snapshot filter 적용 production path)
- reports/v0_2_phase2a_me5_large_inst_2026-05-23.csv
- reports/v0_2_phase2a_snowflake_l_v2_2026-05-23.csv
- tests/search_eval/baselines/v0_2_phase2a_{baseline_snapshot,me5_large_inst,snowflake_l_v2}_2026-05-23.json (3개)

결과:
| Candidate                          | NDCG | Δ vs baseline | mixed | korean_only | p50 ms |
|------------------------------------|-----:|--------------:|------:|------------:|-------:|
| bge-m3 (baseline snapshot)         | 0.659| —             | 0.39  | 0.51        | 464    |
| cand_me5_large_inst                | 0.477| -0.182        | 0.17  | 0.47        | 194    |
| cand_snowflake_l_v2                | 0.616| -0.043        | 0.35  | 0.52        | 254    |

Decision (H3): bge-m3 유지. 둘 다 net 회귀.
- mE5-large-instruct: 전 카테고리 회귀 (-0.182). prefix 미적용 변수 — 별 PR PR-2A-mE5-Prefix-Retry 후보.
- snowflake_l_v2: 가벼운 회귀 (-0.043). korean_only +0.01 미세 개선 신호.
- korean_only/mixed 약점 보완은 Phase 2B (Reranker) 또는 Phase 2Q (Query rewrite) 권고.

Decision report: reports/phase_2a_embedding_decision_2026-05-23.md (§ 1~8 포함, Closure gate 16 항목 모두 PASS).

후속 PR 백로그:
- PR-2A-mE5-Prefix-Retry (별 PR)
- PR-2A-Extended-Bge-Mgemma2 (별 PR, v3 결정)
- PR-2A-Cloud-Embedding-Scaffold-1 (Cohere/Voyage scaffold-only, 선택)
- PR-Search-Query-Rewrite-1 (Phase 2Q)
- PR-Search-Reranker-V2-Diagnose (Phase 2B)
- PR-2A-Chunks-Cand-Cleanup-1 (1주 후 cand 테이블 DROP)

production 영향:
- documents / document_chunks 컬럼/row 변경 0
- config.yaml 변경 0 (ollama bge-m3 unchanged)
- 추가된 endpoint = query parameter opt-in (미지정 시 production path 회귀 0)
- smoke 4건 PASS (baseline / baseline+snapshot / cand_me5 / cand_invalid → HTTP 400)
- dispatch log 박제 verify (snapshot_doc/chunk_id_max 박제)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 06:55:13 +00:00
hyungi a67df0a10b feat(eval): Phase 2A Diagnose Phase 2 — candidate reindex (me5 + snowflake 페어)
phase-2a-embedding-diagnose.md v4 § 7 Phase 2 산출.
페어 invariant (R2-2): documents_cand + document_chunks_cand 동기 swap, 부분 swap 금지.

- snapshot 박제 (R2-D): v0_2_phase2a_snapshot_2026-05-23.json
  - SNAPSHOT_DOC_ID_MAX=25180 / SNAPSHOT_CHUNK_ID_MAX=56526
  - documents_n=21365 (embedded, active) / chunks_n=30605
  - production ingest 정지 0, 모든 candidate reindex + baseline rebaseline 측정이 id<=snapshot 한정

- reindex_candidate.py 신규 (R2-5):
  - reindex_documents(): production _build_embed_input() import 재사용
  - reindex_chunks(): document_chunks.text 그대로 (재 chunking 0)
  - TEI batch=8 (1.7 internal queue overflow 회피) + truncate=true (mE5 512 context)
  - retry-8 exponential backoff (10/20/40/80/90s) — TEI SIGSEGV 자동 복구
  - idempotent ON CONFLICT DO NOTHING (cancellation/resume 안전)

- docker-compose.override.cand.yml: restart=unless-stopped (TEI 1.7 panic 자동 복구)

DB 산출물 (4 테이블):
  - documents_cand_me5_large_inst       : 21365 rows (dim 1024) + ivfflat lists=100
  - document_chunks_cand_me5_large_inst : 30605 rows (dim 1024) + ivfflat lists=100
  - documents_cand_snowflake_l_v2       : 21365 rows (dim 1024) + ivfflat lists=100
  - document_chunks_cand_snowflake_l_v2 : 30605 rows (dim 1024) + ivfflat lists=100
  - ivfflat.probes=20 (production 동일) 보존
  - smoke retrieval (nearest neighbor SQL) PASS 후보 2종

production 영향:
  - documents / document_chunks 컬럼/row 변경 0
  - config.yaml 변경 0 (ollama bge-m3 unchanged)
  - production fastapi/postgres/reranker 변경 0 (profile embed-cand 격리)

다음 단계: Phase 3 (DS API + retrieval_service slug-based dispatcher 추가, baseline rebaseline + 2 후보 51 case 측정).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 06:26:14 +00:00
hyungi 4d14ab69d9 feat(eval): v0.2 28 신규 case + 2026-05-23 baseline + analysis
PR-1 (725a4e1) v0.2 schema + harness 위에 신규 28 case 추가 → 51 case
완성 + 현재 모델로 baseline 박제 + 약점 카테고리 analysis md.

신규 28 case 분포 (계획 +28 = standards +6 / english_only +8 / mixed +5
/ exam +7 / failure_expected +2 / ocr_derived 0):
- standards 5 → 11 (KGS FP111/FU551 + 산안기준 후반 편 + 고압가스법)
- english_only 1 → 9 (Pressure Vessel Design Manual + ASME VIII/IX +
  Hydrogen ASME + Industrial Safety 영문 교재 + Structural Analysis)
- mixed 5 → 10 (한↔영 ASME / KGS-영문 / 양언어 압력용기)
- exam 0 → 7 (가스기사 study_questions → library 개념 docs 매핑)
- failure_expected 3 → 5 (KGS AC999 / 초전도 안전 관리법)
- ocr_derived 0 (TBD-O FAILED: extract_meta NULL 21385, chunks.source
  = RSS feed 명. OCR 식별 컬럼 부재 → +4 case 재배분, analysis 명시)

baseline 측정 결과 (corpus 21,385, hybrid mode, bge-m3 + bge-reranker-v2-m3):
- v0.1 Recall@10 0.646, MRR 0.724, NDCG 0.606, Top-3 0.891
- v0.2 graded NDCG 0.659, Recall@10 g≥2 0.695, g≥3 0.761
- latency p50 528ms / p95 1,664ms
- failure precision 0/5 (DS confidence threshold 미적용)

약점 top 3 (analysis md):
- mixed crosslingual 0.39 graded NDCG — TOP weakness, bge-m3
  multilingual 한계 추정
- korean_only natural language 0.51 — query rewrite 부재 추정
- failure_expected 0/5 — confidence cutoff 부재

Phase 2 dispatch 권고 (analysis md):
- 2A Embedding bge-m3 — 즉시 진입 (mixed/korean 동시 타격)
- 2B Reranker — M (2A 이후)
- 2C OCR-Marker — 선행 chore (OCR 식별 컬럼 추가) 필요
- 2D STT — 본 평가셋 외 (별 평가셋 필요)

Query rewrite 는 Phase 2Q/Search-PR 로 별도 분리.

영향 받는 파일:
- tests/search_eval/queries.yaml: 23 → 51 case (기존 23 변경 0, append only)
- tests/search_eval/baselines/v0_2_baseline_2026-05-23.json: 신규
- tests/search_eval/baselines/v0_2_baseline_2026-05-23_analysis.md: 신규

PR plan: ~/.claude/plans/pr-2-serialized-hummingbird.md
Phase 1 plan: ~/.claude/plans/phase-1-graded-eval-v0-2.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 03:32:55 +00:00
hyungi 725a4e1f1d feat(eval): v0.2 graded relevance schema + harness
queries.yaml v0.1 23 case → v0.2 schema swap:
- 7 카테고리 (standards / korean_only / english_only / mixed / exam /
  ocr_derived / failure_expected)
- language / ocr_derived / failure_expected / graded_relevance 컬럼 추가
- v0.1 호환 보존 (legacy_category + relevant_ids + top3_ids)
- 신규 28 case (50+ 목표) 는 후속 PR-Eval-V0_2-Baseline-Analysis

run_eval.py 확장:
- graded_ndcg_at_k / graded_recall_at_k 함수 추가
- Query / QueryResult dataclass 확장 (v0.2 컬럼)
- load_queries v0.1 fallback (top3 → grade 3, 나머지 → grade 2)
- --eval-version v0.1/v0.2/both flag (default both)
- print_summary 의 by_language / by_ocr_derived 집계 추가
- write_csv 의 graded 컬럼 추가

README.md 신규:
- graded 등급 정의 (0~3) + 카테고리 정의 (7개)
- v0.2 schema 컬럼 + 신규 case 작성 가이드
- v0.1 호환성 + CLI 사용 예 + baseline 박제 정책

Phase 1 plan: ~/.claude/plans/phase-1-graded-eval-v0-2.md
Parent: ~/.claude/plans/peppy-hugging-nest.md § Phase 1

본 PR closure: schema + harness + README. 신규 28 case + baseline 박제 +
약점 분석 (embedding-sensitive failure pattern 4 카테고리 식별) 은 후속 PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 01:21:06 +00:00
Hyungi Ahn 51a6f7c9af feat(eval): 발주건 단위 baseline 평가 경로 추가
- run_eval.py: --queries-order / --order-groups / --output-order / --debug
  옵션 추가. 기존 legacy CSV 스키마/값 불변 (출력 소비자 보호).
- Tier 1A/1B/2 지표 구현: cross_format_link_success (top-10 공식 +
  top-5 보조, eligible/success 분수), top_5_document_match (guardrail +
  절대 건수), manual_refind_flag (v0 heuristic), chunk_idx_stddev,
  range/page_citation_available capability flags.
- order_groups.yaml: 발주건 3건 매핑 (TKP-26-0114/0132/0112, 10 docs).
- queries_order_baseline.yaml: 12개 질문 (A:4 B:4 C:3 D:1).

plan: ~/.claude/plans/merry-yawning-owl.md
2026-04-20 15:04:39 +09:00
Hyungi Ahn 21a78fbbf0 fix(search): semaphore로 LLM concurrency=1 강제 + run_eval analyze 파라미터 추가
## 배경
1차 Phase 2.2 eval에서 발견: 23개 쿼리가 순차 호출되지만 각 request의
background analyzer task는 모두 동시에 MLX에 요청 날림 → MLX single-inference
서버 queue 폭발 → 22개가 15초 timeout. cache 채워지지 않음.

## 수정

### query_analyzer.py
 - LLM_CONCURRENCY = 1 상수 추가
 - _LLM_SEMAPHORE: lazy init asyncio.Semaphore (event loop 바인딩)
 - analyze() 내부: semaphore → timeout(실제 LLM 호출만) 이중 래핑
   semaphore 대기 시간이 timeout에 포함되지 않도록 주의

### run_eval.py
 - --analyze true|false 파라미터 추가 (Phase 2.1+ 측정용)
 - call_search / evaluate 시그니처에 analyze 전달

## 기대 효과
 - prewarm/background/동기 호출 모두 1개씩 순차 MLX 호출
 - 23개 대기 시 최악 230초 소요, 단 모두 성공해서 cache 채움
 - MLX 서버 부하 안정

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 15:12:13 +09:00
Hyungi Ahn 76e723cdb1 feat(search): Phase 1.3 TEI reranker 통합 (코드 골격)
데이터 흐름 원칙: fusion=doc 기준 / reranker=chunk 기준 — 절대 섞지 말 것.

신규/수정:
- ai/client.py: rerank() 메서드 추가 (TEI POST /rerank API)
- services/search/rerank_service.py:
    - rerank_chunks() — asyncio.Semaphore(2) + 5s soft timeout + RRF fallback
    - _make_snippet/_extract_window — title + query 중심 200~400 토큰
      (keyword 매치 없으면 첫 800자 fallback)
    - apply_diversity() — max_per_doc=2, top score>=0.90 unlimited
    - warmup_reranker() — 10회 retry + 3초 간격 (TEI 모델 로딩 대기)
    - MAX_RERANK_INPUT=200, MAX_CHUNKS_PER_DOC=2 hard cap
- services/search_telemetry.py: compute_confidence_reranked() — sigmoid score 임계값
- api/search.py:
    - ?rerank=true|false 파라미터 (기본 true, hybrid 모드만)
    - 흐름: fused_docs(limit*5) → chunks_by_doc 회수 → rerank_chunks → apply_diversity
    - text-only 매치 doc은 doc 자체를 chunk처럼 wrap (fallback)
    - rerank 활성 시 confidence는 reranker score 기반
- tests/search_eval/run_eval.py: --rerank true|false 플래그

GPU 적용 보류:
- TEI 컨테이너 추가 (docker-compose.yml) — 별도 작업
- config.yaml rerank.endpoint 갱신 — GPU 직접 (commit 없음)
- 재인덱싱 완료 후 build + warmup + 평가셋 측정
2026-04-08 12:41:47 +09:00
Hyungi Ahn 161ff18a31 feat(search): Phase 0.5 RRF fusion + 강한 신호 boost
기존 weighted-sum merge를 Reciprocal Rank Fusion으로 교체.
정확 키워드 매치에서 RRF가 평탄화되는 문제는 boost로 보완.

신규 모듈 app/services/search_fusion.py:
- FusionStrategy ABC
- LegacyWeightedSum  : 기존 _merge_results 동작 (A/B 비교용)
- RRFOnly            : 순수 RRF, k=60
- RRFWithBoost       : RRF + title/tags/법령조문/high-text-score boost (default)
- normalize_display_scores: SearchResult.score를 [0..1] 랭크 기반 정규화
  (프론트엔드가 score*100을 % 표시하므로 RRF 원본 점수 노출 시 표시 깨짐)

search.py:
- ?fusion=legacy|rrf|rrf_boost 파라미터 (default rrf_boost)
- _merge_results 제거 (LegacyWeightedSum에 흡수)
- pre-fusion confidence: hybrid는 raw text/vector 신호로 계산
  (fused score는 fusion 전략마다 스케일이 달라 일관 비교 불가)
- timing에 fusion_ms 추가
- debug notes에 fusion 전략 표시

telemetry:
- compute_confidence_hybrid(text_results, vector_results) 헬퍼
- record_search_event에 confidence override 파라미터

run_eval.py:
- --fusion CLI 옵션, call_search 쿼리 파라미터에 전달

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 08:58:33 +09:00
Hyungi Ahn 8490cfed10 test(search): Phase 0.2 평가셋 + 평가 스크립트
22개 쿼리(6개 카테고리)와 Recall/MRR/NDCG@10 + latency p50/p95
측정 스크립트 추가. wiggly-weaving-puppy 플랜 Phase 0.2 산출물.

- queries.yaml: 정확키워드/한국어자연어/crosslingual/뉴스/실패 케이스
  실제 코퍼스(2026-04-07, 753 docs) 기반 정답 doc_id 매핑
- run_eval.py: 단일 평가 + A/B 비교 모드, CSV 저장

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 08:19:38 +09:00