Hyungi Ahn ec36ea3d6d test(search): Phase 0.2 baseline measurement results

Baseline of the current search (FTS + ILIKE + vector hybrid) over 23 queries.
Kept as the reference point for comparing Phase 1+ improvements.

Overall: Recall@10 0.788 / NDCG@10 0.705 / Top-3 0.95 / p95 1695ms

Key weaknesses (Phase 1+ targets):
- news_crosslingual catastrophic (Recall 0.14) → domain-aware retrieval required
- failure-case precision 0/3 → no confidence threshold
- p95 1695ms (more than 3x the 500ms target) → trigram/parallel retrieval
- weak top-3 ordering for nl queries → chunk-level + reranker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-07 08:22:53 +09:00

Search Eval — Baseline 2026-04-07

Baseline measured at the completion of Phase 0.2. Reference point for comparing Phase 1+ improvements.

Eval set: tests/search_eval/queries.yaml v0.1 (23 queries)
Eval script: tests/search_eval/run_eval.py
API: current production search (FTS + ILIKE + vector weighted-sum hybrid mode)
Corpus: 753 documents (2026-04-07)
Environment: fastapi container on the GPU server (http://localhost:8000)

Overall metrics (scored=20, excluding failure=3)

Metric	Value
Recall@10	0.788
MRR@10	0.751
NDCG@10	0.705
Top-3 hit rate	0.950
Latency p50	544 ms
Latency p95	1695 ms
Failure-case precision	0.00 (0/3)
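
The ranking metrics in the table can be sketched as follows. This is a minimal illustration, not the contents of run_eval.py: it assumes binary relevance (a document is either expected or not), and the function names and the nearest-rank percentile method are assumptions made for this example.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of the expected doc ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant hit within the top k, else 0."""
    rel = set(relevant)
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary gains: DCG of this ranking over DCG of an ideal one."""
    rel = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in rel
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0


def percentile(values, pct):
    """Nearest-rank percentile (assumed method; other definitions interpolate)."""
    ordered = sorted(values)
    idx = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[idx]
```

With these definitions, a query whose first correct document sits at rank 3 gets a reciprocal rank of 1/3, which is how a per-query MRR of 0.333 arises.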

By category (Recall@10 / NDCG@10)

Category	n	Recall@10	NDCG@10	Notes
exact_keyword	5	1.00	1.00	FTS handles keyword matches perfectly
other_domain (engineering mechanics)	2	1.00	0.80
crosslingual_ko_en	3	0.92	0.74	effect of bge-m3 embeddings
natural_language_ko	5	0.73	0.68	room to improve with chunking + reranker
news_fr (Le Monde)	1	0.75	0.82
news_ko (Kyunghyang)	2	0.56	0.37	weak top-3 ordering
news_en (Der Spiegel EN)	1	0.33	0.23
news_crosslingual	1	0.14	0.08	catastrophic; domain-aware retrieval required

Main weaknesses (Phase 1+ improvement targets)

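1. No failure-case handling (0/3)

The three queries "Rust async runtime tokio", "양자컴퓨터 큐비트" (quantum computer qubit), and "재즈 보컬리스트 빌리 홀리데이" (jazz vocalist Billie Holiday) have zero correct answers in the corpus, yet vector similarity always returns something.
The current API has no confidence threshold.
→ Needs Phase 0.3 search_failure_logs, a Phase 2 three-level confidence fallback, and a Phase 3 confidence field in the response.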
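
A confidence threshold of the kind described here could look roughly like the sketch below. Everything in it is hypothetical: the `Hit` type, the `min_score` value of 0.5, and the assumption that scores are similarities in [0, 1] are chosen for illustration only. The failure-case precision function reflects one plausible reading of the metric (the share of no-answer queries for which nothing was returned).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: int
    score: float  # assumed: similarity in [0, 1], higher is better


def apply_confidence_threshold(hits, min_score=0.5):
    """Drop hits scoring below min_score; an empty list means 'no answer'."""
    return [hit for hit in hits if hit.score >= min_score]


def failure_case_precision(results_per_query):
    """Share of failure-case queries for which the search returned nothing."""
    if not results_per_query:
        return 0.0
    correct = sum(1 for hits in results_per_query if not hits)
    return correct / len(results_per_query)
```

The threshold itself would have to be tuned on the eval set, since a value that empties the failure-case results could also cut legitimate low-scoring hits.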

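2. Multilingual news search is catastrophic (Recall 0.14)

The Korean query "이란 미국 전쟁 글로벌 반응" (global reaction to the Iran-US war) expected 7 multilingual news articles; only 1 was retrieved.
The current vector embedding is strongly biased toward the Korean part of the corpus.
→ Needs a Phase 1 domain-aware retrieval branch, plus a Phase 2 normalized_queries array and multilingual tier strategy.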

3. Latency p95 1695ms (more than 3x the 500ms target)

exact_keyword queries frequently take 1.5–1.8 s (kw_001, kw_003, kw_004).
The suspected main cause is a full scan triggered by ILIKE %q%; the trigram index is not being used.
→ Phase 1 needs proper trigram usage (similarity operator) and parallel retrieval (asyncio.gather).
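
The parallel retrieval idea can be sketched with asyncio.gather as below. The three coroutines are hypothetical stand-ins that only sleep; in the real service they would be the FTS, trigram, and vector queries, whose actual names and signatures are not shown in this report.

```python
import asyncio
import time


async def fts_search(query):
    """Hypothetical stand-in for the FTS leg."""
    await asyncio.sleep(0.05)
    return [("fts", 1)]


async def trigram_search(query):
    """Hypothetical stand-in for the trigram leg."""
    await asyncio.sleep(0.05)
    return [("trgm", 2)]


async def vector_search(query):
    """Hypothetical stand-in for the vector leg."""
    await asyncio.sleep(0.05)
    return [("vec", 3)]


async def retrieve_parallel(query):
    """Run the three legs concurrently; gather returns results in call order."""
    fts, trgm, vec = await asyncio.gather(
        fts_search(query), trigram_search(query), vector_search(query)
    )
    return {"fts": fts, "trigram": trgm, "vector": vec}


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(retrieve_parallel("example"))
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Run this way, total latency approaches that of the slowest leg rather than the sum of all three, so the gain depends on how evenly the legs share the time. If the ILIKE scan dominates, fixing the trigram index usage matters more than parallelism.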

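4. Weak top-3 accuracy for natural_language_ko

nl_003: MRR 0.333 (first correct hit at rank 3); the relevant documents do not rise to the top.
nl_005: Recall 0.5 (enforcement decree 3865 missed), attributed to the lack of chunk-level search.
→ Phase 1 needs chunk-based retrieval + reranker.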
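
One common way to turn chunk-level hits back into a document ranking is to score each document by its best chunk. The sketch below assumes hits arrive as (doc_id, chunk_id, score) tuples with higher scores being better; the function name and tuple layout are illustrative, and the plan may well aggregate differently.

```python
def aggregate_chunks_to_docs(chunk_hits, top_k=10):
    """Collapse chunk hits to one entry per document, keeping the best chunk."""
    best = {}
    for doc_id, chunk_id, score in chunk_hits:
        if doc_id not in best or score > best[doc_id][1]:
            best[doc_id] = (chunk_id, score)
    # Rank documents by the score of their best-matching chunk.
    ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
    return [(doc_id, chunk_id, score) for doc_id, (chunk_id, score) in ranked[:top_k]]
```

The resulting top documents, together with their best chunk, would then be the natural input to a reranker.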

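Strengths (already working well)

Exact keyword search is perfect (FTS's core strength)
Korean→English crosslingual already reaches 0.92 Recall thanks to bge-m3
Top-3 hit rate of 95% (in most cases the answer lands on the first page)
Semantic search also works well in other domains such as engineering mechanics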

Next steps (execution order, wiggly-weaving-puppy plan)

Phase 0.3 search_failure_logs table
Phase 0.4 debug response option
Phase 0.5 RRF fusion
Phase 1 reranker + chunk-level retrieval
After Phase 1, rerun the same eval set → compare against this baseline
Phase 2 QueryAnalyzer (multilingual + domain_hint)
After Phase 2, rerun the eval set; news_crosslingual is expected to improve the most
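
For reference, the RRF fusion planned for Phase 0.5 can be sketched as below. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the ranked lists it appears in; k = 60 is a commonly used default, not a value taken from this project, and the function name and tie-breaking rule are assumptions of this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion over several ranked lists of doc ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]
```

Because RRF uses only ranks, it avoids having to calibrate FTS, ILIKE, and vector scores against each other, which is the main difficulty with a weighted-sum hybrid.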
