diff --git a/reports/baseline_2026-04-07.csv b/reports/baseline_2026-04-07.csv
new file mode 100644
index 0000000..d700e31
--- /dev/null
+++ b/reports/baseline_2026-04-07.csv
@@ -0,0 +1,24 @@
+label,id,category,intent,domain_hint,query,relevant_ids,returned_ids_top10,latency_ms,recall_at_10,mrr_at_10,ndcg_at_10,top3_hit,error
+single,kw_001,exact_keyword,fact_lookup,document,산업안전보건법 제6장,3856;3868;3879,3856;3868;3879;3912;3950;3908;3915;3911;3858;3853,1696.5,1.000,1.000,1.000,1,
+single,kw_002,exact_keyword,fact_lookup,document,중대재해 처벌 등에 관한 법률 제2장 중대산업재해,3917;3921,3921;3917;3916;3922;3920;3918;3919;3923;3874;3946,567.9,1.000,1.000,1.000,1,
+single,kw_003,exact_keyword,fact_lookup,document,화학물질관리법 유해화학물질 영업자,3981,3981;3980;3857;3988;3869;3880;3978;3986;3985;3979,1780.5,1.000,1.000,1.000,1,
+single,kw_004,exact_keyword,fact_lookup,document,근로기준법 안전과 보건,4041,4041;3853;3860;3883;3858;3865;4036;3885;3901;3888,1686.7,1.000,1.000,1.000,1,
+single,kw_005,exact_keyword,fact_lookup,document,산업안전보건기준에 관한 규칙 보호구,3888,3888;3905;3908;3909;3897;3911;3907;3912;3903;3906,539.6,1.000,1.000,1.000,1,
+single,nl_001,natural_language_ko,semantic_search,document,기계로 인한 산업재해 관련 법령,3856;3868;3879;3854,3868;3856;3879;3895;3915;3872;3851;3867;3897;3863,545.0,0.750,1.000,0.832,1,
+single,nl_002,natural_language_ko,semantic_search,document,사업주가 도급을 줄 때 산업재해를 예방하기 위해 해야 할 일,3855;3867;3878,3855;3867;3878;3863;3917;3872;3854;3896;3861;3886,552.1,1.000,1.000,1.000,1,
+single,nl_003,natural_language_ko,semantic_search,document,유해화학물질을 다루는 회사가 지켜야 할 안전 의무,3980;3981;3982,3857;3869;3980;3880;3896;3903;3854;3981;3909;3904,534.2,0.667,0.333,0.383,1,
+single,nl_004,natural_language_ko,semantic_search,document,중대재해가 발생했을 때 경영책임자가 처벌받는 기준,3916;3917;3920;3921,3918;3917;3919;3921;3916;3923;3867;3922;3877;3984,529.4,0.750,0.500,0.565,1,
+single,nl_005,natural_language_ko,semantic_search,document,안전보건교육은 누가 받아야 하고 어떤 내용을 다루는가,3853;3865,3853;3876;3965;3871;3958;3875;3861;3866;3877;3856,543.4,0.500,1.000,0.613,1,
+single,cl_001,crosslingual_ko_en,semantic_search,document,기계 안전 가드 설계 원리,3770;3856,3895;3770;3762;3773;3879;3856;3767;3788;3868;3897,532.2,1.000,0.500,0.605,1,
+single,cl_002,crosslingual_ko_en,semantic_search,document,산업 안전 입문서,3755;3775;3776;3777,3755;3816;3775;3851;3896;3853;3876;3871;3776;3863,1446.3,0.750,1.000,0.703,1,
+single,cl_003,crosslingual_ko_en,semantic_search,document,전기 안전 위험,3772;3790,3772;3897;3790;4024;4018;4020;4023;4022;4013;4019,1364.3,1.000,1.000,0.920,1,
+single,news_001,news_ko,semantic_search,news,이란과 미국의 군사 충돌,4303;4304;4307;4316;4322;4323;4327;4335,4452;4307;4317;4321;4339;4331;4329;4418;4446;4459,535.4,0.125,0.500,0.160,1,
+single,news_002,news_ko,semantic_search,news,호르무즈 해협 봉쇄,4316;4320;4322;4327,4349;4199;4346;4320;4322;4327;4340;4304;4316;4260,538.4,1.000,0.250,0.576,0,
+single,news_003,news_en,semantic_search,news,Trump Iran ultimatum,4258;4260;4262,4519;4202;4258;4321;4333;4515;4313;4445;4418;4314,533.7,0.333,0.333,0.235,1,
+single,news_004,news_fr,semantic_search,news,guerre en Iran,4199;4202;4210;4361;4363;4507;4519;4521,4199;4507;4521;4363;4519;4211;4258;4324;4210;4536,523.8,0.750,1.000,0.822,1,
+single,news_005,news_crosslingual,semantic_search,news,이란 미국 전쟁 글로벌 반응,4202;4258;4262;4536;4303;4304;4316,4329;4457;4307;4345;4324;4452;4443;4444;4450;4262,1483.2,0.143,0.100,0.079,1,
+single,misc_001,other_domain,fact_lookup,document,강체의 평면 운동학,4063;4065,4071;4063;4064;4058;4066;4065;4068;4060;4062;4059,568.0,1.000,0.500,0.605,1,
+single,misc_002,other_domain,semantic_search,document,질점의 운동역학,4060;4061;4062,4062;4060;4061;4058;4070;4059;4069;4071;4063;4067,552.0,1.000,1.000,1.000,1,
+single,fail_001,failure_expected,semantic_search,document,Rust async runtime tokio scheduler 내부 구조,,4069;3789;4067;4070;4060;4061;4071;4062;3807;4433,543.8,0.000,0.000,0.000,1,
+single,fail_002,failure_expected,semantic_search,document,양자컴퓨터 큐비트 디코히어런스,,4068;4058;4064;4060;4065;4063;4061;3899;4067;4196,527.0,0.000,0.000,0.000,1,
+single,fail_003,failure_expected,semantic_search,news,재즈 보컬리스트 빌리 홀리데이,,4289;4281;4205;4116;4100;4077;4316;4343;4235;4504,533.0,0.000,0.000,0.000,1,
diff --git a/reports/baseline_2026-04-07_summary.md b/reports/baseline_2026-04-07_summary.md
new file mode 100644
index 0000000..4e186de
--- /dev/null
+++ b/reports/baseline_2026-04-07_summary.md
@@ -0,0 +1,74 @@
+# Search Eval: Baseline 2026-04-07
+
+Baseline measurement taken at the completion of Phase 0.2. It is the reference point for comparing Phase 1+ improvements.
+
+- Eval set: `tests/search_eval/queries.yaml` v0.1 (23 queries)
+- Eval script: `tests/search_eval/run_eval.py`
+- API: current production search (hybrid mode, weighted sum of FTS + ILIKE + Vector)
+- Corpus: 753 documents (2026-04-07)
+- Environment: fastapi container on the GPU server (`http://localhost:8000`)
+
+## Overall metrics (scored=20, excluding failure=3)
+
+| Metric | Value |
+|---|---|
+| Recall@10 | **0.788** |
+| MRR@10 | **0.751** |
+| NDCG@10 | **0.705** |
+| Top-3 hit rate | **0.950** |
+| Latency p50 | **544 ms** |
+| Latency p95 | **1695 ms** |
+| Failure-case precision | **0.00 (0/3)** |
+
+## By category (Recall@10 / NDCG@10)
+
+| Category | n | Recall@10 | NDCG@10 | Notes |
+|---|---|---|---|---|
+| exact_keyword | 5 | **1.00** | **1.00** | FTS handles keywords perfectly |
+| other_domain (engineering mechanics) | 2 | 1.00 | 0.80 | |
+| crosslingual_ko_en | 3 | 0.92 | 0.74 | effect of bge-m3 embeddings |
+| natural_language_ko | 5 | 0.73 | 0.68 | room to improve with chunking + reranker |
+| news_fr (Le Monde) | 1 | 0.75 | 0.82 | |
+| news_ko (Kyunghyang) | 2 | 0.56 | 0.37 | weak top-3 ordering |
+| news_en (Der Spiegel EN) | 1 | 0.33 | 0.23 | |
+| **news_crosslingual** | 1 | **0.14** | **0.08** | **catastrophic: domain-aware retrieval required** |
+
+## Main weaknesses (Phase 1+ improvement targets)
+
+### 1. No failure-case handling (0/3)
+- "Rust async runtime tokio", "양자컴퓨터 큐비트" (quantum computer qubit), "재즈 보컬리스트 빌리 홀리데이" (jazz vocalist Billie Holiday):
+  the corpus contains zero relevant documents for all three queries, yet vector similarity always returns something.
+- The current API has no confidence threshold.
+- → Needs **Phase 0.3 search_failure_logs**, **Phase 2 three-stage confidence fallback**, and a **Phase 3 confidence response field**.
+
+### 2. Multilingual news search is catastrophic (Recall 0.14)
+- The Korean query "이란 미국 전쟁 글로벌 반응" (Iran-US war, global reactions) expects 7 multilingual news articles; only 1 is retrieved.
+- The current vector embedding is strongly biased toward the Korean corpus.
+- → Needs **Phase 1 domain-aware retrieval branching** and a **Phase 2 normalized_queries array + multilingual tier strategy**.
+
+### 3. Latency p95 1695 ms (more than 3x the 500 ms target)
+- Three of the five exact_keyword queries take 1.7–1.8 s (kw_001, kw_003, kw_004).
+- The ILIKE `%q%` full scan is the suspected main cause; the trigram index is not being used.
+- → Needs **Phase 1 proper trigram usage (similarity operator)** and **parallel retrieval (asyncio.gather)**.
+
+### 4. Weak natural_language_ko top-3 accuracy
+- nl_003: MRR 0.333 (first relevant hit at rank 3); the relevant documents fail to rise to the top.
+- nl_005: Recall 0.5 (enforcement decree 3865 is missing), an effect of the lack of chunk-level search.
+- → Needs **Phase 1 chunk-based retrieval + reranker**.
+
+## Strengths (already working well)
+
+- Exact keyword search is perfect (the core strength of FTS)
+- ko→en crosslingual search already reaches 0.92 Recall thanks to bge-m3
+- Top-3 hit rate is 95% (in most cases the answer lands on the first page)
+- Semantic search also works well in other domains such as engineering mechanics
+
+## Next steps (execution order, wiggly-weaving-puppy plan)
+
+1. Phase 0.3 search_failure_logs table
+2. Phase 0.4 debug response option
+3. Phase 0.5 RRF fusion
+4. Phase 1 reranker + chunk-level retrieval
+5. **After Phase 1 completes, re-run the same eval set → compare against this baseline**
+6. Phase 2 QueryAnalyzer (multilingual + domain_hint)
+7. **After Phase 2 completes, re-run the eval set**: news_crosslingual is expected to show the largest improvement
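
For readers checking the metric columns in `reports/baseline_2026-04-07.csv`, the sketch below shows one way the per-query values could be computed. It is not taken from `tests/search_eval/run_eval.py`; the function names (`recall_at_k`, `mrr_at_k`, `ndcg_at_k`, `parse_ids`) are hypothetical, and the formulas (binary relevance, `log2(rank + 1)` discount) are an assumption that happens to reproduce the reported numbers for the `nl_003` row.

```python
import math


def parse_ids(cell):
    """Split a semicolon-separated id cell from the CSV into a list of ints."""
    return [int(x) for x in cell.split(";") if x]


def recall_at_k(relevant, returned, k=10):
    """Fraction of relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(relevant) & set(returned[:k])) / len(relevant)


def mrr_at_k(relevant, returned, k=10):
    """Reciprocal rank of the first relevant id in the top-k, else 0."""
    rel = set(relevant)
    for rank, doc_id in enumerate(returned[:k], start=1):
        if doc_id in rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevant, returned, k=10):
    """NDCG with binary relevance and a log2(rank + 1) discount (assumed)."""
    rel = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(returned[:k], start=1)
        if doc_id in rel
    )
    # Ideal ranking: all relevant ids (at most k) placed at the top.
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(rel), k) + 1))
    return dcg / idcg if idcg else 0.0


# Values copied from the nl_003 row of the baseline CSV.
relevant = parse_ids("3980;3981;3982")
returned = parse_ids("3857;3869;3980;3880;3896;3903;3854;3981;3909;3904")

print(f"recall@10={recall_at_k(relevant, returned):.3f}")
print(f"mrr@10={mrr_at_k(relevant, returned):.3f}")
print(f"ndcg@10={ndcg_at_k(relevant, returned):.3f}")
```

Run as-is, this prints `recall@10=0.667`, `mrr@10=0.333`, and `ndcg@10=0.383`, matching the `nl_003` row. Rows with an empty `relevant_ids` cell (the `failure_expected` cases) evaluate to 0.0 under these definitions, which is consistent with the CSV but says nothing about how the actual script handles them.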