feat(eval): Phase 2A Diagnose Phase 3+4: dispatcher + 3 measurement runs + decision (H3: keep bge-m3)
Plan: phase-2a-embedding-diagnose.md v4 § 6 (dispatcher) + § 7 Phase 3 (51-case measurement) + § 7 Phase 4 (decision)
Round 2 review: round-2-review-mighty-starfish.md (R2-2 + R2-B1 pair invariant + slug-based resolve)

Code changes:
- app/services/search/retrieval_service.py:
  - CANDIDATE_BACKEND_MAP allowlist (baseline / cand_me5_large_inst / cand_snowflake_l_v2)
  - _resolve_backend(slug) → docs_table/chunks_table/embed_endpoint, or None
  - _embed_query_via_tei(): calls the candidate TEI endpoint (no cache)
  - _VALID_DOCS_TABLE + _VALID_CHUNKS_TABLE regexes (R2-B1 two-stage gate)
  - _search_vector_docs / _search_vector_chunks: docs_table/chunks_table + snapshot_*_id_max parameters
  - search_vector + search_vector_multilingual: embedding_backend + snapshot_*_id_max parameters + dispatch log
- app/services/search/search_pipeline.py: run_search() signature + threading through the 4 search_vector* calls
- app/api/search.py: 3 Query parameters + ValueError → HTTP 400 (response lists the allowed values)
- tests/search_eval/run_eval.py: --embedding-backend + --snapshot-doc-id-max + --snapshot-chunk-id-max,
  threaded through call_search/call_search_full/evaluate and the 3 asyncio.run calls in main

Measurement artifacts (51 cases: scored=46, failure=5):
- reports/v0_2_phase2a_baseline_snapshot_2026-05-23.csv (production path with the snapshot filter applied)
- reports/v0_2_phase2a_me5_large_inst_2026-05-23.csv
- reports/v0_2_phase2a_snowflake_l_v2_2026-05-23.csv
- tests/search_eval/baselines/v0_2_phase2a_{baseline_snapshot,me5_large_inst,snowflake_l_v2}_2026-05-23.json (3 files)

Results (graded NDCG@10; mixed and korean_only are per-category graded NDCG@10):
| Candidate                  | graded NDCG@10 | Δ vs baseline | mixed | korean_only | p50 ms |
|----------------------------|---------------:|--------------:|------:|------------:|-------:|
| bge-m3 (baseline snapshot) |          0.659 |           n/a |  0.39 |        0.51 |    464 |
| cand_me5_large_inst        |          0.477 |        -0.182 |  0.17 |        0.47 |    194 |
| cand_snowflake_l_v2        |          0.616 |        -0.043 |  0.35 |        0.52 |    254 |
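
Decision (H3): keep bge-m3. Both candidates are a net regression.
- mE5-large-instruct: regresses in every category (-0.182 overall). The query prefix was not applied, which is an uncontrolled variable; candidate for a separate PR, PR-2A-mE5-Prefix-Retry.
- snowflake_l_v2: mild regression (-0.043 overall). korean_only +0.01 is a marginal improvement signal.
- To address the korean_only/mixed weakness, Phase 2B (Reranker) or Phase 2Q (Query rewrite) is recommended.
Decision report: reports/phase_2a_embedding_decision_2026-05-23.md (includes § 1~8; all 16 closure-gate items PASS).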
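
As a quick check of the deltas quoted above, the following sketch recomputes them from the three rows of the results table. The numbers are copied from this commit; the helper name `delta_vs_baseline` and the "every candidate regresses overall" rule are simplifications introduced here for illustration, not the decision procedure in the report.

```python
# Values copied from the results table in this commit message.
BASELINE = {"graded_ndcg_at_10": 0.659, "mixed": 0.39, "korean_only": 0.51}
CANDIDATES = {
    "cand_me5_large_inst": {"graded_ndcg_at_10": 0.477, "mixed": 0.17, "korean_only": 0.47},
    "cand_snowflake_l_v2": {"graded_ndcg_at_10": 0.616, "mixed": 0.35, "korean_only": 0.52},
}


def delta_vs_baseline(candidate, baseline, ndigits=3):
    """Per-metric difference (candidate minus baseline), rounded for display."""
    return {key: round(candidate[key] - baseline[key], ndigits) for key in baseline}


deltas = {slug: delta_vs_baseline(scores, BASELINE) for slug, scores in CANDIDATES.items()}

# Simplified rule (an assumption): keep the baseline when no candidate
# improves the overall graded NDCG@10.
keep_baseline = all(d["graded_ndcg_at_10"] < 0 for d in deltas.values())

for slug, d in deltas.items():
    print(slug, d)
print("keep baseline:", keep_baseline)
```

A per-category gain such as korean_only on snowflake_l_v2 does not change the outcome under this rule, because the overall delta is still negative.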

Follow-up PR backlog:
- PR-2A-mE5-Prefix-Retry (separate PR)
- PR-2A-Extended-Bge-Mgemma2 (separate PR, decided in v3)
- PR-2A-Cloud-Embedding-Scaffold-1 (Cohere/Voyage scaffold-only, optional)
- PR-Search-Query-Rewrite-1 (Phase 2Q)
- PR-Search-Reranker-V2-Diagnose (Phase 2B)
- PR-2A-Chunks-Cand-Cleanup-1 (DROP the cand tables after 1 week)

Production impact:
- documents / document_chunks: 0 column or row changes
- config.yaml: 0 changes (ollama bge-m3 unchanged)
- The new endpoint behavior is opt-in via query parameters (when they are omitted, the production path is unchanged)
- 4 smoke tests PASS (baseline / baseline+snapshot / cand_me5 / cand_invalid → HTTP 400)
- Verified that the dispatch log records snapshot_doc_id_max / snapshot_chunk_id_max
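
The opt-in behaviour can be illustrated with a small sketch of how a client might build the query string: the three new parameters are sent only when set, so a request without them is identical to a pre-change request. The parameter names match this commit; the function name `build_search_params` and the `q` key are hypothetical.

```python
def build_search_params(
    query,
    *,
    embedding_backend=None,
    snapshot_doc_id_max=None,
    snapshot_chunk_id_max=None,
):
    """Build query parameters, omitting every opt-in value that is None."""
    params = {"q": query}  # "q" is a placeholder key, not confirmed by the source
    optional = {
        "embedding_backend": embedding_backend,
        "snapshot_doc_id_max": snapshot_doc_id_max,
        "snapshot_chunk_id_max": snapshot_chunk_id_max,
    }
    for name, value in optional.items():
        if value is not None:
            params[name] = value
    return params


# No opt-in arguments: the request carries no Phase 2A parameters at all.
production = build_search_params("bge-m3")

# Candidate run pinned to the snapshot recorded in the baseline JSON files.
candidate = build_search_params(
    "bge-m3",
    embedding_backend="cand_snowflake_l_v2",
    snapshot_doc_id_max=25180,
    snapshot_chunk_id_max=56526,
)
```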
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,46 @@
+{
+  "version": "v0.2-phase2a",
+  "label": "baseline_snapshot",
+  "date": "2026-05-23",
+  "snapshot": {
+    "doc_id_max": 25180,
+    "chunk_id_max": 56526,
+    "documents_n": 21365,
+    "chunks_n": 30605
+  },
+  "eval_set": {
+    "total_cases": 51,
+    "scored_cases": 46,
+    "failure_expected_cases": 5
+  },
+  "model_config": {
+    "embedding": "BAAI/bge-m3 (production)",
+    "reranker": "BAAI/bge-reranker-v2-m3",
+    "search_mode": "hybrid",
+    "rerank_enabled": "server_default",
+    "embedding_backend": "baseline",
+    "plan": "phase-2a-embedding-diagnose.md v4"
+  },
+  "overall": {
+    "n": 46,
+    "graded_ndcg_at_10": 0.659,
+    "graded_recall_at_10_t2": 0.695,
+    "graded_recall_at_10_t3": 0.761,
+    "latency_p50_ms": 464,
+    "latency_p95_ms": 1582,
+    "failure_correct": "0/5"
+  },
+  "by_category": {
+    "english_only": { "n": 9, "recall_at_10": 0.78, "ndcg_at_10": 0.71, "graded_ndcg_at_10": 0.78 },
+    "exam": { "n": 7, "recall_at_10": 0.57, "ndcg_at_10": 0.62, "graded_ndcg_at_10": 0.74 },
+    "korean_only": { "n": 9, "recall_at_10": 0.55, "ndcg_at_10": 0.47, "graded_ndcg_at_10": 0.51 },
+    "mixed": { "n": 10, "recall_at_10": 0.38, "ndcg_at_10": 0.36, "graded_ndcg_at_10": 0.39 },
+    "standards": { "n": 11, "recall_at_10": 0.91, "ndcg_at_10": 0.85, "graded_ndcg_at_10": 0.87 }
+  },
+  "by_language": {
+    "en": { "n": 9, "recall_at_10": 0.78, "graded_ndcg_at_10": 0.78 },
+    "ko": { "n": 27, "recall_at_10": 0.70, "graded_ndcg_at_10": 0.72 },
+    "mixed": { "n": 10, "recall_at_10": 0.38, "graded_ndcg_at_10": 0.39 }
+  },
+  "raw_csv": "reports/v0_2_phase2a_baseline_snapshot_2026-05-23.csv"
+}
@@ -0,0 +1,60 @@
+{
+  "version": "v0.2-phase2a",
+  "label": "cand_me5_large_inst",
+  "date": "2026-05-23",
+  "snapshot": {
+    "doc_id_max": 25180,
+    "chunk_id_max": 56526,
+    "documents_n": 21365,
+    "chunks_n": 30605
+  },
+  "eval_set": {
+    "total_cases": 51,
+    "scored_cases": 46,
+    "failure_expected_cases": 5
+  },
+  "model_config": {
+    "embedding": "intfloat/multilingual-e5-large-instruct",
+    "dim": 1024,
+    "context": 512,
+    "reranker": "BAAI/bge-reranker-v2-m3",
+    "search_mode": "hybrid",
+    "rerank_enabled": "server_default",
+    "embedding_backend": "cand_me5_large_inst",
+    "endpoint": "http://embedding-cand-me5-inst:80/embed",
+    "truncate": true,
+    "prefix": "NOT_APPLIED: the 'Instruct: ' query prefix recommended for mE5-instruct was not applied (candidate for a separate PR)",
+    "plan": "phase-2a-embedding-diagnose.md v4"
+  },
+  "overall": {
+    "n": 46,
+    "graded_ndcg_at_10": 0.477,
+    "graded_recall_at_10_t2": 0.622,
+    "graded_recall_at_10_t3": 0.620,
+    "latency_p50_ms": 194,
+    "latency_p95_ms": 1348,
+    "failure_correct": "0/5"
+  },
+  "by_category": {
+    "english_only": { "n": 9, "recall_at_10": 0.67, "ndcg_at_10": 0.60, "graded_ndcg_at_10": 0.63 },
+    "exam": { "n": 7, "recall_at_10": 0.76, "ndcg_at_10": 0.59, "graded_ndcg_at_10": 0.62 },
+    "korean_only": { "n": 9, "recall_at_10": 0.66, "ndcg_at_10": 0.48, "graded_ndcg_at_10": 0.47 },
+    "mixed": { "n": 10, "recall_at_10": 0.21, "ndcg_at_10": 0.19, "graded_ndcg_at_10": 0.17 },
+    "standards": { "n": 11, "recall_at_10": 0.68, "ndcg_at_10": 0.55, "graded_ndcg_at_10": 0.54 }
+  },
+  "by_language": {
+    "en": { "n": 9, "recall_at_10": 0.67, "graded_ndcg_at_10": 0.63 },
+    "ko": { "n": 27, "recall_at_10": 0.69, "graded_ndcg_at_10": 0.54 },
+    "mixed": { "n": 10, "recall_at_10": 0.21, "graded_ndcg_at_10": 0.17 }
+  },
+  "raw_csv": "reports/v0_2_phase2a_me5_large_inst_2026-05-23.csv",
+  "delta_vs_baseline": {
+    "graded_ndcg_at_10": -0.182,
+    "mixed": -0.22,
+    "korean_only": -0.04,
+    "standards": -0.33,
+    "english_only": -0.15,
+    "exam": -0.12,
+    "latency_p50_ms": -270
+  }
+}
@@ -0,0 +1,59 @@
+{
+  "version": "v0.2-phase2a",
+  "label": "cand_snowflake_l_v2",
+  "date": "2026-05-23",
+  "snapshot": {
+    "doc_id_max": 25180,
+    "chunk_id_max": 56526,
+    "documents_n": 21365,
+    "chunks_n": 30605
+  },
+  "eval_set": {
+    "total_cases": 51,
+    "scored_cases": 46,
+    "failure_expected_cases": 5
+  },
+  "model_config": {
+    "embedding": "Snowflake/snowflake-arctic-embed-l-v2.0",
+    "dim": 1024,
+    "context": 8192,
+    "reranker": "BAAI/bge-reranker-v2-m3",
+    "search_mode": "hybrid",
+    "rerank_enabled": "server_default",
+    "embedding_backend": "cand_snowflake_l_v2",
+    "endpoint": "http://embedding-cand-snowflake-l-v2:80/embed",
+    "truncate": true,
+    "plan": "phase-2a-embedding-diagnose.md v4"
+  },
+  "overall": {
+    "n": 46,
+    "graded_ndcg_at_10": 0.616,
+    "graded_recall_at_10_t2": 0.726,
+    "graded_recall_at_10_t3": 0.728,
+    "latency_p50_ms": 254,
+    "latency_p95_ms": 1412,
+    "failure_correct": "0/5"
+  },
+  "by_category": {
+    "english_only": { "n": 9, "recall_at_10": 0.78, "ndcg_at_10": 0.68, "graded_ndcg_at_10": 0.74 },
+    "exam": { "n": 7, "recall_at_10": 0.67, "ndcg_at_10": 0.54, "graded_ndcg_at_10": 0.56 },
+    "korean_only": { "n": 9, "recall_at_10": 0.60, "ndcg_at_10": 0.50, "graded_ndcg_at_10": 0.52 },
+    "mixed": { "n": 10, "recall_at_10": 0.40, "ndcg_at_10": 0.32, "graded_ndcg_at_10": 0.35 },
+    "standards": { "n": 11, "recall_at_10": 0.91, "ndcg_at_10": 0.85, "graded_ndcg_at_10": 0.87 }
+  },
+  "by_language": {
+    "en": { "n": 9, "recall_at_10": 0.78, "graded_ndcg_at_10": 0.74 },
+    "ko": { "n": 27, "recall_at_10": 0.74, "graded_ndcg_at_10": 0.67 },
+    "mixed": { "n": 10, "recall_at_10": 0.40, "graded_ndcg_at_10": 0.35 }
+  },
+  "raw_csv": "reports/v0_2_phase2a_snowflake_l_v2_2026-05-23.csv",
+  "delta_vs_baseline": {
+    "graded_ndcg_at_10": -0.043,
+    "mixed": -0.04,
+    "korean_only": 0.01,
+    "standards": 0.00,
+    "english_only": -0.04,
+    "exam": -0.18,
+    "latency_p50_ms": -210
+  }
+}
@@ -199,6 +199,9 @@ async def call_search(
     fusion: str | None = None,
     rerank: str | None = None,
     analyze: str | None = None,
+    embedding_backend: str | None = None,
+    snapshot_doc_id_max: int | None = None,
+    snapshot_chunk_id_max: int | None = None,
 ) -> tuple[list[int], float]:
     """Call the search API → (doc_ids, latency_ms)."""
     url = f"{base_url.rstrip('/')}/api/search/"
@@ -210,6 +213,12 @@ async def call_search(
         params["rerank"] = rerank
     if analyze is not None:
         params["analyze"] = analyze
+    if embedding_backend is not None:
+        params["embedding_backend"] = embedding_backend
+    if snapshot_doc_id_max is not None:
+        params["snapshot_doc_id_max"] = snapshot_doc_id_max
+    if snapshot_chunk_id_max is not None:
+        params["snapshot_chunk_id_max"] = snapshot_chunk_id_max
 
     import time
 
@@ -237,6 +246,9 @@ async def evaluate(
     fusion: str | None = None,
     rerank: str | None = None,
     analyze: str | None = None,
+    embedding_backend: str | None = None,
+    snapshot_doc_id_max: int | None = None,
+    snapshot_chunk_id_max: int | None = None,
 ) -> list[QueryResult]:
     """Evaluate the full query set."""
     results: list[QueryResult] = []
@@ -245,7 +257,10 @@ async def evaluate(
         for q in queries:
             try:
                 returned_ids, latency_ms = await call_search(
-                    client, base_url, token, q.query, mode=mode, fusion=fusion, rerank=rerank, analyze=analyze
+                    client, base_url, token, q.query, mode=mode, fusion=fusion, rerank=rerank, analyze=analyze,
+                    embedding_backend=embedding_backend,
+                    snapshot_doc_id_max=snapshot_doc_id_max,
+                    snapshot_chunk_id_max=snapshot_chunk_id_max,
                 )
                 results.append(
                     QueryResult(
@@ -819,6 +834,9 @@ async def call_search_full(
     rerank: str | None = None,
     analyze: str | None = None,
     debug: bool = False,
+    embedding_backend: str | None = None,
+    snapshot_doc_id_max: int | None = None,
+    snapshot_chunk_id_max: int | None = None,
 ) -> tuple[list[dict], float]:
     """Same logic as call_search, but returns the list of full result dicts."""
     url = f"{base_url.rstrip('/')}/api/search/"
@@ -832,6 +850,12 @@ async def call_search_full(
         params["analyze"] = analyze
     if debug:
         params["debug"] = "true"
+    if embedding_backend is not None:
+        params["embedding_backend"] = embedding_backend
+    if snapshot_doc_id_max is not None:
+        params["snapshot_doc_id_max"] = snapshot_doc_id_max
+    if snapshot_chunk_id_max is not None:
+        params["snapshot_chunk_id_max"] = snapshot_chunk_id_max
 
     import time
 
@@ -1266,6 +1290,25 @@ def main() -> int:
         choices=["v0.1", "v0.2", "both"],
         help="Score output mode (Phase 1, default both). v0.1=binary only / v0.2=graded only / both=both",
     )
+    parser.add_argument(
+        "--embedding-backend",
+        type=str,
+        default=None,
+        help="Phase 2A Diagnose dispatcher slug (baseline | cand_me5_large_inst | cand_snowflake_l_v2). Unspecified = production.",
+    )
+    parser.add_argument(
+        "--snapshot-doc-id-max",
+        type=int,
+        default=None,
+        help="Phase 2A snapshot freeze. Filters on documents.id <= value. Also applied when re-baselining the baseline.",
+    )
+    parser.add_argument(
+        "--snapshot-chunk-id-max",
+        type=int,
+        default=None,
+        help="Phase 2A snapshot freeze. Filters on document_chunks.id <= value. Also applied when re-baselining the baseline.",
+    )
 
     args = parser.parse_args()
 
     if not args.token:
@@ -1318,21 +1361,21 @@ def main() -> int:
     if args.base_url:
         print(f"\n>>> evaluating: {args.base_url}")
         results = asyncio.run(
-            evaluate(queries, args.base_url, args.token, "single", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze)
+            evaluate(queries, args.base_url, args.token, "single", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze, embedding_backend=args.embedding_backend, snapshot_doc_id_max=args.snapshot_doc_id_max, snapshot_chunk_id_max=args.snapshot_chunk_id_max)
         )
         print_summary("single", results, eval_version=args.eval_version)
         all_results.extend(results)
     else:
         print(f"\n>>> baseline: {args.baseline_url}")
         baseline_results = asyncio.run(
-            evaluate(queries, args.baseline_url, args.token, "baseline", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze)
+            evaluate(queries, args.baseline_url, args.token, "baseline", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze, embedding_backend=args.embedding_backend, snapshot_doc_id_max=args.snapshot_doc_id_max, snapshot_chunk_id_max=args.snapshot_chunk_id_max)
         )
         baseline_summary = print_summary("baseline", baseline_results, eval_version=args.eval_version)
 
         print(f"\n>>> candidate: {args.candidate_url}")
         candidate_results = asyncio.run(
             evaluate(
-                queries, args.candidate_url, args.token, "candidate", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze
+                queries, args.candidate_url, args.token, "candidate", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze, embedding_backend=args.embedding_backend, snapshot_doc_id_max=args.snapshot_doc_id_max, snapshot_chunk_id_max=args.snapshot_chunk_id_max
             )
         )
         candidate_summary = print_summary("candidate", candidate_results, eval_version=args.eval_version)