fix(search): enforce LLM concurrency=1 with a semaphore + add analyze parameter to run_eval

## Background
Found during the first Phase 2.2 eval: the 23 queries are called sequentially, but
each request's background analyzer task fires at MLX simultaneously → the queue on
the single-inference MLX server explodes → 22 requests hit the 15-second timeout,
so the cache never gets populated.

## Changes

### query_analyzer.py
 - Add LLM_CONCURRENCY = 1 constant
 - _LLM_SEMAPHORE: lazily initialized asyncio.Semaphore (bound to the event loop)
 - Inside analyze(): double-wrap with semaphore → timeout (timeout covers only the actual LLM call),
   taking care that time spent waiting on the semaphore does not count against the timeout

### run_eval.py
 - Add --analyze true|false parameter (for Phase 2.1+ measurements)
 - Thread analyze through the call_search / evaluate signatures

## Expected effect
 - prewarm, background, and synchronous calls each hit MLX one at a time, sequentially
 - Worst case ~230 seconds with 23 queries queued, but all succeed and populate the cache
 - Stable load on the MLX server

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Hyungi Ahn
Date: 2026-04-08 15:12:13 +09:00
Commit: 21a78fbbf0 (parent f5c3dea833)
2 changed files with 38 additions and 9 deletions


@@ -134,6 +134,7 @@ async def call_search(
     limit: int = 20,
     fusion: str | None = None,
     rerank: str | None = None,
+    analyze: str | None = None,
 ) -> tuple[list[int], float]:
     """Call the search API → (doc_ids, latency_ms)."""
     url = f"{base_url.rstrip('/')}/api/search/"
@@ -143,6 +144,8 @@ async def call_search(
         params["fusion"] = fusion
     if rerank is not None:
         params["rerank"] = rerank
+    if analyze is not None:
+        params["analyze"] = analyze
     import time
@@ -169,6 +172,7 @@ async def evaluate(
     mode: str = "hybrid",
     fusion: str | None = None,
     rerank: str | None = None,
+    analyze: str | None = None,
 ) -> list[QueryResult]:
     """Evaluate the full query set."""
     results: list[QueryResult] = []
@@ -177,7 +181,7 @@ async def evaluate(
     for q in queries:
         try:
             returned_ids, latency_ms = await call_search(
-                client, base_url, token, q.query, mode=mode, fusion=fusion, rerank=rerank
+                client, base_url, token, q.query, mode=mode, fusion=fusion, rerank=rerank, analyze=analyze
            )
            results.append(
                QueryResult(
@@ -415,6 +419,13 @@ def main() -> int:
         choices=["true", "false"],
         help="Enable bge-reranker-v2-m3 (Phase 1.3+; server default=true when omitted)",
     )
+    parser.add_argument(
+        "--analyze",
+        type=str,
+        default=None,
+        choices=["true", "false"],
+        help="Enable QueryAnalyzer (Phase 2.1+; applies multilingual on cache hit)",
+    )
     parser.add_argument(
         "--token",
         type=str,
@@ -454,21 +465,21 @@ def main() -> int:
     if args.base_url:
         print(f"\n>>> evaluating: {args.base_url}")
         results = asyncio.run(
-            evaluate(queries, args.base_url, args.token, "single", mode=args.mode, fusion=args.fusion, rerank=args.rerank)
+            evaluate(queries, args.base_url, args.token, "single", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze)
         )
         print_summary("single", results)
         all_results.extend(results)
     else:
         print(f"\n>>> baseline: {args.baseline_url}")
         baseline_results = asyncio.run(
-            evaluate(queries, args.baseline_url, args.token, "baseline", mode=args.mode, fusion=args.fusion, rerank=args.rerank)
+            evaluate(queries, args.baseline_url, args.token, "baseline", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze)
         )
         baseline_summary = print_summary("baseline", baseline_results)
         print(f"\n>>> candidate: {args.candidate_url}")
         candidate_results = asyncio.run(
             evaluate(
-                queries, args.candidate_url, args.token, "candidate", mode=args.mode, fusion=args.fusion, rerank=args.rerank
+                queries, args.candidate_url, args.token, "candidate", mode=args.mode, fusion=args.fusion, rerank=args.rerank, analyze=args.analyze
             )
         )
         candidate_summary = print_summary("candidate", candidate_results)