fix(search): LLM_TIMEOUT_MS 5000 → 15000 (reflect measured latency)

Re-measured with the reduced prompt:
- prompt_tok 2406 → 802 (cut to about 1/3, as intended)
- latency 10.5 s → 7 to 11 s (generation is dominant)
- lowering max_tokens has no effect (natural EOS at ~289 tok)

At 5000 ms every prewarm call still times out. The analyzer runs async, so waiting up to 15 s in the background has zero impact on the retrieval path.

Also: prewarm delay_between 0.5 → 0.2 (shortens total prewarm time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
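The reasoning above (a generous timeout is harmless as long as only the background path waits on it) can be sketched as follows. This is a minimal illustration, not the code in query_analyzer.py: `call_llm`, `analyze_with_timeout`, `prewarm`, and `search` are hypothetical names, and the cache is a plain dict standing in for the real TTL cache.

```python
import asyncio

LLM_TIMEOUT_MS = 15000  # value introduced by this commit


async def analyze_with_timeout(call_llm, query, timeout_ms=LLM_TIMEOUT_MS):
    """Return the analysis, or None if the LLM call exceeds the timeout.

    `call_llm` is a hypothetical async callable taking the query string.
    """
    try:
        return await asyncio.wait_for(call_llm(query), timeout=timeout_ms / 1000)
    except asyncio.TimeoutError:
        return None


async def prewarm(call_llm, queries, cache, delay_between=0.2,
                  timeout_ms=LLM_TIMEOUT_MS):
    """Analyze queries one at a time, caching successes and counting timeouts."""
    stats = {"ok": 0, "timeout": 0}
    for i, query in enumerate(queries):
        result = await analyze_with_timeout(call_llm, query, timeout_ms)
        if result is None:
            stats["timeout"] += 1
        else:
            cache[query] = result
            stats["ok"] += 1
        # Pause between calls, but not after the last one.
        if i < len(queries) - 1:
            await asyncio.sleep(delay_between)
    return stats


def search(query, cache):
    """Synchronous retrieval path: reads the cache only, never waits on the LLM."""
    return cache.get(query)
```

At startup, scheduling this with something like `asyncio.create_task(prewarm(...))` keeps the whole wait off the request path. A timed-out query simply stays uncached, which is why raising the timeout costs the retrieval path nothing.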
2026-04-08 14:50:56 +09:00
parent c81b728ddf
commit 324537cbc8
1 changed files with 5 additions and 2 deletions
--- a/app/services/search/query_analyzer.py
+++ b/app/services/search/query_analyzer.py
@@ -42,7 +42,10 @@ logger = logging.getLogger("query_analyzer")
 PROMPT_VERSION = "v2"  # reduced version of prompts/query_analyze.txt
 CACHE_TTL = 86400  # 24h
 CACHE_MAXSIZE = 1000
-LLM_TIMEOUT_MS = 5000  # async (background) only; never call on the sync path
+LLM_TIMEOUT_MS = 15000  # async (background) only; never call on the sync path
+# ↑ Measured: gemma-4-26b-a4b-it-8bit on MLX, reduced prompt (prompt_tok=802), 7 to 11 s.
+#   Generation is dominant (max_tokens has no effect; natural EOS after ~289 tok).
+#   Runs in the background, so 15 s is safe. If it needs raising, adjust it only here.
 MIN_CACHE_CONFIDENCE = 0.5  # do not cache results below this confidence
 MAX_NORMALIZED_QUERIES = 3
@@ -350,7 +353,7 @@ DEFAULT_PREWARM_QUERIES: list[str] = [
 async def prewarm_analyzer(
    queries: list[str] | None = None,
-    delay_between: float = 0.5,
+    delay_between: float = 0.2,
 ) -> dict[str, Any]:
    """Called at app startup. Pre-analyzes representative queries and loads the results into the cache.