feat(ai): B-0 3-tier routing: triage/primary/fallback slots + AIClient

- config.yaml: add a new triage slot under ai.models (gemma4:e4b-it-q8_0, GPU Ollama, context_char_limit=120k, timeout 30s). Document that primary (MLX gemma-4-26b) is escalation-only. Unify fallback on gemma4:e4b (the exaone removal is already reflected). classifier/verifier stay optional; vision is relaxed to optional (preparing to clean up the unused slot).
- core/config.py: add a triage field to AIConfig and make vision Optional. Add new schema: AIModelConfig.context_char_limit and DeepSummaryBacklogConfig (R2 backlog guard thresholds: ratio 0.3 / pending 5 / window 30min). load_settings handles a missing vision entry gracefully.
- ai/client.py: add the three 3-tier entry points call_triage / call_primary / call_fallback. The call_primary docstring states the contract that the caller must invoke it inside a get_mlx_gate() block. classify/summarize only gain a DEPRECATED note and are kept for existing callers (eval runner, etc.).

PR-B B-0 Day 1. The existing primary path is unchanged, so zero regressions are expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 10:05:24 +09:00
parent 628d886cba
commit 490bef1136
3 changed files with 102 additions and 29 deletions
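
The routing contract described in the commit message can be sketched from the caller's side as follows. This is a minimal illustration, not the project's worker code: `StubAIClient`, `route`, and the module-level gate are hypothetical stand-ins, and the policy "a triage failure escalates to primary, a primary failure drops to fallback" is an assumption (the diff leaves that decision to the caller). Only the `async with get_mlx_gate():` wrapping around `call_primary` is taken from the commit.

```python
import asyncio
from typing import List, Optional

_gate: Optional[asyncio.Semaphore] = None


def get_mlx_gate() -> asyncio.Semaphore:
    """Stand-in for the project's get_mlx_gate(); assumed to return one shared Semaphore(1)."""
    global _gate
    if _gate is None:
        _gate = asyncio.Semaphore(1)
    return _gate


class StubAIClient:
    """Hypothetical stand-in for AIClient that records which tier was called."""

    def __init__(self, fail_triage: bool = False, fail_primary: bool = False) -> None:
        self.fail_triage = fail_triage
        self.fail_primary = fail_primary
        self.calls: List[str] = []

    async def call_triage(self, prompt: str) -> str:
        self.calls.append("triage")
        if self.fail_triage:
            raise RuntimeError("triage unavailable")
        return "triage:" + prompt

    async def call_primary(self, prompt: str) -> str:
        self.calls.append("primary")
        if self.fail_primary:
            raise RuntimeError("primary unavailable")
        return "primary:" + prompt

    async def call_fallback(self, prompt: str) -> str:
        self.calls.append("fallback")
        return "fallback:" + prompt


async def route(client: StubAIClient, prompt: str, escalate: bool = False) -> str:
    """Assumed caller policy: triage first, primary under the gate, fallback last."""
    if not escalate:
        try:
            # Triage runs outside the gate; concurrent calls are allowed.
            return await client.call_triage(prompt)
        except Exception:
            pass  # assumption: a triage failure escalates to primary
    try:
        # The caller, not the client, holds the gate around call_primary.
        async with get_mlx_gate():
            return await client.call_primary(prompt)
    except Exception:
        return await client.call_fallback(prompt)
```

Calling `asyncio.run(route(StubAIClient(), "hello"))` exercises only the triage tier; passing `escalate=True` skips triage and goes straight to the gated primary call.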
@@ -52,20 +52,56 @@ CLASSIFY_PROMPT = _load_prompt("classify.txt") if (PROMPTS_DIR / "classify.txt")


 class AIClient:
-    """Unified client that goes through the AI Gateway. The default is always Qwen3.5."""
+    """Unified client for AI models.
+
+    B-0 3-tier routing:
+      - call_triage(): 4B Ollama, always-on (outside llm_gate, so parallel calls are OK)
+      - call_primary(): 26B MLX, escalation-only (holding the llm_gate Semaphore(1) is the **caller's responsibility**)
+      - call_fallback(): last line of defense when triage/primary fail (currently the same 4B model)
+
+    Legacy: classify() / summarize() are kept for existing callers (tests/eval runner).
+    All new worker paths use call_triage / call_primary.
+    """

    def __init__(self):
        self.ai = settings.ai
        self._http = httpx.AsyncClient(timeout=120)

+    # ─── 3-tier routing (B-0) ───────────────────────────────────────────────
+
+    async def call_triage(self, prompt: str) -> str:
+        """Call 4B Ollama directly, outside llm_gate (Ollama handles concurrent requests).
+
+        The timeout comes from config.yaml ai.models.triage.timeout (30s in the shipped config).
+        On failure, the caller decides whether to escalate or fall back.
+        """
+        return await self._request(self.ai.triage, prompt)
+
+    async def call_primary(self, prompt: str) -> str:
+        """Call 26B MLX. Escalation-only.
+
+        **The caller must invoke this inside an `async with get_mlx_gate():` block.**
+        The gate is a Semaphore(1) that limits concurrent calls to one, and it is used only for primary.
+        """
+        return await self._request(self.ai.primary, prompt)
+
+    async def call_fallback(self, prompt: str) -> str:
+        """Last line of defense when triage/primary fail. Currently the same endpoint as triage."""
+        return await self._request(self.ai.fallback, prompt)
+
+    # ─── Legacy API (to be removed when classify_worker is replaced) ───────────────────
+
    async def classify(self, text: str) -> dict:
-        """Document classification: always uses primary (Qwen3.5)."""
+        """[DEPRECATED] For the existing classify_worker only. Replaced by summary_triage in B-1.
+
+        Kept until call sites are cleaned up. New code should use call_triage + prompt_render.
+        """
        prompt = CLASSIFY_PROMPT.replace("{document_text}", text)
        response = await self._call_chat(self.ai.primary, prompt)
        return response

    async def summarize(self, text: str, force_premium: bool = False) -> str:
-        """Document summary: primary by default, Claude only when force_premium=True."""
+        """[DEPRECATED] For existing callers. In B-1, summary_triage replaces the tldr."""
        if force_premium:
            return await self._call_chat(self.ai.premium, f"다음 문서를 500자 이내로 요약해주세요:\n\n{text}")
        return await self._call_chat(self.ai.primary, f"다음 문서를 500자 이내로 요약해주세요:\n\n{text}")
@@ -23,20 +23,36 @@ class AIModelConfig(BaseModel):
    timeout: int = 60
    daily_budget_usd: float | None = None
    require_explicit_trigger: bool = False
+    # B-0: practical context limit (in chars) assigned to the 4B/26B models. triage=120k, primary=260k.
+    # classify_worker consults it when deciding whether to escalate. 0/None disables the limit.
+    context_char_limit: int | None = None
+
+
+class DeepSummaryBacklogConfig(BaseModel):
+    """B-1 R2: thresholds that keep deep_summary enqueues from exploding."""
+    ratio_threshold: float = 0.3     # deep_n/classify_n over the last window
+    pending_threshold: int = 5       # deep_summary pending+processing
+    window_minutes: int = 30


 class AIConfig(BaseModel):
    gateway_endpoint: str
+    # B-0: 3-tier routing. triage (4B) always-on, primary (26B) escalation-only, fallback (4B) last resort.
+    triage: AIModelConfig
    primary: AIModelConfig
    fallback: AIModelConfig
    premium: AIModelConfig
    embedding: AIModelConfig
-    vision: AIModelConfig
    rerank: AIModelConfig
    # Phase 3.5a: exaone classifier (optional; falls back to a score-only gate when absent)
    classifier: AIModelConfig | None = None
    # Phase 3.5b: exaone verifier (optional; falls back to grounding-only when absent)
    verifier: AIModelConfig | None = None
+    # Legacy: vision slot (currently zero users; Document Server uses separate OCR/STT services).
+    # Removal is in progress, so keep it optional for lenient loading.
+    vision: AIModelConfig | None = None
+    # B-1 R2: backlog guard thresholds
+    deep_summary_backlog: DeepSummaryBacklogConfig = DeepSummaryBacklogConfig()


 class Settings(BaseModel):
@@ -106,23 +122,29 @@ def load_settings() -> Settings:

        if "ai" in raw:
            ai_raw = raw["ai"]
+            models = ai_raw.get("models", {})
+            # B-0: triage is a new slot that may be missing from config.yaml. For compatibility
+            # with older configs, use fallback as triage when it is absent (same model reused).
+            triage_raw = models.get("triage") or models.get("fallback")
+            if triage_raw is None:
+                raise ValueError("config.yaml: ai.models.triage (or fallback) required")
            ai_config = AIConfig(
                gateway_endpoint=ai_raw.get("gateway", {}).get("endpoint", ""),
-                primary=AIModelConfig(**ai_raw["models"]["primary"]),
-                fallback=AIModelConfig(**ai_raw["models"]["fallback"]),
-                premium=AIModelConfig(**ai_raw["models"]["premium"]),
-                embedding=AIModelConfig(**ai_raw["models"]["embedding"]),
-                vision=AIModelConfig(**ai_raw["models"]["vision"]),
-                rerank=AIModelConfig(**ai_raw["models"]["rerank"]),
+                triage=AIModelConfig(**triage_raw),
+                primary=AIModelConfig(**models["primary"]),
+                fallback=AIModelConfig(**models["fallback"]),
+                premium=AIModelConfig(**models["premium"]),
+                embedding=AIModelConfig(**models["embedding"]),
+                rerank=AIModelConfig(**models["rerank"]),
+                vision=(AIModelConfig(**models["vision"]) if "vision" in models else None),
                classifier=(
-                    AIModelConfig(**ai_raw["models"]["classifier"])
-                    if "classifier" in ai_raw.get("models", {})
-                    else None
+                    AIModelConfig(**models["classifier"]) if "classifier" in models else None
                ),
                verifier=(
-                    AIModelConfig(**ai_raw["models"]["verifier"])
-                    if "verifier" in ai_raw.get("models", {})
-                    else None
+                    AIModelConfig(**models["verifier"]) if "verifier" in models else None
+                ),
+                deep_summary_backlog=DeepSummaryBacklogConfig(
+                    **ai_raw.get("deep_summary_backlog", {})
                ),
            )
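
The `context_char_limit` field added in this file is described as an input to the escalation decision, with `0`/`None` meaning "no limit". A minimal sketch of such a check is shown below; the function name `needs_long_context_escalation` is hypothetical, and comparing against `len(text)` is an assumption about how the character limit is applied, not the behavior of the actual classify_worker.

```python
from typing import Optional


def needs_long_context_escalation(text: str, context_char_limit: Optional[int]) -> bool:
    """Return True when the text exceeds the tier's character limit.

    A limit of 0 or None disables the check, matching the comment on
    AIModelConfig.context_char_limit.
    """
    if not context_char_limit:
        return False
    return len(text) > context_char_limit
```

With the shipped triage limit of 120000 characters, a 150000-character document would be flagged for escalation under this sketch, while a document of exactly 120000 characters would not.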

@@ -5,15 +5,28 @@ ai:
    endpoint: "http://ai-gateway:8080"

  models:
+    # ─── 3-tier routing (PR-B) ───
+    # triage: always-on classification, summarization, and evidence selection. GPU Ollama gemma-4b (Q8_0, ~11.6GB).
+    #         concurrent OK, no need to go through the llm_gate Semaphore.
+    triage:
+      endpoint: "http://ollama:11434/v1/chat/completions"
+      model: "gemma4:e4b-it-q8_0"
+      max_tokens: 4096
+      timeout: 30
+      context_char_limit: 120000
+
+    # primary: escalation-only. 26B MLX (guarded by the Mac mini Semaphore(1)).
    primary:
      endpoint: "http://100.76.254.116:8801/v1/chat/completions"
      model: "mlx-community/gemma-4-26b-a4b-it-8bit"
-      max_tokens: 4096
-      timeout: 60
+      max_tokens: 8192
+      timeout: 180
+      context_char_limit: 260000

+    # fallback: last line of defense when primary is down. Same model as triage; degrading to gemma-4b is acceptable.
    fallback:
      endpoint: "http://ollama:11434/v1/chat/completions"
-      model: "qwen3.5:9b-q8_0"
+      model: "gemma4:e4b-it-q8_0"
      max_tokens: 4096
      timeout: 120

@@ -28,19 +41,21 @@ ai:
      endpoint: "http://ollama:11434/api/embeddings"
      model: "bge-m3"

-    vision:
-      endpoint: "http://ollama:11434/api/generate"
-      model: "Qwen2.5-VL-7B"
-
    rerank:
      endpoint: "http://ollama:11434/api/rerank"
      model: "bge-reranker-v2-m3"
-    # Phase 3.5a: exaone answerability classifier (GPU Ollama, concurrent OK)
-    classifier:
-      endpoint: "http://ollama:11434/v1/chat/completions"
-      model: "exaone3.5:7.8b-instruct-q8_0"
-      max_tokens: 512
-      timeout: 10
+    # Removed: classifier (leftover from Phase 3.5a exaone; safe to remove because classifier_service
+    #          treats it as optional via hasattr) / vision (unused)
+
+  # ─── deep_summary enqueue backlog guard (B-1 R2) ───
+  # Before initial tuning, too many soft escalations into the deep_summary queue will saturate MLX 26B.
+  # If any of the thresholds below is exceeded, soft escalations (recommend_deep_summary only) are
+  # suppressed. Hard escalations (long_context / triage_json_invalid / low_confidence) are
+  # never suppressed.
+  deep_summary_backlog:
+    ratio_threshold: 0.3      # deep_n/classify_n over the last window
+    pending_threshold: 5      # pending+processing in the deep_summary stage
+    window_minutes: 30

 nas:
  mount_path: "/documents"
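
The `deep_summary_backlog` comments in the config.yaml hunk describe a guard that suppresses soft escalations once either threshold is exceeded, while hard escalations always pass. A rough sketch of that rule follows. The function `should_enqueue_deep_summary`, the `BacklogThresholds` dataclass, and the way the counts are passed in are all hypothetical; the guard itself belongs to B-1 and is not part of this commit. Treating "exceeded" as a strict `>` is also an assumption.

```python
from dataclasses import dataclass

# Reasons named in config.yaml as hard escalations that are never suppressed.
HARD_REASONS = frozenset({"long_context", "triage_json_invalid", "low_confidence"})


@dataclass(frozen=True)
class BacklogThresholds:
    """Hypothetical mirror of DeepSummaryBacklogConfig with the same defaults."""
    ratio_threshold: float = 0.3
    pending_threshold: int = 5
    window_minutes: int = 30


def should_enqueue_deep_summary(
    reason: str,
    deep_n: int,
    classify_n: int,
    pending: int,
    cfg: BacklogThresholds = BacklogThresholds(),
) -> bool:
    """Decide whether a deep_summary job should be enqueued.

    deep_n / classify_n are counts over the last cfg.window_minutes;
    pending is the number of pending+processing deep_summary jobs.
    """
    if reason in HARD_REASONS:
        return True  # hard escalations bypass the guard entirely
    ratio = deep_n / classify_n if classify_n else 0.0
    over_ratio = ratio > cfg.ratio_threshold
    over_pending = pending > cfg.pending_threshold
    # Soft escalation (e.g. recommend_deep_summary) is dropped if either limit is exceeded.
    return not (over_ratio or over_pending)
```

Under these assumptions, a `recommend_deep_summary` request arriving when 4 of the last 10 classifications already escalated would be suppressed, whereas a `long_context` request would still be enqueued regardless of backlog.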