hyungi_document_server/app/services/search/llm_gate.py

"""Global MLX single-inference gate (Phase 3.1.1 + B-1 Priority Gate).

The Mac mini MLX primary (gemma-4-26b-a4b-it-8bit) was **single-inference** when
this gate was introduced: concurrent calls made its queue blow up (measured:
23 concurrent requests → 22 of them hit the 15-second timeout).

This module provides a **priority-based gate** shared by **every call that
targets the Mac mini MLX endpoint**: analyzer / evidence / classifier /
synthesis (gemma-macmini backend only). Concurrency is capped (default 1, see
the 2026-06-12 revision below), and the queue is ordered so that
`Priority.FOREGROUND` (user-facing ask) is dispatched before
`Priority.BACKGROUND` (digest/briefing/worker).

Since PR-MacBook-RAG-Backend-1, `services.llm.QwenMacBookBackend` targets a
separate endpoint (MacBook mlx-vlm.server), so it does not go through this gate;
it uses its own Semaphore(1).

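## Permanent rules

- **Every call path to the Mac mini MLX endpoint must acquire the gate, no
  exceptions.** Current users are query_analyzer / evidence / classifier /
  `synthesis (gemma-macmini backend)`. If more paths are added later and they
  hit the **same Mac mini endpoint**, they import and use this same gate. Do not
  create a new Semaphore: splitting the queue for one endpoint lets calls run
  concurrently (the PR #20 incident, [[feedback_docstring_invariant_swap_audit]]).
  Other endpoints (MacBook, etc.) get their own dedicated gate, unrelated to
  this one.
- **Apply `asyncio.timeout(...)` only inside the gate.** Putting a timeout
  around the gate wait itself brings back the "timed out just by waiting" bug
  (an early query_analyzer issue).
- **The fallback (Claude Sonnet 4 API) path is not subject to the gate.** Since
  PR #20, fallback = Claude API. In the current implementation, however, the
  primary→fallback switch happens inside `AIClient._call_chat`, so the fallback
  also runs while the gate is held. This is acceptable (fallback is rare).
- ~~**MLX concurrency is fixed at `MLX_CONCURRENCY = 1`**~~ → **Revised 2026-06-12**:
  the premise of the old rule (server = single-inference) no longer holds. The
  current mlx_vlm server absorbs concurrent streams with continuous batching
  (measured). The cap comes from config `pipeline.mlx_gate_concurrency`
  (default 1, 2 in production). **The gate itself (cap + priority queue) stays
  permanently**: this cap remains what prevents a thundering herd (the
  23 concurrent → 22 timeouts incident). Never run unbounded.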

## Priority policy (B-1, 2026-05-17)

- `Priority.FOREGROUND = 0`: user-facing paths (`/api/search/ask`, synchronous
  user APIs, calls routed through the Hermes orchestrator). Dispatched as soon
  as possible.
- `Priority.BACKGROUND = 100`: digest / briefing / classify-escalate /
  study_* workers / query_analyzer prewarm. Dispatched only when no foreground
  waiter is queued.
- **DEFAULT_PRIORITY = BACKGROUND**: calls that do not specify a priority never
  cut ahead of foreground work (safe default).
- **No preemption**: an in-flight background call is never interrupted. A
  foreground call that arrives still waits for the background call currently
  holding the gate to finish, but it jumps ahead of any background calls that
  were already queued (for example the 2nd to 5th in line).
- **No starvation aging** (deferred to Phase 2). If a BACKGROUND waiter's
  wait_ms exceeds 5 minutes, a WARN is logged as a clue for root-cause tracing.
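
The ordering described above follows from sorting waiters by a
`(priority, seq)` pair. The sketch below reproduces only that ordering with
`heapq`; the labels and the `enqueue` helper are hypothetical, and no gate or
Future is involved.

```python
import heapq
import itertools

FOREGROUND, BACKGROUND = 0, 100  # same values as Priority in this module
seq = itertools.count()
queue = []


def enqueue(priority, label):
    # (priority, seq) sorts by priority first, then by arrival order
    heapq.heappush(queue, (priority, next(seq), label))


for label in ("bg-1", "bg-2", "bg-3"):
    enqueue(BACKGROUND, label)
enqueue(FOREGROUND, "fg-1")
enqueue(BACKGROUND, "bg-4")
enqueue(FOREGROUND, "fg-2")

order = [heapq.heappop(queue)[2] for _ in range(len(queue))]
print(order)
```

Foreground entries jump ahead of every queued background entry, while entries
of the same priority keep their arrival order because `seq` is unique and
increasing.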

## Usage

```python
import asyncio

from services.search.llm_gate import acquire_mlx_gate, Priority

async def user_ask_path(ai_client, prompt):
    async with acquire_mlx_gate(Priority.FOREGROUND):
        # timeout goes inside the gate so queue wait time is not counted
        async with asyncio.timeout(30):
            raw = await ai_client._call_chat(ai_client.ai.primary, prompt)
    return raw

async def background_worker():
    async with acquire_mlx_gate(Priority.BACKGROUND):
        ...
```

## Room for extension

- aging (background wait time → priority boost): Phase 2
- generalizing to concurrency > 1: B-2 (Throughput)
- splitting into separate gates (`get_analyzer_gate` / `get_ask_gate`):
  pointless, since it does not improve throughput on a single-inference server
  (the priority inside the one PriorityQueue is enough)
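
As a rough illustration of the aging idea, one possible (purely hypothetical)
rule lowers a waiter's sort key as it waits, clamped so it never outranks
FOREGROUND. The function name `effective_priority` and the `boost_per_min`
parameter are assumptions made for this sketch, not part of this module.

```python
FOREGROUND, BACKGROUND = 0, 100  # same values as Priority in this module


def effective_priority(base: int, waited_s: float, boost_per_min: float = 20.0) -> float:
    # Hypothetical aging rule: each minute waited lowers the sort key by
    # boost_per_min, but never below FOREGROUND.
    return max(float(FOREGROUND), base - boost_per_min * (waited_s / 60.0))


fresh = effective_priority(BACKGROUND, 0)
aged = effective_priority(BACKGROUND, 150)
capped = effective_priority(BACKGROUND, 600)
print(fresh, aged, capped)
```

Because such a key changes over time, a static heap ordering would no longer be
enough; one option would be to recompute the key for each waiter at dispatch
time.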
"""

from __future__ import annotations

import asyncio
import heapq
import itertools
import time
from contextlib import asynccontextmanager
from enum import IntEnum
from typing import AsyncIterator

from core.utils import setup_logger

logger = setup_logger("llm_gate")


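def _capacity() -> int:
    """Concurrency cap of the gate, from config.yaml `pipeline.mlx_gate_concurrency` (default 1).

    Generalized on 2026-06-12: the premise of the permanent rule
    "MLX_CONCURRENCY = 1, fixed" (the old server was single-inference;
    23 concurrent → 22 timeouts, measured) no longer holds. The current mlx_vlm
    server absorbs concurrent streams with continuous batching (6 to 8
    concurrent streams ran normally on the night of 2026-06-11, measured).
    The gate itself (cap + priority) is kept and only the cap moves to config;
    this cap still guards against a thundering-herd relapse. The value is read
    at runtime on every acquire; a config change takes effect after a process
    restart, and tests monkeypatch settings.
    """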
    from core.config import settings
    try:
        return max(1, int(getattr(settings, "mlx_gate_concurrency", 1)))
    except (TypeError, ValueError):
        return 1

# WARN when a background waiter's wait_ms exceeds this value (starvation signal; aging mitigation is Phase 2)
STARVATION_WARN_MS = 300_000  # 5 min


class Priority(IntEnum):
    """MLX gate dispatch priority. Lower values are dispatched first."""

    FOREGROUND = 0
    BACKGROUND = 100


DEFAULT_PRIORITY: Priority = Priority.BACKGROUND


# ── Internal state (lazy init on first acquire) ─────────────────────────────
# Tuple format: (priority: int, seq: int, future: asyncio.Future, enqueue_ts: float)
_waiters: list[tuple[int, int, asyncio.Future, float]] = []
_seq = itertools.count()
_inflight_n: int = 0  # number of in-flight calls (formerly a bool; a counter since the capacity generalization)
_lock: asyncio.Lock | None = None


def _get_lock() -> asyncio.Lock:
    """Lazy init Lock on the current event loop."""
    global _lock
    if _lock is None:
        _lock = asyncio.Lock()
    return _lock


def _dispatch_next_locked() -> asyncio.Future | None:
    """Pop and return the Future of the next live waiter, skipping cancelled/done entries.

    The caller must hold the lock. Call set_result on the returned Future outside the lock.
    """
    while _waiters:
        priority, seq, fut, enqueue_ts = heapq.heappop(_waiters)
        if fut.cancelled() or fut.done():
            continue  # skip Futures that died from a timeout/cancel
        return fut
    return None


@asynccontextmanager
async def acquire_mlx_gate(
    priority: Priority = DEFAULT_PRIORITY,
) -> AsyncIterator[None]:
    """Priority-based gate for the MLX primary.

    Args:
        priority: Priority.FOREGROUND (user-facing) or BACKGROUND (worker).
                  Defaults to BACKGROUND when omitted (safe default).

    Usage:
        async with acquire_mlx_gate(Priority.FOREGROUND):
            async with asyncio.timeout(30):
                raw = await ai_client._call_chat(ai_client.ai.primary, prompt)

    ⚠ `asyncio.timeout` must go inside the gate (after the Future has been awaited).
    """
    global _inflight_n, _waiters

    lock = _get_lock()
    seq = next(_seq)
    enqueue_ts = time.monotonic()
    waited = False
    fut: asyncio.Future | None = None

    async with lock:
        if _inflight_n < _capacity() and not _waiters:
            # fast path: enter in-flight immediately, no Future is created
            _inflight_n += 1
        else:
            # join the wait queue
            fut = asyncio.get_running_loop().create_future()
            heapq.heappush(_waiters, (int(priority), seq, fut, enqueue_ts))
            queue_len = len(_waiters)
            logger.debug(
                "mlx_gate enqueue priority=%s seq=%d queue_len=%d",
                priority.name, seq, queue_len,
            )
            waited = True

    if waited and fut is not None:
        # await outside the lock: if release called set_result while holding the lock, re-entry would deadlock
        await fut

    # entered in-flight: measure wait_ms, log the dispatch, WARN on starvation
    wait_ms = (time.monotonic() - enqueue_ts) * 1000.0 if waited else 0.0
    if waited:
        async with lock:
            queue_len_post = len(_waiters)
        logger.info(
            "mlx_gate dispatch priority=%s seq=%d wait_ms=%.0f queue_len=%d",
            priority.name, seq, wait_ms, queue_len_post,
        )
        if priority == Priority.BACKGROUND and wait_ms > STARVATION_WARN_MS:
            logger.warning(
                "mlx_gate background waiter starved wait_ms=%.0f priority=%s seq=%d",
                wait_ms, priority.name, seq,
            )

    inflight_start = time.monotonic()
    try:
        yield
    finally:
        duration_ms = (time.monotonic() - inflight_start) * 1000.0
        next_fut: asyncio.Future | None = None
        async with lock:
            next_fut = _dispatch_next_locked()
            if next_fut is None:
                _inflight_n = max(0, _inflight_n - 1)
            # if next_fut exists the slot is handed over: keep the count (the next waiter is about to enter)
        logger.debug(
            "mlx_gate release duration_ms=%.0f priority=%s seq=%d",
            duration_ms, priority.name, seq,
        )
        if next_fut is not None:
            # set_result outside the lock to avoid a re-entry deadlock
            loop = asyncio.get_running_loop()
            loop.call_soon(next_fut.set_result, None)


# ── Backward compat: context-manager only wrapper ────────────────────────────


def get_mlx_gate():
    """Legacy wrapper: only the `async with get_mlx_gate():` form is supported.

    Internally delegates to `acquire_mlx_gate(DEFAULT_PRIORITY)` (= BACKGROUND).
    New call sites should call `acquire_mlx_gate(Priority.FOREGROUND|BACKGROUND)` explicitly.

    ⚠ **No Semaphore-like API**: direct acquire/release patterns such as
    `sem = get_mlx_gate(); await sem.acquire()` do not work. When one is found,
    replace the call site explicitly with `async with acquire_mlx_gate(...)`.
    """
    return acquire_mlx_gate(DEFAULT_PRIORITY)


# ── Read-only status (for UI display) ───────────────────────────────────────


def gate_status() -> dict:
    """Snapshot of current gate occupancy (read-only, lock-free approximation for UI display).

    inflight = number of in-flight calls (int). Existing consumers (eid status)
    cast it with bool(), so this stays compatible.
    """
    return {"inflight": _inflight_n, "waiters": len(_waiters)}


# ── Test helpers (conftest reset) ────────────────────────────────────────────


def _reset_for_test() -> None:
    """Called by the test fixture for every fresh loop. Not for use in production code."""
    global _waiters, _inflight_n, _lock, _seq
    _waiters = []
    _inflight_n = 0
    _lock = None
    _seq = itertools.count()