fix(infra-agent): mute mlx alerts during KST 0-7h backfill window

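Document Server's tier_backfill batch-enqueues jobs to 26B between 0:00 and
6:00 KST, which locks the /v1/models response for 5-10 seconds and repeatedly
triggers healthcheck timeout alerts. This resolves the mismatch between the
policy intent (night = batch occupancy hours) and the healthcheck SLA (same 24/7).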

- During KST 0-7h (policy 0-6h + 1h buffer for leftover processing), downgrade mlx down/degraded to log-only
- Daytime timeouts still alert (preserves the signal for real-user impact)
- Other services (document-server, ollama-gpu) are unaffected

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hyungi Ahn
2026-04-27 07:34:04 +09:00
parent 02cc1b3da6
commit cc2c9467fe
+13
@@ -18,6 +18,7 @@ import os
 import tempfile
 from datetime import datetime, timezone
 from pathlib import Path
+from zoneinfo import ZoneInfo
 import httpx
 from dotenv import load_dotenv
@@ -201,13 +202,25 @@ async def check_disk_rules() -> dict[str, str]:
 async def check_health_rules() -> dict[str, str]:
     issues: dict[str, str] = {}
+    # KST 0-7h is when Document Server tier_backfill runs (policy 0-6h + 1h buffer for leftover processing).
+    # A 5-10 s lock on the /v1/models response is expected while batch jobs occupy 26B, so only mlx alerts are downgraded.
+    # Policy: ~/Documents/code/hyungi_Document_Server/app/workers/tier_backfill.py NIGHT_START/END_HOUR
+    kst_hour = datetime.now(tz=ZoneInfo("Asia/Seoul")).hour
+    is_backfill_window = 0 <= kst_hour < 7
     for svc in HEALTH_SERVICES:
         result = await service_health(svc)
         if not result.ok:
+            if svc == "mlx" and is_backfill_window:
+                log.info("[mute] mlx down — KST %d시 backfill window", kst_hour)
+                continue
             detail = result.error or result.status
             k = _health_key(svc, "down")
             issues[k] = f"서비스 다운: {svc}{detail}"
         elif result.status == "degraded":
+            if svc == "mlx" and is_backfill_window:
+                log.info("[mute] mlx degraded — KST %d시 backfill window", kst_hour)
+                continue
             k = _health_key(svc, "degraded")
             issues[k] = f"서비스 저하: {svc}"
     return issues
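
The window check in the diff reads the wall clock inline, which makes the 06:59/07:00 KST boundary awkward to test. A minimal sketch of the same rule as a pure function with an injectable clock follows. The names `in_backfill_window`, `should_mute`, `BACKFILL_START_HOUR`, and `BACKFILL_END_HOUR` are hypothetical and not part of this commit; the hours 0 and 7 and the service name "mlx" are taken from the diff.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo

KST = ZoneInfo("Asia/Seoul")
# Hypothetical constants mirroring the diff: policy 0-6h plus a 1h buffer.
BACKFILL_START_HOUR = 0
BACKFILL_END_HOUR = 7


def in_backfill_window(now: Optional[datetime] = None) -> bool:
    """Return True when `now` falls in KST [0, 7).

    `now` must be timezone-aware; it defaults to the current time.
    """
    if now is None:
        now = datetime.now(tz=KST)
    kst_hour = now.astimezone(KST).hour
    return BACKFILL_START_HOUR <= kst_hour < BACKFILL_END_HOUR


def should_mute(svc: str, now: Optional[datetime] = None) -> bool:
    """Only mlx is muted; every other service keeps alerting."""
    return svc == "mlx" and in_backfill_window(now)


# Boundary checks, expressed in UTC (KST is UTC+9 with no DST).
last_muted_minute = datetime(2026, 4, 26, 21, 59, tzinfo=timezone.utc)  # 06:59 KST
first_alert_minute = datetime(2026, 4, 26, 22, 0, tzinfo=timezone.utc)  # 07:00 KST
```

With this shape, `check_health_rules()` could call `should_mute(svc)` in both the down and degraded branches, and a unit test can pin the boundary by passing `last_muted_minute` and `first_alert_minute` instead of patching the clock.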