gpu-services

Author	SHA1	Message	Date
Hyungi Ahn	cc2c9467fe	fix(infra-agent): mute mlx alerts during KST 0-7h backfill window Document Server tier_backfill batch-enqueues to 26B between 0 and 6 KST, which locks the /v1/models response for 5-10 seconds and repeatedly triggers healthcheck timeout alerts. Resolves the mismatch between the policy intent (night = batch occupancy time) and the healthcheck SLA (identical 24/7). - During KST 0-7h (policy 0-6 + 1h buffer for residual processing), mlx down/degraded is downgraded to log-only - Daytime timeouts still alert (preserves the real-user impact signal) - Other services (document-server, ollama-gpu) are unaffected Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:34:04 +09:00
Hyungi Ahn	378866a99c	fix(infra): agent alert-on-change (debounce + stable key + exclude MacBook) - Track previous issues in a state file (~/Library/Application Support/infra-agent/) - Stable issue key (of the form docker:gpu:container:status) - Alert after 2 consecutive failures, send a recovery alert after 2 consecutive successes - Stay silent while the same issue persists (prevents alert storms) - Remove MacBook Pro from EXPECTED_TAILSCALE_HOSTS (sleeping is normal) - Atomic state file write + graceful fallback on corruption Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 06:53:24 +09:00
Hyungi Ahn	03e3df058f	feat(infra): add docker_restart write tool Blocks restarts of protected containers (home-caddy, home-fail2ban, nanoclaude). 11 MCP tools + NanoClaude wrapper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 15:06:40 +09:00
Hyungi Ahn	d47c04317c	feat(infra): Phase 1.5, 3 diagnostic tools + trace cleanup Adds scheduler_status, queue_status, run_verify. 10 MCP tools + NanoClaude wrapper + pre-route keywords. Removes trace prints from worker.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:27:19 +09:00
Hyungi Ahn	1abec083e7	feat(nanoclaude): deployment prep (Dockerfile + self-SSH local branch) - Dockerfile: copy infra/, add openssh-client and healthcheck - requirements.txt: add asyncssh, python-dotenv - core/ssh.py: run locally instead of self-SSH when the INFRA_LOCAL_HOST environment variable is set Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:31:41 +09:00
Hyungi Ahn	82ce83b8b7	feat(infra): Phase 2.1 natural-language alert explanations via Gemma 4 When an anomaly is detected, Gemma 4 (MLX localhost:8801) generates a root-cause analysis + recommended actions. If Gemma fails, the alert is still sent with the rule results alone (graceful degradation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:21:02 +09:00
Hyungi Ahn	ac8787c153	feat(infra): Phase 2 monitoring agent (rule-first + Synology Chat alerts) Agent intended for a 5-minute cron. 4 checks: docker/disk/health/network. Suppresses asyncssh logs, ignores small partitions (< 1G). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:16:01 +09:00
Hyungi Ahn	b1f9e87d6a	feat(infra): integrate MCP infra server (7 tools + core/ split) Merges mcp-infra-server into gpu-services/infra/. The pure logic in core/ can also be imported directly from Agent/NanoClaude. Tools: docker_status, docker_logs, service_health, disk_usage, tailscale_status, ollama_models, mlx_models. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 13:11:54 +09:00
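
The debounce rule described in commit 378866a99c (alert after 2 consecutive failures, recovery alert after 2 consecutive successes, silence while an issue persists) can be sketched roughly as below. This is a minimal illustration and not the repository's code: `IssueState`, `update`, and `THRESHOLD` are hypothetical names, and persisting the state to a file is left out.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 2  # consecutive observations required (value taken from the commit message)


@dataclass
class IssueState:
    """Hypothetical per-issue state, keyed elsewhere by a stable issue key."""
    alerting: bool = False
    fail_streak: int = 0
    ok_streak: int = 0


def update(state: IssueState, failing: bool) -> Optional[str]:
    """Record one check result; return "alert", "recovered", or None (stay silent)."""
    if failing:
        state.fail_streak += 1
        state.ok_streak = 0
        # Fire once, on the transition into the alerting state.
        if not state.alerting and state.fail_streak >= THRESHOLD:
            state.alerting = True
            return "alert"
    else:
        state.ok_streak += 1
        state.fail_streak = 0
        # Announce recovery only after the issue has stayed healthy long enough.
        if state.alerting and state.ok_streak >= THRESHOLD:
            state.alerting = False
            return "recovered"
    return None
```

Feeding the results fail, fail, fail, ok, fail, ok, ok into one `IssueState` yields a single "alert" on the second failure and a single "recovered" on the final success; every other run is silent, which is what avoids the alert storm.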

8 Commits
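
The mute window from commit cc2c9467fe (mlx down/degraded becomes log-only during KST 0-7h, other services unaffected) could look something like the following. This is a sketch under assumptions: `alert_action`, the `"mlx"` service label, and the `"log"`/`"alert"` return values are hypothetical and are not taken from the repository.

```python
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))  # Korea Standard Time, fixed UTC+9
MUTE_START_HOUR = 0  # inclusive
MUTE_END_HOUR = 7    # exclusive: policy window 0-6 plus a 1h buffer


def alert_action(service: str, now: datetime) -> str:
    """Return "log" if the issue should be log-only, otherwise "alert".

    `now` is expected to be timezone-aware; it is converted to KST first.
    """
    hour = now.astimezone(KST).hour
    if service == "mlx" and MUTE_START_HOUR <= hour < MUTE_END_HOUR:
        return "log"
    return "alert"
```

Converting to KST before comparing hours keeps the window correct regardless of the host's local timezone, for example when the agent receives a UTC timestamp.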