Files

Hyungi Ahn c4c32170f1 feat: NanoClaude Phase 2 — EXAONE→Gemma 파이프라인, 큐, 상태 API

- ModelAdapter: 범용 OpenAI-compat 어댑터 (stream/complete/health)
- BackendRegistry: rewriter(EXAONE) + reasoner(Gemma4) 헬스체크 루프
- 2단계 파이프라인: EXAONE rewrite → Gemma reasoning (SSE rewrite 이벤트 노출)
- Fallback: 맥미니 다운 시 EXAONE 단독 모드, stream 중간 실패 시 자동 전환
- Cancel-safe: rewrite 전/후, streaming loop 내, fallback 경로 모두 체크
- Rewrite heartbeat: complete_chat 대기 중 2초 간격 processing 이벤트
- JobQueue: Semaphore(3) 기반 동시성 제한, 정확한 queue position
- GET /chat/{job_id}/status, GET /queue/stats 엔드포인트
- DB: rewrite_model, reasoning_model, rewritten_message 컬럼 추가

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-06 12:04:15 +09:00

1.4 KiB

Raw Blame History

AI Gateway

GPU 서버(RTX 4070 Ti Super)에서 운영하는 중앙 AI 라우팅 서비스. 모든 AI 요청을 하나의 OpenAI 호환 API로 통합.

서비스 구조

서비스	디렉토리	스택	포트
Caddy	caddy/	Caddy 2	80/443
hub-api	hub-api/	FastAPI + aiosqlite	8000
hub-web	hub-web/	Vite + React + shadcn/ui	3000
NanoClaude	nanoclaude/	FastAPI + aiosqlite	8100

외부 연결

GPU Ollama: host.docker.internal:11434
맥미니 Ollama: 100.115.153.119:11434
맥미니 MLX: 192.168.1.122:8800 (Gemma 4)
NanoClaude: localhost:8100 (EXAONE → Gemma 파이프라인)

개발

cd hub-api
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

배포

docker compose up -d --build

API

OpenAI 호환: /v1/chat/completions, /v1/models, /v1/embeddings 인증: /auth/login → Cookie 또는 Bearer 토큰 모니터링: /health, /gpu

NanoClaude API

비동기 job 기반: POST /nano/chat → { job_id }, GET /nano/chat/{job_id}/stream → SSE 상태: GET /nano/chat/{job_id}/status, 큐: GET /nano/queue/stats 취소: POST /nano/chat/{job_id}/cancel 파이프라인: EXAONE (rewrite) → Gemma 4 (reasoning), 맥미니 다운 시 EXAONE fallback

백엔드 설정

backends.json에서 백엔드 추가/제거. 서비스 재시작 필요.

1.4 KiB Raw Blame History