hyungi_document_server/services/stt/server.py
Hyungi Ahn 1e2c004dd4 feat(media): §3 audio STT + video playback infrastructure
plan: ~/.claude/plans/luminous-sprouting-hamster.md §3

Schema:
- migrations/147_audio_segments_table.sql: audio_segments (timestamped
  STT segments)
- migrations/148_audio_segments_idx.sql: index on (document_id, start_s)
- migrations/149_document_media_cols.sql: documents.thumbnail_path +
  needs_conversion
- migrations/150_queue_stage_stt.sql: process_stage += 'stt'
- migrations/151_queue_stage_thumbnail.sql: process_stage += 'thumbnail'
- app/models/audio_segment.py, document.py (thumbnail_path/needs_conversion)

Services:
- services/stt/{Dockerfile, requirements.txt, server.py}: faster-whisper
  large-v3 GPU container. /transcribe (filePath/langs/beamSize) +
  /health + /ready (cuda_device_count + models_loaded). NFC/NFD path
  resolver (lesson learned from the OCR service).
- docker-compose.yml: adds stt-service (1 GPU reserved, :3300, NAS
  read-only mount, stt_models volume, start_period 300s) and STT_ENDPOINT
  in the fastapi env.
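
The NFC/NFD resolver is needed because the same Korean filename can be stored as two different code point sequences. A minimal standard-library sketch of the mismatch follows; the filename is borrowed from the /transcribe docstring example, and which side (DB or NAS) uses which form is as described in server.py, not verified here.

```python
import unicodedata

name = "2026-04-23_회의.mp3"

nfc = unicodedata.normalize("NFC", name)  # precomposed Hangul syllables
nfd = unicodedata.normalize("NFD", name)  # decomposed into conjoining jamo

# Both forms render identically but compare unequal, so a path stored in
# one form may not match a directory entry stored in the other.
same_string = nfc == nfd
same_after_normalizing = unicodedata.normalize("NFC", nfd) == nfc
length_nfc, length_nfd = len(nfc), len(nfd)
```

Comparing names only after normalizing both sides to one form, as the resolver's last fallback does, sidesteps the mismatch.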

Pipeline (depends on §1 category):
- app/workers/stt_worker.py (new): picks up stage='stt' → calls
  STT_ENDPOINT → stores extracted_text + audio_segments. Timeout: 30 min.
- app/workers/thumbnail_worker.py (new): ffmpeg grabs one frame at the
  50% mark → PKM/Videos/.thumbs/{id}.jpg + sets thumbnail_path.
  Skips documents with needs_conversion=true.
- app/workers/file_watcher.py (extended): scans PKM/{Inbox, Recordings,
  Videos}. Extension → category; audio → stage=stt; video .mp4/.webm →
  stage=thumbnail; video .mov/.mkv/.avi → needs_conversion=true with no
  stage. Skips paths under the settings.roon_library_path prefix.
- app/workers/queue_consumer.py (extended): registers the stt + thumbnail
  workers, BATCH_SIZE(stt=1, thumbnail=3), adds stt→[classify] to
  next_stages (audio skips extract).
- app/Dockerfile: adds ffmpeg (for the thumbnail subprocess).

API (depends on §1):
- /api/audio/{id}/segments: AudioSegment ORDER BY start_s
- /api/video/{id}/thumbnail: FileResponse for thumbnail_path (query token)
- /api/documents/{id}/file: media_types includes audio/video MIME types
  (already in the §2 commit). Starlette's FileResponse handles Range
  automatically.
- upload_document: rejects .mov/.mkv/.avi web uploads (error_code
  unsupported_codec). For NAS drops, file_watcher accepts them as
  quarantined.

Frontend:
- AudioPlayer.svelte: HTML5 audio + sticky transcript-segment panel +
  click a line to seek. activeIdx highlight.
- VideoPlayer.svelte: HTML5 video direct play + needs_conversion notice
  card. The poster uses the thumbnail endpoint.
- /audio (list grid) + /audio/[id] (player)
- /video (thumbnail grid + "needs conversion" badge) + /video/[id] (player)
- Sidebar.svelte: Mic/Film icons + audio/video nav enabled, count
  badges (reuses §2 /stats/category-counts).
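
The activeIdx highlight amounts to finding which segment contains the current playback time. The component itself is Svelte; the lookup is sketched here in Python for consistency with the rest of the repository. active_index is a hypothetical helper, and returning -1 during gaps between segments is an assumption about the desired behavior.

```python
from bisect import bisect_right


def active_index(segments: list[dict], current_time: float) -> int:
    """Index of the segment containing current_time, or -1 if none does."""
    # Assumes segments are sorted by start, as the segments endpoint returns them.
    starts = [s["start"] for s in segments]
    i = bisect_right(starts, current_time) - 1  # last segment starting at or before t
    if i >= 0 and current_time < segments[i]["end"]:
        return i
    return -1  # before the first segment, after the last, or in a gap
```

Because the server transcribes with vad_filter=True, gaps between segments are expected; a player could instead keep the previous segment highlighted by returning i whenever i >= 0.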

Settings:
- app/core/config.py: stt_endpoint + roon_library_path.

DoD post-deploy smoke: /ready cuda:true, transcribe a meeting mp3, audio
proceeds to classify without extract (queue regression), /audio playback,
.mp4 playback, .mov web upload returns 400, .mov NAS drop quarantined,
Sidebar nav + counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 06:47:36 +09:00

141 lines
3.9 KiB
Python

"""STT microservice: speech transcription with faster-whisper (GPU).

filePath → {text, segments:[{start,end,text}]}. The model is lazy-loaded on the first request.
Default model is large-v3 (VRAM ~3GB, float16); override via environment variables.
"""
import os
import unicodedata
from pathlib import Path

from fastapi import FastAPI

app = FastAPI()

# Loaded on first use by _load_model(); None until then.
_model = None
_MODEL_NAME = os.getenv("WHISPER_MODEL", "large-v3")
_DEVICE = os.getenv("WHISPER_DEVICE", "cuda")
_COMPUTE_TYPE = os.getenv("WHISPER_COMPUTE_TYPE", "float16")


def _resolve_path(file_path: str) -> Path | None:
    """Absorb NFC (DB) vs NFD (NFS) differences in Korean paths. Same pattern as the OCR service."""
    candidates = [
        file_path,
        unicodedata.normalize("NFD", file_path),
        unicodedata.normalize("NFC", file_path),
    ]
    for c in candidates:
        p = Path(c)
        if p.exists():
            return p
    # Last fallback: match the file name against the parent directory's entries in NFC.
    parent = Path(file_path).parent
    if parent.exists():
        target = unicodedata.normalize("NFC", Path(file_path).name)
        for child in parent.iterdir():
            if unicodedata.normalize("NFC", child.name) == target:
                return child
    return None


def _load_model():
    """Lazy-load faster-whisper; VRAM is only claimed on the first call."""
    global _model
    if _model is not None:
        return _model
    # Imported here so the process can start (and answer /health) before the model is needed.
    from faster_whisper import WhisperModel
    _model = WhisperModel(_MODEL_NAME, device=_DEVICE, compute_type=_COMPUTE_TYPE)
    return _model


def _cuda_device_count() -> int:
    try:
        import ctranslate2
        return ctranslate2.get_cuda_device_count()
    except Exception:
        return 0


@app.get("/health")
def health():
    """Liveness: used by the Docker healthcheck to confirm the process is alive."""
    return {"status": "ok", "service": "stt-faster-whisper"}


@app.get("/ready")
def ready():
    """Readiness: CUDA and model state. Used for deployment verification."""
    count = _cuda_device_count()
    cuda_ok = count > 0
    # The model is lazy-loaded, so this stays False until the first /transcribe call.
    models_loaded = _model is not None
    return {
        "ready": cuda_ok and models_loaded,
        "cuda": cuda_ok,
        "cuda_device_count": count,
        "models_loaded": models_loaded,
        "model": _MODEL_NAME,
        "compute_type": _COMPUTE_TYPE,
    }


@app.post("/transcribe")
async def transcribe(body: dict):
    """Transcribe an audio file.

    Request:
        {
          "filePath": "/documents/PKM/Recordings/2026-04-23_회의.mp3",
          "langs": ["ko"]?,   # a single language, or omit for auto-detection
          "beamSize": 5?      # default 5
        }
    Response:
        {
          "text": "full transcript",
          "segments": [{"start": 0.0, "end": 2.4, "text": "..."}, ...],
          "language": "ko",
          "language_probability": 0.99,
          "duration": 1832.5
        }
    """
    raw_path = body.get("filePath")
    if not raw_path:
        # Same error shape as the missing-file case instead of an unhandled KeyError.
        return {"error": "filePath is required", "text": "", "segments": []}
    langs = body.get("langs")
    beam_size = int(body.get("beamSize", 5))

    resolved = _resolve_path(raw_path)
    if resolved is None:
        return {"error": f"File not found: {raw_path}", "text": "", "segments": []}

    model = _load_model()

    # Only a single-element list pins the language; anything else means auto-detect.
    language = None
    if isinstance(langs, list) and len(langs) == 1:
        language = langs[0]

    # NOTE: this call and the loop below are blocking and run on the event loop.
    segments_iter, info = model.transcribe(
        str(resolved),
        beam_size=beam_size,
        language=language,
        vad_filter=True,
    )

    # segments_iter is a generator; decoding happens as it is consumed.
    segments = []
    parts = []
    for seg in segments_iter:
        segments.append({
            "start": round(float(seg.start), 2),
            "end": round(float(seg.end), 2),
            "text": seg.text.strip(),
        })
        parts.append(seg.text)

    return {
        "text": " ".join(p.strip() for p in parts).strip(),
        "segments": segments,
        "language": getattr(info, "language", None),
        "language_probability": float(getattr(info, "language_probability", 0.0) or 0.0),
        "duration": float(getattr(info, "duration", 0.0) or 0.0),
    }