feat: implement Phase 1 data pipeline and migration

- Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib,
  text files direct read, images flagged for OCR)
- Add queue consumer with APScheduler (1min interval, stage chaining
  extract→classify→embed, stale item recovery, retry logic)
- Add extract worker (kordoc HTTP call + direct text read)
- Add classify worker (Qwen3.5 AI classification with think-tag
  stripping and robust JSON extraction from AI responses)
- Add embed worker (GPU server nomic-embed-text, graceful failure)
- Add DEVONthink migration script with folder mapping for 16 DBs,
  dry-run mode, batch commits, and idempotent file_path UNIQUE
- Enhance ai/client.py with strip_thinking() and parse_json_response()
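The think-tag stripping and robust JSON extraction mentioned above could look roughly like this. This is a minimal sketch: the `strip_thinking()` and `parse_json_response()` names come from the commit message, but their bodies, signatures, and regexes here are assumptions, not the actual ai/client.py code.

```python
import json
import re


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks that reasoning models may emit."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a free-form AI response."""
    cleaned = strip_thinking(text)
    # Try the whole response first, then fall back to the first {...} span.
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", cleaned, flags=re.DOTALL)
        if match is None:
            raise ValueError("no JSON object found in AI response")
        return json.loads(match.group(0))
```

A fallback like this keeps classification working even when the model wraps its answer in prose or chain-of-thought markup.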

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author: Hyungi Ahn
Date:   2026-04-02 14:35:36 +09:00
parent 23ee055357
commit 299fac3904
9 changed files with 682 additions and 13 deletions


@@ -16,11 +16,21 @@ from models.user import User
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     """Lifespan handler run at app startup/shutdown"""
+    from apscheduler.schedulers.asyncio import AsyncIOScheduler
+    from workers.queue_consumer import consume_queue
     # Startup: verify the DB connection
     await init_db()
-    # TODO: start APScheduler (Phase 3)
+    # APScheduler: run the queue consumer at a 1-minute interval
+    scheduler = AsyncIOScheduler()
+    scheduler.add_job(consume_queue, "interval", minutes=1, id="queue_consumer")
+    scheduler.start()
     yield
-    # Shutdown: dispose of the DB engine
+    # Shutdown: clean up the scheduler first, then the DB
+    scheduler.shutdown(wait=False)
     await engine.dispose()
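The stage chaining and retry logic the queue consumer bullet describes can be sketched as below. This is an illustrative sketch only: `run_stage`, the item dict shape, and the retry count are hypothetical, not the actual workers.queue_consumer implementation.

```python
import asyncio

STAGES = ["extract", "classify", "embed"]
MAX_RETRIES = 3


async def run_stage(stage: str, item: dict) -> None:
    # Placeholder for the real extract/classify/embed workers.
    item["done"].append(stage)


async def consume_item(item: dict) -> None:
    """Advance one queue item through the stage chain with simple retries."""
    # Resume from the item's current stage so recovered stale items
    # do not repeat work that already succeeded.
    for stage in STAGES[STAGES.index(item["stage"]):]:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                await run_stage(stage, item)
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    item["status"] = "failed"
                    return
        item["stage"] = stage
    item["status"] = "done"
```

Recording the current stage on the item is what makes stale-item recovery cheap: a crashed run can be re-enqueued and picks up at the stage where it stopped.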