feat(chunk): Phase 0.1 chunk 인덱싱 — ORM/worker/migration 정리

GPU 서버에 untracked로만 존재하던 Phase 0.1 코드를 정식 commit:
- app/models/chunk.py — DocumentChunk ORM (country/source/domain 메타 포함)
- app/workers/chunk_worker.py — 6가지 chunking 전략 (legal/news/markdown/email/long_pdf/default)
- migrations/014_document_chunks.sql — pgvector + FTS + trigram 인덱스
- app/models/queue.py — ProcessingQueue enum에 'chunk' stage 추가
- app/workers/queue_consumer.py — chunk stage 등록, classify→[embed,chunk] 자동 연결

Phase 1 reranker 통합 작업의 전제 조건. document_chunks 테이블 기반 retrieval에 사용.
This commit is contained in:
Hyungi Ahn
2026-04-07 13:26:37 +09:00
parent a2941487fe
commit 378fbc7845
5 changed files with 442 additions and 3 deletions

View File

@@ -14,7 +14,7 @@ class ProcessingQueue(Base):
id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
document_id: Mapped[int] = mapped_column(BigInteger, ForeignKey("documents.id"), nullable=False)
stage: Mapped[str] = mapped_column(
Enum("extract", "classify", "summarize", "embed", "preview", name="process_stage"), nullable=False
Enum("extract", "classify", "summarize", "embed", "chunk", "preview", name="process_stage"), nullable=False
)
status: Mapped[str] = mapped_column(
Enum("pending", "processing", "completed", "failed", name="process_status"),