feat(extract): OCR trigger rules + extract_meta JSONB

Automatic OCR triggering for scanned PDFs and images, with result quality validation and a one-attempt limit.

- Add extract_meta JSONB column (migration 134):
  ocr_attempted, ocr_reason, ocr_skip_reason, ocr_terminal, ocr_chars
- PDF OCR trigger: total_chars < 300, or avg < 80 && total < 3000
- Automatic OCR for images: jpg/png/tiff/webp, etc.
- Quality thresholds by type: 50 chars for images; 200 chars or 30 chars per page for PDFs
- Upper limits: pages > 200 or file_size > 150MB → skip
- One-attempt OCR limit: extract_meta.ocr_attempted prevents retries
- extractor_version holds the tool name only (surya_ocr/pymupdf/kordoc)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
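The rules listed in the message can be sketched roughly as follows. This is an illustrative reading of the thresholds, not the code from this commit: the function names, the `OcrDecision` type, the reason strings, the order of the checks, the interpretation of `avg` as characters per page, the use of `>=` for the quality thresholds, and treating 150MB as 150 * 2**20 bytes are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Limits taken from the commit message; the byte interpretation of "MB" is assumed.
MAX_PAGES = 200
MAX_FILE_SIZE = 150 * 1024 * 1024


@dataclass
class OcrDecision:
    run_ocr: bool
    reason: Optional[str] = None       # hypothetical value for ocr_reason
    skip_reason: Optional[str] = None  # hypothetical value for ocr_skip_reason


def decide_pdf_ocr(total_chars: int, pages: int, file_size: int,
                   extract_meta: dict) -> OcrDecision:
    """Decide whether a PDF should be sent to OCR (hypothetical helper)."""
    # One-attempt limit: never retry once ocr_attempted is set.
    if extract_meta.get("ocr_attempted"):
        return OcrDecision(False, skip_reason="already_attempted")
    # Upper limits: very large documents are skipped.
    if pages > MAX_PAGES:
        return OcrDecision(False, skip_reason="too_many_pages")
    if file_size > MAX_FILE_SIZE:
        return OcrDecision(False, skip_reason="file_too_large")
    # Trigger: total_chars < 300, or (avg < 80 and total < 3000).
    if total_chars < 300:
        return OcrDecision(True, reason="low_total_chars")
    avg = total_chars / max(pages, 1)  # assumed: average characters per page
    if avg < 80 and total_chars < 3000:
        return OcrDecision(True, reason="low_avg_chars")
    return OcrDecision(False, skip_reason="text_layer_sufficient")


def ocr_quality_ok(ocr_chars: int, pages: int, is_image: bool) -> bool:
    """Check an OCR result against the per-type quality thresholds."""
    if is_image:
        return ocr_chars >= 50
    # PDFs: 200 characters overall, or 30 characters per page.
    return ocr_chars >= 200 or ocr_chars / max(pages, 1) >= 30
```

Under this reading, a 40-page PDF with 2000 extracted characters (50 per page) would be sent to OCR, while a 10-page PDF with the same total would not.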
2026-04-15 15:04:13 +09:00
parent 7883ac67b3
commit 088966bf78
4 changed files with 191 additions and 35 deletions
@@ -0,0 +1,2 @@
+ALTER TABLE documents ADD COLUMN IF NOT EXISTS extract_meta JSONB DEFAULT '{}';
+COMMENT ON COLUMN documents.extract_meta IS 'OCR decision/execution metadata: ocr_attempted, ocr_reason, ocr_skip_reason, ocr_chars, pymupdf_chars, ocr_terminal';