1 Commits

Author SHA1 Message Date
hyungi 329c9eac76 feat(documents): PR-Chore-OCR-Column-1 add ocr_derived column
RAG-independent data hygiene. ocr_derived 식별 컬럼 부재 = PR-Eval-V0_2
TBD-O FAILED 원인. 향후 OCR/Marker Diagnose, markdown 품질 분류,
ingest 품질 통계 어디에서나 재사용 가능.

Schema: documents.ocr_derived BOOLEAN NOT NULL DEFAULT false.
Backfill rule R1 단독 (실측 audit 후): extract_meta ? ocr_attempted
AND ocr_attempted = true. 8 rows true / 21727 false.

R2 (file_format IN png/jpg) 폐기 — 1건 R1 흡수 + 1건 marker 미처리.
R3 (marker PDF extract_meta 부재 283 rows) 폐기 — born-digital
false positive 위험. UPDATE 전 candidate preview + source rule별
count + 표본 audit gate 통과 후 적용.

asyncpg single-statement 제약으로 ALTER (277) + UPDATE (278) 분리.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 06:11:29 +00:00