hyungi_document_server

Author	SHA1	Message	Date
Hyungi Ahn	010e25cb23	fix(queue): metadata-based doc-level embedding + NUL byte strip + empty-exception fallback embed_worker: - embedding input changed from extracted_text[:6000] to metadata: title + ai_summary + tags (top 5) - fixes a structural bug where the cover page and table of contents of a 500k-character document were what got embedded - safe within Ollama's default context (about 1,500 characters or fewer); no num_ctx tuning needed - falls back to 800 characters of body text when ai_summary is shorter than 50 characters - ai_domain excluded initially (to avoid taxonomy noise) extract_worker: - strip \x00 on all three paths (kordoc / direct read / LibreOffice) - prevents recurrence of asyncpg CharacterNotInRepertoireError queue_consumer: - str(e) or repr(e) or type(e).__name__ fallback - exceptions with empty messages (24 occurrences) now record at least the class name plan: ~/.claude/plans/quiet-meandering-nova.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:45:55 +09:00
Hyungi Ahn	24142ea605	fix: address 5 Codex review findings (1 critical + 4 high) 1. [critical] config.yaml → taxonomy now loaded from the settings object (prevents import crash) 2. [high] ODF conversion: keep file_path, store derived_path in a separate field (prevents infinite duplication) 3. [high] statute splitting: preserve articles that precede the first chapter as a preamble ("서문") 4. [high] Inbox: split out a review_status field (pending/approved/rejected) 5. [high] deletion: soft-delete (deleted_at) + worker guards + active_documents view - deleted_at IS NULL applied consistently to all queries - queue_consumer: gracefully skip when the row is missing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 07:15:13 +09:00
Hyungi Ahn	1b5fa95a9f	feat: Office → ODF conversion + architecture separating the original from the editable copy - add original_path/format/hash + conversion_status fields (migration 007) - extract_worker: after text extraction, convert to ODF (xlsx→ods, docx→odt, etc.) - converted copy is saved to .derived/{doc_id}.ods - original metadata preserved (original_path/format/hash) - file_watcher: exclude the .derived/ and .preview/ directories - DocumentViewer: show the edit button automatically for ODF formats - label is "Edit" when edit_url is present, otherwise "Open in Synology Drive" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 13:11:43 +09:00
Hyungi Ahn	03b0612aa2	fix: add the missing return in the extract_worker OFFICE_FORMATS block Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 11:28:09 +09:00
Hyungi Ahn	a5186bf4aa	fix: spreadsheet text extraction: use the csv filter (Calc does not support txt:Text) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 11:21:29 +09:00
Hyungi Ahn	b37043d651	fix: LibreOffice compatibility with Korean filenames: copy to a temp file with an English name, then convert. Applied to both extract_worker and preview_worker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 11:18:06 +09:00
Hyungi Ahn	45448b4036	feat: add LibreOffice text extraction to extract_worker (office formats) - supports xlsx, docx, pptx, odt, ods, odp, odoc, osheet - extracts text with LibreOffice --convert-to txt (60s timeout) - no additional dependencies (uses the LibreOffice already installed in Docker) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 11:12:19 +09:00
Hyungi Ahn	299fac3904	feat: implement Phase 1 data pipeline and migration - Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib, text files direct read, images flagged for OCR) - Add queue consumer with APScheduler (1min interval, stage chaining extract→classify→embed, stale item recovery, retry logic) - Add extract worker (kordoc HTTP call + direct text read) - Add classify worker (Qwen3.5 AI classification with think-tag stripping and robust JSON extraction from AI responses) - Add embed worker (GPU server nomic-embed-text, graceful failure) - Add DEVONthink migration script with folder mapping for 16 DBs, dry-run mode, batch commits, and idempotent file_path UNIQUE - Enhance ai/client.py with strip_thinking() and parse_json_response() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:35:36 +09:00

8 Commits
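
The three fixes described in commit 010e25cb23 (metadata-based embedding input, NUL stripping, and the exception-message fallback) can be illustrated with a minimal Python sketch. This is not the repository's code: the function names, the constants, the newline/comma joining, and the choice to substitute the body excerpt for a short summary are all assumptions made for illustration. Only the thresholds (50 and 800 characters, top 5 tags) and the `str(e) or repr(e) or type(e).__name__` expression come from the commit message.

```python
from typing import Optional, Sequence

# Thresholds taken from the commit message; the names are hypothetical.
MIN_SUMMARY_CHARS = 50
BODY_FALLBACK_CHARS = 800
MAX_TAGS = 5


def strip_nul(text: str) -> str:
    """Remove NUL characters, which PostgreSQL text columns reject."""
    return text.replace("\x00", "")


def build_embed_input(
    title: Optional[str],
    ai_summary: Optional[str],
    tags: Optional[Sequence[str]],
    extracted_text: Optional[str],
) -> str:
    """Build a short doc-level embedding input from metadata.

    Assumption: when the summary is too short, the body excerpt
    replaces it (the commit message only says "fallback").
    """
    summary = (ai_summary or "").strip()
    if len(summary) < MIN_SUMMARY_CHARS:
        summary = (extracted_text or "")[:BODY_FALLBACK_CHARS]
    tag_line = ", ".join(list(tags or [])[:MAX_TAGS])
    # Assumption: parts are newline-joined and empty parts are skipped.
    parts = [title or "", summary, tag_line]
    return "\n".join(p for p in parts if p)


def describe_exception(e: BaseException) -> str:
    """Return a non-empty description even when str(e) is empty."""
    return str(e) or repr(e) or type(e).__name__
```

With this sketch, an exception raised without a message is still logged with its class name, and a 500k-character document contributes at most its title, a summary or 800-character excerpt, and five tags to the embedding input.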