Commit Graph

16 Commits

Author SHA1 Message Date
Hyungi Ahn
6d73e7ee12 feat: 분류 체계 전면 개편 — taxonomy + document_type + confidence
- config.yaml: 6개 domain × 3단계 taxonomy + 13개 document_types 정의
- classify.txt: 영문 프롬프트, taxonomy 경로 기반 분류 + 분류 규칙 주입
- classify_worker: taxonomy 검증, confidence 기반 분류, document_type 저장
- migration 008: document_type, importance, ai_confidence 컬럼
- API: DocumentResponse에 document_type, importance, ai_confidence 추가

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:32:20 +09:00
Hyungi Ahn
1b5fa95a9f feat: 오피스 → ODF 변환 + 원본/편집본 분리 아키텍처
- original_path/format/hash + conversion_status 필드 추가 (migration 007)
- extract_worker: 텍스트 추출 후 xlsx→ods, docx→odt 등 ODF 변환
  - 변환본은 .derived/{doc_id}.ods 에 저장
  - 원본 메타 보존 (original_path/format/hash)
- file_watcher: .derived/ .preview/ 디렉토리 제외
- DocumentViewer: ODF 포맷이면 편집 버튼 자동 표시
  - edit_url 있으면 "편집", 없으면 "Synology Drive에서 열기"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:11:43 +09:00
Hyungi Ahn
733f730e16 fix: preview enum 누락 + AI summary thinking 제거 + CLAUDE.md 전면 갱신
- queue.py: process_stage enum에 'preview' 추가
- classify_worker: ai_summary에 strip_thinking() 적용
- CLAUDE.md: 현재 아키텍처 전면 반영 (파이프라인, UI, 인프라, 코딩규칙)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 12:38:59 +09:00
Hyungi Ahn
6893ea132d refactor: preview 병렬 트리거 + 파일 이동 제거 + domain 색상 바
- queue_consumer: extract 완료 시 classify + preview 동시 등록
- classify_worker: _move_to_knowledge() 제거, 파일 원본 위치 유지
- DocumentCard: 좌측 domain별 색상 바 (4px) 추가

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 12:31:57 +09:00
Hyungi Ahn
03b0612aa2 fix: extract_worker OFFICE_FORMATS 블록에 return 누락 수정
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:28:09 +09:00
Hyungi Ahn
a5186bf4aa fix: 스프레드시트 텍스트 추출 — csv 필터 사용 (txt:Text는 Calc 미지원)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:21:29 +09:00
Hyungi Ahn
b37043d651 fix: LibreOffice 한글 파일명 호환 — 영문 임시파일로 복사 후 변환
extract_worker, preview_worker 모두 적용.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:18:06 +09:00
Hyungi Ahn
45448b4036 feat: extract_worker에 LibreOffice 텍스트 추출 추가 (오피스 포맷)
- xlsx, docx, pptx, odt, ods, odp, odoc, osheet 지원
- LibreOffice --convert-to txt로 텍스트 추출 (60s timeout)
- 추가 의존성 없음 (Docker에 이미 설치된 LibreOffice 사용)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:12:19 +09:00
Hyungi Ahn
4bea408bbd feat: Markdown 편집기 + PDF 변환 파이프라인 + 뷰어 포맷 분기
- Markdown split editor: textarea + marked preview, Ctrl+S 저장
- PUT /api/documents/{id}/content: 원본 파일 저장 + extracted_text 갱신
- GET /api/documents/{id}/preview: PDF 미리보기 캐시 서빙
- preview_worker: LibreOffice headless → PDF 변환 (timeout 60s, retry 1회)
- queue_consumer: preview stage 추가 (embed 후 자동 트리거)
- DocumentViewer: 포맷별 분기 (markdown/pdf/preview-pdf/image/text/cad)
- 오피스/CAD 문서: 새 탭 편집 버튼
- Dockerfile: LibreOffice headless 설치
- migration 005: preview_status, preview_hash, preview_at 컬럼

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 10:10:03 +09:00
Hyungi Ahn
62f5eccb96 fix: isolate each worker call in independent async session
Shared session between queue consumer and workers caused
MissingGreenlet errors in APScheduler context. Each worker
call now gets its own session with explicit commit/rollback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:29:14 +09:00
Hyungi Ahn
46537ee11a fix: Codex 리뷰 P1/P2 버그 4건 수정
- [P1] migration runner 도입: schema_migrations 추적, advisory lock,
  단일 트랜잭션 실행, SQL 검증 (기존 DB 업그레이드 대응)
- [P1] eml extract 큐 조건 분기: extract_worker 미지원 포맷 큐 스킵
- [P2] iCalendar escape_ical_text() 추가: RFC 5545 준수
- [P2] 이메일 charset 감지: get_content_charset() 사용 + payload None 방어

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:55:38 +09:00
Hyungi Ahn
d93e50b55c security: fix 5 review findings (2 high, 3 medium)
HIGH:
- Lock setup TOTP/NAS endpoints behind _require_setup() guard
  (prevented unauthenticated admin 2FA takeover after setup)
- Sanitize upload filename with Path().name + resolve() validation
  (prevented path traversal writing outside Inbox)

MEDIUM:
- Add score > 0.01 filter to hybrid search via subquery
  (prevented returning irrelevant documents with zero score)
- Implement Inbox → Knowledge file move after classification
  (classify_worker now moves files based on ai_domain)
- Add Anthropic Messages API support in _request()
  (premium/Claude path now sends correct format and parses
  content[0].text instead of choices[0].message.content)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:33:31 +09:00
Hyungi Ahn
31d5498f8d feat: implement Phase 3 automation workers
- Add automation_state table for incremental sync (last UID, last check)
- Add law_monitor worker: 국가법령정보센터 API → NAS/DB/CalDAV VTODO
  (LAW_OC 승인 대기 중, 코드 완성)
- Add mailplus_archive worker: IMAP(993) → .eml NAS save + DB + SMTP
  notification (imaplib via asyncio.to_thread, timeout=30)
- Add daily_digest worker: PostgreSQL/pipeline stats → Markdown + SMTP
  (documents, law changes, email, queue errors, inbox backlog)
- Add CalDAV VTODO helper and SMTP email helper to core/utils.py
- Wire 3 cron jobs in APScheduler (law@07:00, mail@07:00+18:00,
  digest@20:00) with timezone=Asia/Seoul

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:24:50 +09:00
Hyungi Ahn
4b695332b9 feat: implement Phase 2 core features
- Add document CRUD API (list/get/upload/update/delete with auth)
  - Upload saves to Inbox + auto-enqueues processing pipeline
  - Delete defaults to DB-only, explicit flag for file deletion
- Add hybrid search API (FTS 0.4 + trigram 0.2 + vector 0.4 weighted)
  - Modes: fts, trgm, vector, hybrid (default)
  - Vector search gracefully degrades if GPU unavailable
- Add Inbox file watcher (5min interval, new file + hash change detection)
- Register documents/search routers and file_watcher scheduler in main.py
- Add IVFFLAT vector index migration (lists=50, with tuning guide)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:49:12 +09:00
Hyungi Ahn
299fac3904 feat: implement Phase 1 data pipeline and migration
- Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib,
  text files direct read, images flagged for OCR)
- Add queue consumer with APScheduler (1min interval, stage chaining
  extract→classify→embed, stale item recovery, retry logic)
- Add extract worker (kordoc HTTP call + direct text read)
- Add classify worker (Qwen3.5 AI classification with think-tag
  stripping and robust JSON extraction from AI responses)
- Add embed worker (GPU server nomic-embed-text, graceful failure)
- Add DEVONthink migration script with folder mapping for 16 DBs,
  dry-run mode, batch commits, and idempotent file_path UNIQUE
- Enhance ai/client.py with strip_thinking() and parse_json_response()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:35:36 +09:00
Hyungi Ahn
131dbd7b7c feat: scaffold v2 project structure with Docker, FastAPI, and config
동작하는 최소 코드 수준의 v2 스캐폴딩:

- docker-compose.yml: postgres, fastapi, kordoc, frontend, caddy
- app/: FastAPI 백엔드 (main, core, models, ai, prompts)
- services/kordoc/: Node.js 문서 파싱 마이크로서비스
- gpu-server/: AI Gateway + GPU docker-compose
- frontend/: SvelteKit 기본 구조
- migrations/: PostgreSQL 초기 스키마 (documents, tasks, processing_queue)
- tests/: pytest conftest 기본 설정
- config.yaml, Caddyfile, credentials.env.example 갱신

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:20:15 +09:00