hyungi_document_server

Author	SHA1	Message	Date
Hyungi Ahn	c6335c9a1e	fix(classify): law_monitor skip 분기 복원 + tier_backfill law 제외 PR-B refactor 과정에서 `e88640d` 의 process() 진입부 source_channel='law_monitor' skip 분기가 사라져 매일 07:00 신규 법령 분할마다 26B legacy classify(8s) + 26B legacy summarize(10s) + 4B triage(1.5s) 전부 호출되고 있었다. 법령 분리 PR (stateless-churning-raccoon) 의 명제: "법령은 외부 source-of-truth + immutable + 자동 재수집 → 다른 수명주기" 와 일치하도록 process() 진입부에 skip 분기 복원. 최소 필드 (ai_domain='법령', ai_tags=['법령'], importance='medium') 만 세팅 후 return. queue_consumer 의 NEXT_STAGES['classify']=['embed','chunk'] 가 자동 chain 하므로 검색 영향 0. 법령 도메인 AI 산출물 가치 분석: - ai_summary: 법령 해석 환각 위험 (ASME/안전 엔지니어 사고 책임 소지) - ai_tldr/bullets: 이미 title 이 같은 정보 노출 — redundant - ai_inconsistencies: 공식 정합 문서라 100% false positive → 비용 (월 ~14분 26B 점유) 대비 가치 음수, skip 합당. tier_backfill.py 도 함께 수정: - DOMAIN_PRIORITY 에서 ('law', source_channel='law_monitor') 항목 제거 - safety 필터에 source_channel != 'law_monitor' 추가 (기존 ai_domain LIKE 'Industrial_Safety%' 매칭 안에 backfill 기 처리한 법령 doc 들이 잡혀 들어가는 case 차단) - 사유: skip 처리될 doc 을 enqueue 하면 야간마다 enqueue→skip→NULL→ enqueue 무한 루프 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:35:27 +09:00
Hyungi Ahn	e2b32fe9b7	fix(ai): B-1 R2 risk_flag_requires_26b 를 hard escalate 로 승격 실측 발견 (safety 8건 재분류): - 10574 KRAS (safety_operational) → escalate=true (guard 전 pass) - 10568 JSA (safety_operational) → escalate=false suppressed=True - 10570 PPE (safety_operational) → escalate=false suppressed=True - 동일 도메인인데 4건 중 1건만 26B 처리. 같은 질의 종류 문서가 누구는 깊이 있고 누구는 짧음 → 사용자 관점 일관성 붕괴. 원인: risk_flag_requires_26b 가 soft escalate 분류 → R2 backlog guard 의 ratio 임계치(0.3) 에 걸림. 방금 classify 8건 enqueue 중 앞선 건들이 deep_summary 큐 채우자 뒤 건들이 전부 suppress. 수정: HARD_ESCALATE_REASONS 에 risk_flag_requires_26b 추가. safety/ health/chemical 등 도메인 정책 기반 escalate 는 절대 억제하지 않음. soft 영역은 여전히 남아있음: self_declare (4B 자가선언), deep_requested (recommend_deep_summary). 이 둘만 backlog guard 가 억제 대상. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 12:33:12 +09:00
Hyungi Ahn	165b00f917	fix(ai): B-1 subject_domain 매칭 + RoutingDecision.escalate_to_26b 존중 실측 발견 (safety md 8건 tier triage 결과): 1. 분류 오분류: 본문에 "MSDS" 한 번 스쳐도 msds 도메인 매칭됨. 개인보호구/중대재해/밀폐공간/산업안전보건법 전부 msds 로 잘못 판정. 2. RoutingDecision 무시: PR-A domain_policy 의 high_impact=true 와 risk_flag_requires_26b 때문에 RoutingDecision.escalate_to_26b=True 이지만 내 _classify_escalation_reason 이 이걸 안 봐서 escalate=False 로 마감. safety/msds/hazard_specific 전부 4B 만 돌고 26B 정책 우회. 수정: - _match_subject_domain: (a) title 기반 매칭 우선 추가 — 파일명이 의도의 1차 시그널. (b) 본문 키워드는 2회 이상 등장해야 match (single-mention 오분류 방지). 우선순위도 재배열 (msds 맨 앞 → hazard/safety 뒤로). - _classify_escalation_reason: routing_decision 파라미터 추가. 4B 자체 판정 (long_context / low_confidence / self_declare / deep_requested) 이후 PR-A routing_decision.escalate_to_26b 가 True 이면 그 escalation_reasons 중 "high_impact" 외의 구체 사유(risk_flag_requires_26b 등) 를 채택. - _run_tier_triage: routing_decision 을 먼저 계산하여 _classify_escalation_reason 에 전달. _apply_triage_result 는 routing_decision 을 param 으로 받음 (중복 계산 제거). 이 변경 후 safety/msds/hazard_specific/incident_report 도메인 문서는 항상 26B escalate → deep_summary 큐. MLX 부하 증가하지만 plan 의도대로 정책 준수. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:18:59 +09:00
Hyungi Ahn	f872e4666f	fix(ai): B-1 envelope.from_stage PR-A enum 값으로 정정 doc 5260 (confidence 0.3 low_confidence 에스컬레이션) 실측에서 발견: EscalationEnvelope(from_stage='summary_triage') 가 PR-A ValidFromStage ({triage, summarize_short, advice_trigger, classify, night_sweep, ask_pre, unknown}) 에 없어 ValueError 발생 → 모든 deep_summary enqueue 가 envelope 생성 단계에서 터짐. tldr/bullets 기록은 envelope 실패 전에 완료되어 영향 없음 (try/except 가 classify 전체는 보호). P3a short summary 에서의 에스컬레이션 의미에 맞춰 'summarize_short' 로 변경. 내부 task 이름 (SUMMARY_TRIAGE_TASK = 'p3a_short_summary') 는 analyze_events. prompt_version 기록 전용이라 그대로 유지. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 11:04:47 +09:00
Hyungi Ahn	6fdc48e5b6	feat(ai): B-1 summary tier 분할 — triage(4B) + deep_summary(26B) PR-A policy 레이어를 재사용하여 classify_worker 에 tier triage 경로를 추가. Legacy ai_summary / ai_domain / ai_suggestion 은 유지 (회귀 0), tldr/bullets/ detail/inconsistencies 는 별도 필드로 분리. Migrations (156~160): - 156 documents: ai_tldr, ai_bullets, ai_detail_summary, ai_inconsistencies, ai_analysis_tier 5컬럼 - 157 process_stage 에 'deep_summary' ADD VALUE 단독 (Postgres 동일 트랜잭션 제약 회피) - 158 processing_queue.payload JSONB (envelope 전달) - 159 analyze_events 에 tier + suppressed_reason - 160 suppressed_reason partial index Models/ORM: - Document: 5컬럼 Mapped 추가 - ProcessingQueue: deep_summary enum 확장 + payload 필드, enqueue_stage 에 payload 옵션 - AnalyzeEvent: PR-A shadow 6컬럼 + PR-B tier/suppressed_reason Workers: - classify_worker: 기존 legacy 경로 뒤에 _run_tier_triage 추가. - _match_subject_domain(doc, text): source_channel + 본문 keywords + ai_domain prefix 로 PR-A policy 의 subject_domain 이름 결정 (category 매칭 금지). - R1 TriageOutput pydantic + JSON 깨짐 fallback (triage_json_invalid). - R2 _check_backlog_guard(): 30분 window ratio > threshold OR pending 초과면 soft escalate suppress. hard escalate 는 통과. - R3 _slice_text_ranges(): 260k 초과 시 head 120k + mid 20k + tail 120k 3조각. - escalate 시 EscalationEnvelope 구성 + {envelope, subject_domain} payload 로 deep_summary enqueue. - deep_summary_worker (신규): queue payload 에서 envelope + subject_domain 읽기 → render_26b("p3c_deep_summary", subject_domain) + MLX 호출 (llm_gate Semaphore(1) 경유) → ai_detail_summary + ai_inconsistencies 저장 + ai_analysis_tier='deep'. _filter_inconsistencies 로 허용 kind (version_drift / procedure_conflict / source_conflict / missing_basis) 만 통과 — 구매/계약 kind drop. - queue_consumer: workers dict 에 deep_summary 추가 + BATCH_SIZE=1. next_stages 는 건드리지 않음 — classify → embed/chunk 는 그대로, deep_summary 는 독립 체인. Telemetry: - record_analyze_event: subject_domain / risk_flags / escalation_reasons / confidence / policy_version / shadow_would_route_to / tier / escalated_to_26b / suppressed_reason 파라미터 확장. classify/deep worker 가 mode="summary_triage" 또는 "summary_deep" 로 기록. API: - DocumentResponse 에 ai_tldr / ai_bullets / ai_detail_summary / ai_inconsistencies / ai_analysis_tier 5필드 노출. Prompts: - classify.txt 에 DEPRECATED 주석만 추가 (파일 유지 — rollback 경로 보존). - PR-A 의 app/prompts/policy/p3a_short_summary.txt (4B) 와 p3c_deep_summary.txt (26B) 를 그대로 사용. 내 소유의 summary_triage.txt / summary_deep.txt 는 중복 이라 별도 커밋에서 제거하지 않고 바로 생성 전 삭제. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 10:22:40 +09:00
Hyungi Ahn	e88640d3d8	feat(category): law 카테고리 분리 — enum + backfill + classify skip - migrations/152: ALTER TYPE doc_category ADD VALUE 'law' (DDL only; PG16 단일-트랜잭션 제약상 backfill 은 별도) - models/document.py: Enum 에 'law' 추가 (7 활성 + 3 유보) - workers/law_monitor.py: Document(..., category='law') — 신규 유입부터 세팅 - workers/classify_worker.py: source_channel='law_monitor' early-return + 최소 필드 (ai_domain='법령', ai_tags=['법령'], importance='medium'). AI classify skip — 법령 구조 고정/외부 source of truth/자동 재수집 - scripts/backfill_category.py: law 분기 + WHERE re-target ((source_channel='law_monitor' AND category='document')) + VERIFY cat_law/law_source_count + fail 조건 - api/documents.py: default 목록 제외에 law_monitor 추가 (news 와 동일 패턴) - api/dashboard.py: documents count FILTER 에 law_monitor 제외 (category_counts.law 는 기존 GROUP BY category 로 자동 노출) - frontend/Sidebar.svelte: '법령 알림' 버튼 ?source=law_monitor → ?category=law (explicit category 경로가 default exclusion 을 skip) plan: ~/.claude/plans/stateless-churning-raccoon.md axis 원칙: category=UI 축, policy/telemetry=source_channel+ai_domain 축 (feedback_category_vs_ai_domain_axis.md) 배포 순서: push → GPU pull → compose up --build fastapi frontend → backfill --dry-run → --apply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 09:14:56 +09:00
Hyungi Ahn	8fdea88676	feat(documents): §1 category enum + ai_suggestion 승인 파이프 plan: ~/.claude/plans/luminous-sprouting-hamster.md §1 - migrations/143_category.sql: doc_category enum (6 활성 + 3 유보) + documents.category + documents.ai_suggestion JSONB + 2 idx. - app/models/document.py: category (Enum, create_type=False), ai_suggestion (JSONB). - app/prompts/classify.txt: document_type enum 에 7 실무 doctype 추가 (발주서/세금계산서/명세표/도면/증명서/계획서/시방서) + facet_doctype 필드 directive. - config.yaml: document_types 에 7 항목 추가 (worker 검증 통과). - app/workers/classify_worker.py: FACET_DOCTYPES / LIBRARY_SUGGESTION_DOCTYPES 상수, facet_doctype 파싱(기존값 미덮어씀), 발주서/세금계산서/명세표 감지 시 ai_suggestion={proposed_category=library, proposed_path=@library/ 거래/{YYYY}/{doctype}, source_updated_at=doc.updated_at.isoformat(), ...}. category / user_tags 자동 전이 금지 (suggestion-only). - app/api/documents.py: · DocumentResponse 에 category / ai_suggestion 노출 · GET /documents ?category=<cat> / ?has_suggestion / ?proposed_category (category 지정 시 기본 news/memo 제외 해제 — §2 승인 UI 계약) · GET /documents/library 를 Document.category=='library' 기반으로 재구현 (path subquery 는 user_tags 유지 — 분류 내부 서가 경로) · POST /documents/{id}/accept-suggestion — FOR UPDATE + idempotent no-op + dual 409 stale (payload source_updated_at / documents.updated_at) + user_tags idempotent append · DELETE /documents/{id}/suggestion — idempotent, stale 검사 없음 - scripts/backfill_category.py: dry-run / apply. 매핑(news/memo/@library/else) + 3-way 상대 검증 (all_rows==categorized, uncategorized==0, cat_library==has_library_tag — 자동 전이 금지 정책 검증). 남은 DoD (원격 배포 후): docker compose up → migration 143 적용 → backfill apply → smoke (drive_sync 발주서 업로드 suggestion 생성 / category 유지, accept-suggestion idempotency + 409 stale 두 벡터, /documents?category=library == /documents/library 건수 일치). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:32:01 +09:00
Hyungi Ahn	5070ac45ff	fix(extract): LibreOffice 추출 절단 제거 및 요약 입력 확대 - extract_worker: LibreOffice 15000자 절단 제거 (full text 저장 원칙) - classify_worker/summarize_worker: 요약 입력 15000→50000자 확대 - client.py: 길이 기반 Claude 자동전환 제거 (require_explicit_trigger 정책 준수) _call_chat의 primary→fallback(exaone3.5) 체인은 유지 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:00:23 +09:00
Hyungi Ahn	5c58778a41	feat(library): doc_purpose 필드 + 자료실 업로드 기능 지식/업무 문서 1차 구분을 위한 doc_purpose(business\|knowledge) 추가. - 마이그레이션: document_purpose enum + 컬럼 - AI 분류: docPurpose 자동 추론 (빈 값만 채움) - 업로드 API: doc_purpose + library_path Form 파라미터 - 자료실 업로드: business 기본값 + 선택 경로 자동 태깅 - FileInfoView: 용도 select (수동 변경, 실패 롤백) - DocumentCard: 업무/참조 배지 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:26:59 +09:00
Hyungi Ahn	24142ea605	fix: Codex 리뷰 5건 수정 (critical 1 + high 4) 1. [critical] config.yaml → settings 객체에서 taxonomy 로드 (import crash 방지) 2. [high] ODF 변환: file_path 유지, derived_path 별도 필드 (무한 중복 방지) 3. [high] 법령 분할: 첫 장 이전 조문을 "서문"으로 보존 4. [high] Inbox: review_status 필드 분리 (pending/approved/rejected) 5. [high] 삭제: soft-delete (deleted_at) + worker 방어 + active_documents 뷰 - 모든 조회에 deleted_at IS NULL 일관 적용 - queue_consumer: row 없으면 gracefully skip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 07:15:13 +09:00
Hyungi Ahn	6d73e7ee12	feat: 분류 체계 전면 개편 — taxonomy + document_type + confidence - config.yaml: 6개 domain × 3단계 taxonomy + 13개 document_types 정의 - classify.txt: 영문 프롬프트, taxonomy 경로 기반 분류 + 분류 규칙 주입 - classify_worker: taxonomy 검증, confidence 기반 분류, document_type 저장 - migration 008: document_type, importance, ai_confidence 컬럼 - API: DocumentResponse에 document_type, importance, ai_confidence 추가 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 13:32:20 +09:00
Hyungi Ahn	733f730e16	fix: preview enum 누락 + AI summary thinking 제거 + CLAUDE.md 전면 갱신 - queue.py: process_stage enum에 'preview' 추가 - classify_worker: ai_summary에 strip_thinking() 적용 - CLAUDE.md: 현재 아키텍처 전면 반영 (파이프라인, UI, 인프라, 코딩규칙) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 12:38:59 +09:00
Hyungi Ahn	6893ea132d	refactor: preview 병렬 트리거 + 파일 이동 제거 + domain 색상 바 - queue_consumer: extract 완료 시 classify + preview 동시 등록 - classify_worker: _move_to_knowledge() 제거, 파일 원본 위치 유지 - DocumentCard: 좌측 domain별 색상 바 (4px) 추가 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 12:31:57 +09:00
Hyungi Ahn	d93e50b55c	security: fix 5 review findings (2 high, 3 medium) HIGH: - Lock setup TOTP/NAS endpoints behind _require_setup() guard (prevented unauthenticated admin 2FA takeover after setup) - Sanitize upload filename with Path().name + resolve() validation (prevented path traversal writing outside Inbox) MEDIUM: - Add score > 0.01 filter to hybrid search via subquery (prevented returning irrelevant documents with zero score) - Implement Inbox → Knowledge file move after classification (classify_worker now moves files based on ai_domain) - Add Anthropic Messages API support in _request() (premium/Claude path now sends correct format and parses content[0].text instead of choices[0].message.content) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 15:33:31 +09:00
Hyungi Ahn	299fac3904	feat: implement Phase 1 data pipeline and migration - Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib, text files direct read, images flagged for OCR) - Add queue consumer with APScheduler (1min interval, stage chaining extract→classify→embed, stale item recovery, retry logic) - Add extract worker (kordoc HTTP call + direct text read) - Add classify worker (Qwen3.5 AI classification with think-tag stripping and robust JSON extraction from AI responses) - Add embed worker (GPU server nomic-embed-text, graceful failure) - Add DEVONthink migration script with folder mapping for 16 DBs, dry-run mode, batch commits, and idempotent file_path UNIQUE - Enhance ai/client.py with strip_thinking() and parse_json_response() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:35:36 +09:00

15 Commits