hyungi_document_server

Author	SHA1	Message	Date
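hyungi	c11f113cf1	fix(workers): block silent completion: transient re-raise + enqueue isolation (R3) When worker_fn swallows a transient failure and returns normally, queue_consumer finalizes status=completed → permanent data loss with zero retries or traceability. Unified the places that diverged from the canonical behavior (extract/marker/fulltext/stt re-raise): - deep_summary: call failures (call_failed) are no longer swallowed but raised → retry → failed dead-letter (previously ai_detail_summary was permanently missing and the tier stayed stuck at triage). - thumbnail: _extract_thumbnail failure changed from silent return → raise (prevents permanently missing thumbnails). - queue_consumer: enqueue_next_stage after the completion commit (2 places: normal and skip-note) is isolated in its own try, fixing the defect where an enqueue failure propagated to the outer except and reopened a completed item (re-running the stage). Failures are surfaced at ERROR level. - broad except blocks now explicitly let asyncio.CancelledError pass through (embed worker / ask classifier and verifier). dead-letter = ProcessingQueue.status='failed' (reuses the existing attempts/max_attempts machinery, no new column needed). Verification: py_compile passes. A synthetic smoke test of queue retry semantics (staging) is planned. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 13:24:25 +09:00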
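hyungi	a82b0724df	fix(news): single-source the digest/briefing generation LLM timeout gate + split out a deep_summary consumer The timeout-raising sweep for the 2026-06-11 Mac mini model swap (Gemma4 26B→Qwen3.6-27B-6bit, ~90-300s per call) updated only config.yaml/synthesis and missed the hard-coded LLM_CALL_TIMEOUT=25 (tuned for the fast Gemma) in the digest/briefing code → digest exceeded its 600s hard cap and has not been generated since 06-10, and briefing fell back on 4/4 LLM calls (status=failed). (Blocker corrected by adversarial review: with a private concurrency=1 semaphore, digest's 44-68 clusters would still hit the hard cap + it violates the permanent llm_gate rule.) - Moved timeout, retry, and hard cap to config.pipeline as the single source (digest_llm_timeout_s=300, attempts=2, pipeline_hard_cap_s=3000). Prevents recurrence at the next model swap. - digest/briefing LLM calls drop the private Semaphore and go through the global MLX gate (BACKGROUND): complies with the permanent llm_gate rule (one gate per endpoint, no new Semaphores) + coordinates with ask/eid (FOREGROUND). Concurrency lever = existing mlx_gate_concurrency 2→4 (continuous batching measured: 3 concurrent calls take 121s wall time ≈ a single call, ~3x faster than serial). - Switched the digest/briefing pipeline cluster loop to concurrent execution with asyncio.gather (actual concurrency is limited by the gate; rank/order preserved). - Split deep_summary (70-300s) out of the main consume_queue into a new consume_deep_queue (following the markdown/fast split precedent): removes the problem where a single deep call overran the 1-minute tick and kept the main queue permanently coalesced. - Removed the dead PIPELINE_HARD_CAP=600 (briefing/pipeline.py), updated the summarizer docstring, added deep consumer disjoint/hold tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-14 23:29:56 +00:00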
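hyungi	5dca5b5d28	ops(pipeline): split out a fast embed/chunk consumer + batch 1→10: frees embed/chunk from being held hostage by the LLM cycle Diagnosis (2026-06-12 capacity assessment): in the single loop, classify (~190s×3) occupied the cycle, and embed/chunk, at <1s per item, were capped at 1 item per cycle → effective ~580/day vs. demand of up to 2,700/day, a backlog of 3,570 + vectors for new documents not loaded (missing from RAG search). 4070 utilization 0% = purely a structural cap. Fix = following the markdown split (05-01) precedent: consume_fast_queue 1-minute job + batch 10 (conservative value for GPU sharing, cap ~14,400/day). The stage sets of the three consumers are disjoint (prevents double recovery by stale reset). Retrieval logic and the embedding model are untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 07:50:07 +09:00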
hyungi	cd0040925a	ops(pipeline): generation LLM hold gate held_stages: on hold until the Mac mini model is finalized With the MacBook LLM plan scrapped and the Mac mini model being re-decided, DS's generation LLM consumption is put on hold across the board. held = classify/summarize/deep_summary (queue; no claim occurs, attempts not consumed) + digest (04:00)/briefing (05:10) cron + study explanation/session_analysis/memo_card consumers. GPU-specific stages, collectors, and interactive paths (ask/eid chat) are unaffected. Default [] = no-op. /api/digest/regenerate returns an explicit 409 while on hold. To release = empty config held_stages and restart fastapi. exec plan: ~/.claude/plans/ds-llm-hold-exec-20260611.md Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 16:52:46 +09:00
hyungi	88e5893041	feat(workers): MacBook M5 Max offload wiring: deep slot + deferral semantics + queue_drain CLI plan ds-macbook-offload-1 P2 (Soft Lock exception recorded in ds-macbook-offload-exec-20260611.md): - config ai.models.deep optional slot (qwen-macbook via router :8890; existing path when absent) - AIClient.call_deep + is_deferrable_error + call_deep_or_defer (zero automatic cloud/Mac mini fallback) - deep_summary_worker: with a deep slot, goes via the MacBook (does not occupy the Mac mini mlx gate) + records the actual model used - StageDeferred deferral semantics: 503/connect/read-timeout (cut off by sleep) = attempts not consumed + payload.deferred_until 30-minute backoff; doc writes are a single commit after completion + parsing (zero partial writes) - queue_consumer: deferred filter on claim + StageDeferred branch - workers.queue_drain: manual burst-drain CLI (summarize/deep_summary, SKIP LOCKED single-item claim, per-item commit, run ends on deferral, guard requiring the deep slot) - 20 tests + recorded fixture of a real Qwen response via the router (13.2s live) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 12:55:16 +09:00
hyungi	1842f27d89	feat(news): crawl-24x7 cycle 2: B-2/B-3/C-1/C-2/C-3/C-5 (migrations 324-326) - Channel awareness: news_sources.source_channel (324, reuses the documents enum) → branches for document creation identity (_doc_identity), the embed/chunk 30-day gate (crawl = index everything), and the post-extract override (crawl→classify, preview skipped). - B-2 Guardian Open Platform: API dispatch (host-based branching, unknown host = explicit failure) + show-fields=bodyText full-text adapter. Live fixture recorded + call-shape tests. - B-3 subscription outlets: isolated playwright-fetcher container (concurrency 1, one browser per request, storage_state ro mount) + politeness human-speed (30-60s) browser path + fulltext auth routing (content-based probe gate; relogin_requested consumption comes before the open-skip; body paywall marker gate) + source_health probe column (325) + session capture script (for the MacBook). - C-2 KOSHA: 3 APIs live-verified with fixtures recorded (board/attach/guide): accident cases daily diff + attached PDF/HWP → extract pipeline, GUIDE incremental backfill under a daily cap (logged; silent caps forbidden). The key is concatenated directly into the URL (avoids the re-encoding pitfall). daily 06:40 KST. - C-3 static corpora: National Board 86 + TWI job-knowledge 153 bulk CLI (idempotent, politeness, crawl_raw preserved, same promotion field conventions as fulltext_worker). - C-1/C-5 seeds (326): all URLs live-verified: UK HSE (feed-full)/안전신문/Ministry of Employment and Labor (고용노동부) ×3 (rss/*.do)/OSHA/EU-OSHA (candidate)/SEP/1000-Word (feed-full)/Doing Philosophy/Aeon/Psyche (skip-video quirk). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:08:18 +09:00
hyungi	7cd8cfde0a	feat(news): crawl-24x7 group A: registry expansion, conditional GET, fulltext promotion, politeness, source_health A-3 migrations 319-323 (news_sources 9 columns + source_channel 'crawl' + process_stage 'fulltext' + source_health) A-1 conditional GET (ETag/Last-Modified re-sent as-is) + content-hash change detection, A-4 politeness core (per-domain serialization + robots + honest UA), A-2+A-7 fulltext_worker (4-tier reuse, gzip preservation of crawl_raw on the NAS, demotion path, 03:40 reconcile safety net), A-5 circuit breaker (3/10 threshold, enabled left untouched), A-6 second-pass dedup of portal reposts (title + 3 days, 12-character gate). Existing sources default to fulltext_policy='none' = no regression. plan crawl-24x7-1, exception recorded in crawl-24x7-exec1-20260610.md	2026-06-10 13:03:31 +09:00
hyungi	0854c72c70	fix(search): sync doc md_status to failed on permanent markdown queue failure marker_worker marks doc.md_status=processing when conversion starts, but if the conversion dies with an exception (e.g. ReadTimeout on a large batch) without going through _fail()/_set_skipped(), queue_consumer marks only the queue row as failed and doc.md_status stays permanently stuck at processing = orphan (queue failed, document processing). After the markdown consumer split, this orphan recurred during tail reprocessing (5149/5201), so the root cause is now blocked. In the _process_stage except block, when a queue item fails permanently (attempts>=max) and the stage is markdown with doc.md_status=processing, the status is synced to failed. While retrying (attempts<max), a pending queue row remains, so the document is not an orphan and is left untouched. Verification: synthetic permanent-failure path → md_status processing→failed sync PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:06:32 +00:00
hyungi	2edc80d4bb	fix(search): split markdown into dedicated queue consumer to prevent pipeline stall Removes the problem where large-PDF split conversion (5210 ≈ 40 minutes measured) occupied the single consume_queue coroutine and stalled the entire pipeline (extract/classify/embed/chunk, etc.). - consume_markdown_queue added: markdown-only scheduler job (id=markdown_consumer) - consume_queue handles only MAIN_QUEUE_STAGES (markdown excluded) - per-stage logic shared via the _process_stage / _load_workers helpers - reset_stale_items(stages, threshold_minutes) parameterized: main=10min (markdown excluded), markdown=MARKDOWN_STALE_MINUTES (default 120). Fixes the pitfall where marker_worker records no heartbeat, so a 40-minute conversion was misjudged as stale after 10 minutes - enqueue flow (classify -> embed,chunk,markdown) unchanged STT/deep_summary split + GPU concurrency tuning are out of scope (follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:33:45 +00:00
Hyungi Ahn	0cbba0ceeb	feat(ingest): enable devonagent track Phase 1 ingest Web pages discovered by DEVONagent/DEVONthink flow through NAS Web/ drop → file_watcher ingest → extract 4-tier fallback (trafilatura/sibling-md/readability/bs4) → embed + chunk. classify/preview/markdown are skipped. - source_channel='devonagent' (activates the value dormant since migration 001) - file_watcher: unified SCAN_TARGETS + Web/ rglob + canonical_url dedup + missing-sidecar policy (not skipped; web_meta.sidecar_missing=true flag) - extract_worker: HTML+devonagent branch + md_extraction_engine distinguishes the 4 tiers (trafilatura → sibling .md ≥200char → readability+markdownify → bs4_text) - queue_consumer: source_channel-aware override only for the extract stage in enqueue_next_stage (devonagent → [embed, chunk]) - classify_worker: devonagent safety skip (mirrors the law_monitor pattern, ai_domain='Web', ai_tags=['Web/{host}']) - requirements: added trafilatura/readability-lxml/markdownify - docs: devonthink-web-bridge.md install guide + first-wins policy stated explicitly Phase 1 closure criterion = material quality (searchable + noise rate + dedup + engine distribution). Uses (ai_tldr/digest/PKM retrospective) will be decided in a separate PR after 1-2 weeks OR 30-50 items of observation. Plan: ~/.claude/plans/db-snuggly-petal.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 21:23:16 +09:00
Hyungi Ahn	e50869cbda	feat(canonical): Phase 1B marker-service + marker_worker for PDF→markdown (222) New container marker-service (port 3300, Marker 1.10.2 + surya 0.17.1 + HF cache volume). marker_worker consumes the markdown stage queue: classify_worker → enqueue 'markdown' (leaf, independent of embed/chunk) → skip SKIP_DOC_TYPES (발주서/세금계산서/명세표, i.e. purchase order/tax invoice/statement) → skip if extension != .pdf (Phase 1B = PDF only) → skip if page_count > 200 → marker-service POST /convert → 422/404 = doc-level failed, 5xx = queue retry Stability measures: - migration 222: ALTER TYPE process_stage ADD VALUE markdown (single statement) - md_extraction_quality JSONB dict stored directly - on skip, md_content/hash cleared to NULL - /ready Response.status_code + warmup_error surfaced - HF cache volume (zero build-time downloads) - file_path is a NAS-relative path → /documents prefix prepended Success criterion: pipeline stability. Markdown quality is deferred to the Phase 1D pilot. Pre-flight (2026-05-01): - marker-pdf 1.10.2 stable - 9503 file_path values are NAS-relative paths - DOCUMENT_TYPES has 7 Korean types → SKIP aliases reinforced - queue retry max_attempts=3 + reset_stale_items confirmed - main 220/221 already taken by study_q_related → rebumped to 222 Plan: ~/.claude/plans/plan-idempotent-sundae.md (Round 5 approved) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:06:23 +00:00
Hyungi Ahn	6fdc48e5b6	feat(ai): B-1 summary tier split: triage(4B) + deep_summary(26B) Reuses the PR-A policy layer to add a tier triage path to classify_worker. Legacy ai_summary / ai_domain / ai_suggestion are kept (zero regression); tldr/bullets/detail/inconsistencies are split into separate fields. Migrations (156-160): - 156 documents: 5 columns ai_tldr, ai_bullets, ai_detail_summary, ai_inconsistencies, ai_analysis_tier - 157 'deep_summary' ADD VALUE on process_stage, on its own (works around the Postgres same-transaction restriction) - 158 processing_queue.payload JSONB (carries the envelope) - 159 tier + suppressed_reason on analyze_events - 160 suppressed_reason partial index Models/ORM: - Document: 5 Mapped columns added - ProcessingQueue: deep_summary enum extension + payload field, payload option on enqueue_stage - AnalyzeEvent: PR-A shadow 6 columns + PR-B tier/suppressed_reason Workers: - classify_worker: _run_tier_triage added after the existing legacy path. - _match_subject_domain(doc, text): determines the PR-A policy subject_domain name from source_channel + body keywords + ai_domain prefix (category matching forbidden). - R1 TriageOutput pydantic + fallback for broken JSON (triage_json_invalid). - R2 _check_backlog_guard(): suppresses soft escalation if the 30-minute window ratio > threshold OR pending exceeds the limit. Hard escalation passes through. - R3 _slice_text_ranges(): above 260k, 3 slices: head 120k + mid 20k + tail 120k. - On escalation, builds an EscalationEnvelope + enqueues deep_summary with a {envelope, subject_domain} payload. - deep_summary_worker (new): reads envelope + subject_domain from the queue payload → render_26b("p3c_deep_summary", subject_domain) + MLX call (via llm_gate Semaphore(1)) → stores ai_detail_summary + ai_inconsistencies + ai_analysis_tier='deep'. _filter_inconsistencies passes only the allowed kinds (version_drift / procedure_conflict / source_conflict / missing_basis); purchase/contract kinds are dropped. - queue_consumer: deep_summary added to the workers dict + BATCH_SIZE=1. next_stages is untouched: classify → embed/chunk stays as is, and deep_summary is an independent chain. 
Telemetry: - record_analyze_event: parameters extended with subject_domain / risk_flags / escalation_reasons / confidence / policy_version / shadow_would_route_to / tier / escalated_to_26b / suppressed_reason. The classify/deep workers record with mode="summary_triage" or "summary_deep". API: - DocumentResponse exposes the 5 fields ai_tldr / ai_bullets / ai_detail_summary / ai_inconsistencies / ai_analysis_tier. Prompts: - Only a DEPRECATED comment added to classify.txt (file kept to preserve the rollback path). - PR-A's app/prompts/policy/p3a_short_summary.txt (4B) and p3c_deep_summary.txt (26B) are used as is. My own summary_triage.txt / summary_deep.txt were duplicates, so rather than removing them in a separate commit they were deleted right away, before ever being created. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 10:22:40 +09:00
Hyungi Ahn	1e2c004dd4	feat(media): §3 audio STT + video playback infrastructure plan: ~/.claude/plans/luminous-sprouting-hamster.md §3 Schema: - migrations/147_audio_segments_table.sql: audio_segments (STT timestamp segments) - migrations/148_audio_segments_idx.sql: (document_id, start_s) idx - migrations/149_document_media_cols.sql: documents.thumbnail_path + needs_conversion - migrations/150_queue_stage_stt.sql: process_stage += 'stt' - migrations/151_queue_stage_thumbnail.sql: process_stage += 'thumbnail' - app/models/audio_segment.py, document.py (thumbnail_path/needs_conversion) Services: - services/stt/{Dockerfile, requirements.txt, server.py}: faster-whisper large-v3 GPU container. /transcribe (filePath/langs/beamSize) + /health + /ready (cuda device_count + model_loaded). NFC/NFD path resolver (lesson from OCR). - docker-compose.yml: stt-service added (1 GPU reserved, :3300, NAS ro mount, stt_models volume, start_period 300s), STT_ENDPOINT in the fastapi env. Pipeline (depends on §1 category): - app/workers/stt_worker.py new: stage='stt' pickup → call STT_ENDPOINT → store extracted_text + audio_segments. Timeout 30 minutes. - app/workers/thumbnail_worker.py new: ffmpeg grabs 1 frame at the 50% mark → PKM/Videos/.thumbs/{id}.jpg + sets thumbnail_path. needs_conversion=true is skipped. - app/workers/file_watcher.py extended: scans PKM/{Inbox, Recordings, Videos}. extension→category, audio→stage=stt, video .mp4/.webm→ stage=thumbnail, video .mov/.mkv/.avi→needs_conversion=true + no stage. Skips the settings.roon_library_path prefix. - app/workers/queue_consumer.py extended: registers the stt + thumbnail workers, BATCH_SIZE(stt=1, thumbnail=3), adds stt→[classify] to next_stages (audio skips extract). - app/Dockerfile: ffmpeg added (for the thumbnail subprocess). API (depends on §1): - /api/audio/{id}/segments: AudioSegment ORDER BY start_s - /api/video/{id}/thumbnail: thumbnail_path FileResponse (query token) - /api/documents/{id}/file: media_types includes audio/video mime (already included in the §2 commit). Starlette FileResponse handles Range automatically. - upload_document: rejects .mov/.mkv/.avi web uploads (error_code unsupported_codec). NAS drops are accepted by file_watcher as quarantined. 
Frontend: - AudioPlayer.svelte: HTML5 audio + sticky transcript segment panel + click a line to seek. activeIdx highlight. - VideoPlayer.svelte: HTML5 video direct play + needs_conversion notice card. poster is the thumbnail endpoint. - /audio (list grid) + /audio/[id] (player) - /video (thumbnail grid + conversion-needed badge) + /video/[id] (player) - Sidebar.svelte: Mic/Film icons + audio/video nav enabled, count badges (reuses §2 /stats/category-counts). Settings: - app/core/config.py: stt_endpoint + roon_library_path. DoD post-deploy smoke: /ready cuda:true, transcribe a meeting mp3, audio proceeds to classify without extract (queue regression), /audio playback, .mp4 playback, .mov web 400, .mov NAS quarantine, Sidebar nav + count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 06:47:36 +09:00
Hyungi Ahn	751cdc5be8	fix(queue): defend against duplicates on the enqueue path: partial unique index + central enqueue_stage function The existing UNIQUE(document_id, stage, status) allowed pending and processing rows to coexist, which caused conflicts during stale recovery. Blocked at the root with a 2-layer defense: 1) DB: partial unique index uq_queue_active: at most 1 active row (pending/processing) per (document_id, stage) 2) App: central enqueue_stage() function: INSERT ON CONFLICT DO NOTHING removes the check-then-insert TOCTOU race on all 9 paths migration 117 includes a guard check: if active duplicates remain, it aborts with RAISE EXCEPTION to prompt manual cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 08:37:32 +09:00
Hyungi Ahn	8ec1e53ca4	fix(queue): fix reset_stale_items UniqueViolationError halting all queue consumption When stale processing rows are bulk-UPDATEd to pending and a row with the same (document_id, stage, pending) already exists, the unique constraint violation crashes the entire APScheduler consume_queue job. Changed to a 2-step approach: 1) DELETE stale processing rows that have a pending duplicate 2) UPDATE only the rest to pending + swallow exceptions so that a stale reset failure does not kill all queue consumption Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 07:41:20 +09:00
Hyungi Ahn	b46a75758b	feat(memos): built-in memos: file-less documents (file_type='note') Builds a replacement for the Memos app into Document Server. Memos are managed as file_type='note' records in the documents table, reusing the existing AI pipeline (classify/embed/chunk/search/ask). Backend: - migration 105: source_channel 'memo', file_path allows NULL, user_tags/pinned/ask_includable columns, memo indexes - api/memos.py: 7 CRUD endpoints + #tag parsing + stale AI reset + prevention of duplicate pending queue rows - queue_consumer: notes skip extract/preview - documents API: file_path NULL guard, memos excluded from lists - search /ask: documents with ask_includable=false excluded from evidence Frontend: - /memos timeline page (quick input + feed + inline editing + tag filter) - QuickMemoButton FAB (Ctrl+M, all pages) - Sidebar memo link Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 16:00:00 +09:00
Hyungi Ahn	010e25cb23	fix(queue): metadata-based doc-level embed + NUL byte strip + empty-exception fallback embed_worker: - extracted_text[:6000] → metadata input of title + ai_summary + tags (top 5) - fixes the structural bug where the cover + table of contents of a 500k-character document were what got embedded - safe for the default Ollama context (~1500 characters or fewer), no num_ctx adjustment needed - falls back to 800 characters of body text when ai_summary < 50 characters - ai_domain excluded initially (to avoid taxonomy noise) extract_worker: - \x00 stripped on all 3 paths: kordoc / direct read / LibreOffice - prevents recurrence of asyncpg CharacterNotInRepertoireError queue_consumer: - str(e) or repr(e) or type(e).__name__ fallback - for empty-message exceptions (24 occurrences), at least the class name is recorded from now on plan: ~/.claude/plans/quiet-meandering-nova.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 13:45:55 +09:00
Hyungi Ahn	378fbc7845	feat(chunk): Phase 0.1 chunk indexing: ORM/worker/migration cleanup Formally commits the Phase 0.1 code that existed only as untracked files on the GPU server: - app/models/chunk.py: DocumentChunk ORM (includes country/source/domain metadata) - app/workers/chunk_worker.py: 6 chunking strategies (legal/news/markdown/email/long_pdf/default) - migrations/014_document_chunks.sql: pgvector + FTS + trigram indexes - app/models/queue.py: 'chunk' stage added to the ProcessingQueue enum - app/workers/queue_consumer.py: chunk stage registered, classify→[embed,chunk] chained automatically Prerequisite for the Phase 1 reranker integration work. Used for retrieval based on the document_chunks table.	2026-04-07 13:26:37 +09:00
Hyungi Ahn	49cc86db80	feat: dedicated summarize stage: AI summaries for news (without classify) - summarize_worker: generates only a summary (no classification) - queue_consumer: summarize stage added (batch 3) - news_collector: enqueues summarize + embed - 'summarize' added to the process_stage enum Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 15:00:14 +09:00
Hyungi Ahn	24142ea605	fix: 5 Codex review fixes (critical 1 + high 4) 1. [critical] config.yaml → load taxonomy from the settings object (prevents import crash) 2. [high] ODF conversion: file_path kept, derived_path as a separate field (prevents infinite duplication) 3. [high] Statute splitting: articles before the first chapter are preserved as "서문" (preamble) 4. [high] Inbox: review_status field separated (pending/approved/rejected) 5. [high] Deletion: soft-delete (deleted_at) + worker defense + active_documents view - deleted_at IS NULL applied consistently to all queries - queue_consumer: gracefully skips when the row is missing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 07:15:13 +09:00
Hyungi Ahn	6893ea132d	refactor: parallel preview trigger + remove file moves + domain color bar - queue_consumer: registers classify + preview together when extract completes - classify_worker: _move_to_knowledge() removed, files stay at their original location - DocumentCard: added a per-domain color bar (4px) on the left Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 12:31:57 +09:00
Hyungi Ahn	4bea408bbd	feat: Markdown editor + PDF conversion pipeline + viewer format branching - Markdown split editor: textarea + marked preview, Ctrl+S to save - PUT /api/documents/{id}/content: saves the original file + updates extracted_text - GET /api/documents/{id}/preview: serves the cached PDF preview - preview_worker: LibreOffice headless → PDF conversion (timeout 60s, 1 retry) - queue_consumer: preview stage added (triggered automatically after embed) - DocumentViewer: branches by format (markdown/pdf/preview-pdf/image/text/cad) - Office/CAD documents: open-in-new-tab edit button - Dockerfile: installs LibreOffice headless - migration 005: preview_status, preview_hash, preview_at columns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 10:10:03 +09:00
Hyungi Ahn	62f5eccb96	fix: isolate each worker call in independent async session Shared session between queue consumer and workers caused MissingGreenlet errors in APScheduler context. Each worker call now gets its own session with explicit commit/rollback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:29:14 +09:00
Hyungi Ahn	299fac3904	feat: implement Phase 1 data pipeline and migration - Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib, text files direct read, images flagged for OCR) - Add queue consumer with APScheduler (1min interval, stage chaining extract→classify→embed, stale item recovery, retry logic) - Add extract worker (kordoc HTTP call + direct text read) - Add classify worker (Qwen3.5 AI classification with think-tag stripping and robust JSON extraction from AI responses) - Add embed worker (GPU server nomic-embed-text, graceful failure) - Add DEVONthink migration script with folder mapping for 16 DBs, dry-run mode, batch commits, and idempotent file_path UNIQUE - Enhance ai/client.py with strip_thinking() and parse_json_response() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:35:36 +09:00
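
The retry semantics that commits c11f113cf1 and 010e25cb23 describe can be illustrated with a minimal sketch. This is not the repository's queue_consumer: the real code is async and database-backed, and `QueueItem`, `process_item`, `swallowing_worker`, and `raising_worker` are hypothetical names used only for illustration. Only the status values, the attempts/max_attempts idea, and the `str(e) or repr(e) or type(e).__name__` fallback come from the commit messages.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QueueItem:
    """Hypothetical in-memory stand-in for a processing_queue row."""
    stage: str
    attempts: int = 0
    max_attempts: int = 3
    status: str = "pending"
    error: Optional[str] = None


def process_item(item: QueueItem, worker_fn: Callable[[QueueItem], None]) -> None:
    """Run one attempt; a raised exception means retry or dead-letter."""
    item.status = "processing"
    item.attempts += 1
    try:
        worker_fn(item)
    except Exception as e:
        # Fallback chain quoted from commit 010e25cb23, so an exception
        # with an empty message still leaves something in the error field.
        item.error = str(e) or repr(e) or type(e).__name__
        # Retry while attempts remain; otherwise park the item as failed.
        item.status = "failed" if item.attempts >= item.max_attempts else "pending"
        return
    # A normal return is the only signal the consumer has for success.
    item.status = "completed"


def swallowing_worker(item: QueueItem) -> None:
    """Anti-pattern: the transient failure is swallowed, so it looks like success."""
    try:
        raise TimeoutError("upstream LLM timed out")
    except TimeoutError:
        return


def raising_worker(item: QueueItem) -> None:
    """Pattern adopted in c11f113cf1: let the failure propagate."""
    raise TimeoutError("upstream LLM timed out")
```

With `swallowing_worker`, a single call leaves the item at `completed` even though nothing was produced. With `raising_worker`, the item returns to `pending` twice and becomes `failed` on the third attempt, where it stays visible for follow-up.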

24 Commits
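
The two-layer duplicate defense from commit 751cdc5be8 (partial unique index plus INSERT ON CONFLICT DO NOTHING) can be tried out in isolation. The sketch below is an assumption-laden illustration, not the project's migration 117 or its `enqueue_stage()`: it uses SQLite from the Python standard library in place of Postgres (requires SQLite 3.24 or newer for the upsert clause), and the table layout and `make_queue` helper are hypothetical. Only the index name `uq_queue_active`, the `(document_id, stage)` key, and the pending/processing condition are taken from the commit message.

```python
import sqlite3


def make_queue(conn: sqlite3.Connection) -> None:
    """Create a simplified queue table (hypothetical schema)."""
    conn.execute(
        """
        CREATE TABLE processing_queue (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            stage TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
        )
        """
    )
    # Layer 1: at most one active row per (document_id, stage).
    # Completed and failed rows are outside the index predicate.
    conn.execute(
        """
        CREATE UNIQUE INDEX uq_queue_active
        ON processing_queue (document_id, stage)
        WHERE status IN ('pending', 'processing')
        """
    )


def enqueue_stage(conn: sqlite3.Connection, document_id: int, stage: str) -> bool:
    """Layer 2: insert without a prior existence check.

    Returns True if a row was inserted, False if an active row already
    existed and the insert was silently skipped.
    """
    cur = conn.execute(
        "INSERT INTO processing_queue (document_id, stage) VALUES (?, ?) "
        "ON CONFLICT DO NOTHING",
        (document_id, stage),
    )
    return cur.rowcount == 1
```

Because the database arbitrates the conflict, two callers racing to enqueue the same stage cannot both succeed, which is the check-then-insert gap the commit closes. Once the active row is marked completed, the same (document_id, stage) pair can be enqueued again.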