hyungi_document_server

Author	SHA1	Message	Date
hyungi	7d882352b8	fix(mineru): 변환/워밍 self-timeout + OOM·행 시 엔진 재워밍 escalate aio_do_parse 에 자체 타임아웃이 없어 vLLM 행 시 _engine_lock 을 영구 점유 → markdown 변환 전체 마비(컨테이너 재시작 전까지). 클라이언트(marker_worker)는 300s 로 포기하나 서버측 inflight 는 자동 취소 안 됨. - _run_mineru 를 asyncio.wait_for(convert 600s / warmup 1200s)로 감싸 lock 점유 상한. - 타임아웃·OOM/CUDA 류 실패 시 _warmup_done 리셋 → 다음 요청 재워밍. 재워밍도 실패하면 _warmup_error → /ready 503 → healthcheck 재시작으로 escalate(영구 degradation 차단). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 13:22:36 +09:00
hyungi	a77ac38e92	feat(extraction): 컷오버 Phase 2 — marker-service 제거 (MinerU 단독) 읽기뷰 회귀 0 확인(doc 39464 재처리 → engine=mineru success, 71 imgs, docimg ref/NAS persist 정상) 후 marker 제거. compose 에서 marker-service 블록 + fastapi depends_on + marker_models 볼륨 + services/marker/ 소스 삭제. 롤백 = git history + ~/.local/share/marker-decommission-backups. 마크다운 엔진 = mineru-service 단독. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 16:27:26 +09:00
hyungi	bb929f88d0	feat(extraction): MinerU 2.5 VLM 추출 서비스 + 워커 엔드포인트 env화 marker-service(Surya, ~10GB) 대체 후보. MinerU2.5-Pro-2605-1.2B VLM(vllm-async-engine, ~5.9GB 고정). marker /convert 계약 복제(file_path·start/end·md+base64 images) → 워커는 MARKER_ENDPOINT env 플립만으로 전환. 단일카드(16GB) 검색스택 공존, 40p 윈도우 무변. - services/mineru: Dockerfile(vllm/vllm-openai:v0.21.0 + mineru[core]) + async server.py (NFC/NFD 한글경로 resolver, PyMuPDF page 슬라이스, gpu_memory_utilization 캡) - docker-compose: mineru-service profile-gated(기본 미기동=marker 무영향) + mineru_models vol - marker_worker: MARKER_ENDPOINT 하드코딩 → env(기본 marker, 무변) 격리 PoC A/B 8/8 게이트 PASS (한국어/표/수식LaTeX/heading/figure/40p VRAM). 컷오버(env 플립+marker 제거)는 별 단계(읽기뷰 회귀 0 게이트). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-18 15:58:55 +09:00
hyungi	f3530e382d	fix(services): playwright-fetcher CF JS 챌린지 통과 대기 — aiche.org 인터스티셜 스냅샷 함정 (검증 게이트 발견) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:23:58 +09:00
hyungi	8583465c58	feat(news): crawl-24x7 사이클 3 — B-4 시그널·C-4 공학 지속·CSB sitemap·CCPS Beacon (마이그 327) - B-4 fetch_method='signal-only': 페이지 fetch 0 + summarize 스킵(검색 색인만, 맥미니 부하 0) + 본문 무절단(_entry_body — arXiv 초록 1.6K 보존). 다이제스트는 ai_summary NULL 제외 규칙으로 자연 배제. 레지스트리 오설정(page) 방어 가드. - 시드 9 소스 (전 URL 2026-06-11 live 검증): Bloomberg Markets/Technology(skip-video, 비디오 혼재 실측)·Economist Latest·Nikkei Asia(RDF — feedparser 네이티브, 분기 불요 fixture 박제)·ASME JPVT(site_1000037 실측 매핑)·arXiv 2종·IEEE Spectrum 2종(feed-full, 피드 description 이 전문 7.9~14K자 실측). - csb_collector: sitemap lastmod diff (weekly 월 06:50) — 워터마크(selector_override) + cap 40/회 점진 백필 + diff sanity 300 + 보고서 PDF(/assets/, recommendation 제외) → extract 파이프라인. 초기 일괄 = CLI --bulk. - api_standards_collector: 공지 목록 링크 파싱(실측 — 페이지 diff 아님, 상세 URL 10건/페이지) → 신규 상세만 ingest (monthly 5일 07:05). 초기 백필 = CLI --bulk. - ccps_collector: aiche.org 평문 403(UA 무관 실측) → playwright-fetcher 익명 컨텍스트 + referer 쿠키 승계 /download(base64) 신설로 월간 Beacon PDF (monthly 5일 07:20). 헤드리스 차단 시 CrawlBlocked → health 가시화 (르몽드 PARK 선례). - B-5 잔여: rdf/feed-reader-UA = 코드 분기 불요 실측 박제 (Economist 는 Archiver UA 200). table-strip/gn-redirect 는 해당 소스 미진입 — 백로그 유지. - 테스트 24건 신규 (fixture 9건 live 박제, economist/ieee 는 item trim) — 39 passed. - 마이그 327 단일 statement (PKM 트랙과 번호 경합 주의 — 327 본 트랙 선점). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:13:17 +09:00
hyungi	251a5392ef	fix(services): playwright-fetcher pwuser 실행 — root Chromium sandbox 함정 회피 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:11:03 +09:00
hyungi	1842f27d89	feat(news): crawl-24x7 사이클 2 — B-2/B-3/C-1/C-2/C-3/C-5 (마이그 324-326) - 채널 인지화: news_sources.source_channel(324, documents enum 재사용) → 문서 생성 정체성(_doc_identity)·embed/chunk 30일 게이트(crawl=전량 색인)· extract 후속 override(crawl→classify, preview 스킵) 분기. - B-2 Guardian Open Platform: API 디스패치(호스트 분기, 미지 호스트=명시 실패) + show-fields=bodyText 전문 어댑터. fixture live 박제 + call-shape 테스트. - B-3 구독지: playwright-fetcher 격리 컨테이너(동시 1·요청당 브라우저·storage_state ro mount) + politeness 사람속도(30-60s) 브라우저 경로 + fulltext 인증 라우팅 (내용 기반 probe 게이트·relogin_requested 소비=open-스킵보다 앞·본문 페이월 마커 게이트) + source_health probe 컬럼(325) + 세션 박제 스크립트(맥북용). - C-2 KOSHA: 3 API live 검증·fixture 박제(board/attach/guide) — 재해사례 daily diff +첨부 PDF/HWP→extract 파이프라인, GUIDE 일일 cap 점진 백필(silent cap 금지 로그). 키는 URL 직결합(재인코딩 함정 회피). daily 06:40 KST. - C-3 정적 코퍼스: National Board 86 + TWI job-knowledge 153 일괄 CLI(멱등·politeness ·crawl_raw 보존·fulltext_worker 승격 필드 규약 동일). - C-1/C-5 시드(326): 전 URL live 검증 — UK HSE(feed-full)/안전신문/고용노동부 3종 (rss/*.do)/OSHA/EU-OSHA(후보)/SEP/1000-Word(feed-full)/Doing Philosophy/Aeon/Psyche (skip-video quirk). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:08:18 +09:00
hyungi	3df0ca53ab	feat(services): crawl-24x7 A-8 헬스 패널 + D-1 stt/marker idle-unload A-8 1차: crawl-health 컨테이너(100.110.63.63:8765 Tailscale 바인딩 전용, 읽기 전용 SELECT, caddy 라우트 금지). D-1 전제 작업: STT_PRELOAD=0+30분 유휴 해제(lock+inflight+reaper), marker MARKER_PRELOAD=0+idle-unload, /ready idle=200(503=warmup_failed 한정 — fastapi depends_on 정합), healthcheck cuda 기준 전환.	2026-06-10 13:03:31 +09:00
hyungi	2528996dee	feat(marker): support page-range conversion in /convert ConvertRequest.start_page/end_page (1-based inclusive); per-request PdfConverter with config page_range, reuses loaded models. 1-based->0-based contained in marker adapter. PR-DocSrv-LargeDoc-Split-Markdown-1 commit 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 07:01:34 +00:00
Hyungi Ahn	98ee7dffe2	ops(gpu-health): GPU 서비스 health/smoke 표준화 + synthetic VRAM 피크 가드 PR-GPU-Health-1. 운영 준비성 표준화 PR (모델 성능 개선 아님). - OCR /smoke endpoint 추가 (160x60 OK PNG in-memory, 200/503 분기, Docker healthcheck 미사용) - marker /health endpoint 추가 (stt/ocr 동일 시그니처) - reranker docker-compose healthcheck 추가 (TEI :80/health) - scripts/gpu_service_smoke.sh: docker exec 표준 점검 (OCR/STT expose-only) - scripts/gpu_vram_fixture.sh: Mode A sequential + Mode B light overlap + --stress 옵션 - tests/load/fixtures/: synthetic ocr_ok.png / sine_30s.wav / lorem_1p.pdf OCR 빈 응답 false negative — root cause: ports 미매핑. 결정: ocr-service / stt-service 는 expose-only 유지, 운영 점검은 docker exec 내부 curl 표준. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:42:07 +09:00
Hyungi Ahn	68fa86ea52	feat(markdown): persist extracted images with auth routes Markdown Canonical Phase 1B.5 — marker 가 추출하던 이미지를 NAS 에 영구 저장하고 DB 메타 + 인증 라우트 + 프론트 swap 까지 wiring. 핵심 변경: - marker-service /convert 응답에 base64 image 리스트 포함 (stateless 유지, NAS write 권한 X) - marker_worker 가 NAS `/documents/extracted_images/{doc_id}/` 에 persist + UPSERT + 고아 row DELETE + md_content ref 를 `docimg:img_NNN` stable scheme 으로 정규화 - /api/documents/{id}/images/{key}/raw 인증 라우트 (Cache-Control private + ETag = content_hash) - frontend MarkdownDoc 가 placeholder card 안의 docimg ref 를 실제 <img> 로 swap 원칙: - 이미지 binary = NAS, metadata = Postgres (학습 섹션 패턴 동일) - image_key sequence 기반 결정적 → 재변환 idempotent - MARKDOWN_IMAGE_PERSIST=false env 로 rollback 가능 (placeholder card 폴백 자연 유지) 기존 28건 marker success 문서는 본 PR 에서 건드리지 않음 — deploy + 신규 업로드 1건 + sample 5건 검증 후 scripts/marker_reprocess_existing_success.py 로 targeted reprocess. plan: ~/.claude/plans/piped-humming-crystal.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 14:05:41 +09:00
Hyungi Ahn	e66addf975	fix(canonical): marker engine_version via importlib.metadata marker module 이 __version__ attribute 를 노출하지 않아 ship gate 10 에서 engine_version="unknown" 으로 표시되던 cosmetic 문제. importlib.metadata. version("marker-pdf") 로 패키지 버전 정확히 읽음. 테스트: ship gate 10 PASS 확인 후 재배포. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:19:46 +00:00
Hyungi Ahn	e50869cbda	feat(canonical): Phase 1B marker-service + marker_worker for PDF→markdown (222) 신규 컨테이너 marker-service (port 3300, Marker 1.10.2 + surya 0.17.1 + HF cache volume). marker_worker 가 markdown stage 큐 소비: classify_worker → enqueue 'markdown' (leaf, embed/chunk 와 독립) → SKIP_DOC_TYPES (발주서/세금계산서/명세표) 스킵 → 확장자 != .pdf 스킵 (Phase 1B = PDF only) → page_count > 200 스킵 → marker-service POST /convert → 422/404 = doc-level failed, 5xx = queue retry 안정성 장치: - migration 222: ALTER TYPE process_stage ADD VALUE markdown (단일 statement) - md_extraction_quality JSONB dict 직접 저장 - skip 시 md_content/hash NULL 클리어 - /ready Response.status_code + warmup_error 가시화 - HF cache volume (build-time download 0) - file_path 는 NAS 상대경로 → /documents prefix prepend 성공 기준: 파이프라인 안정성. markdown 품질은 Phase 1D pilot. Pre-flight (2026-05-01): - marker-pdf 1.10.2 stable - file_path 9503건 NAS 상대경로 - DOCUMENT_TYPES 한국어 7종 → SKIP alias 보강 - queue retry max_attempts=3 + reset_stale_items 확인 - main 220/221 study_q_related 선점 → 222 rebump Plan: ~/.claude/plans/plan-idempotent-sundae.md (Round 5 approved) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 00:06:23 +00:00
Hyungi Ahn	cec464ae2d	fix(media): §3 ship-readiness — stt preload + healthcheck + queue enum + dashboard queue_lag stt: - services/stt/server.py: lazy → eager preload in FastAPI lifespan. STT_PRELOAD=0 으로 lazy 강제 가능 (개발/테스트). preload 실패해도 프로세스는 살아 있고 /ready false 로 남아 healthcheck 가 unhealthy 처리. - docker-compose.yml: healthcheck /health → /ready. /health 는 단순 liveness 라 모델 미적재 상태도 healthy 로 잡혀 운영 신호 부적합. queue ORM: - app/models/queue.py: process_stage enum 에 'stt'/'thumbnail' 추가 + create_type=False (migration 150/151 가 DB enum 확장 담당). 이게 없으면 stt_worker INSERT 시 SQLAlchemy 가 enum value 를 거부. dashboard 강화 (§4 선제, §3 신규 stage 까지 자동 커버): - app/api/dashboard.py: category_counts + library_pending_suggestions + queue_lag (stage 별 pending/processing/failed + oldest_pending_age_sec). - frontend/src/lib/stores/system.ts: QueueLag 타입 + DashboardSummary 확장. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 07:04:52 +09:00
Hyungi Ahn	1e2c004dd4	feat(media): §3 audio STT + video 재생 인프라 plan: ~/.claude/plans/luminous-sprouting-hamster.md §3 스키마: - migrations/147_audio_segments_table.sql: audio_segments (STT 타임스탬프 세그먼트) - migrations/148_audio_segments_idx.sql: (document_id, start_s) idx - migrations/149_document_media_cols.sql: documents.thumbnail_path + needs_conversion - migrations/150_queue_stage_stt.sql: process_stage += 'stt' - migrations/151_queue_stage_thumbnail.sql: process_stage += 'thumbnail' - app/models/audio_segment.py, document.py (thumbnail_path/needs_conversion) 서비스: - services/stt/{Dockerfile, requirements.txt, server.py} — faster-whisper large-v3 GPU 컨테이너. /transcribe (filePath/langs/beamSize) + /health + /ready (cuda device_count + model_loaded). NFC/NFD 경로 resolver (OCR 교훈). - docker-compose.yml: stt-service 추가 (GPU 1 예약, :3300, NAS ro mount, stt_models volume, start_period 300s), fastapi env 에 STT_ENDPOINT. 파이프라인 (의존 §1 category): - app/workers/stt_worker.py 신규: stage='stt' pickup → STT_ENDPOINT 호출 → extracted_text + audio_segments 저장. Timeout 30분. - app/workers/thumbnail_worker.py 신규: ffmpeg 50% 지점 1장 → PKM/Videos/.thumbs/{id}.jpg + thumbnail_path 세팅. needs_conversion=true 는 skip. - app/workers/file_watcher.py 확장: PKM/{Inbox, Recordings, Videos} 스캔. 확장자→category, audio→stage=stt, video .mp4/.webm→ stage=thumbnail, video .mov/.mkv/.avi→needs_conversion=true + stage 없음. settings.roon_library_path prefix skip. - app/workers/queue_consumer.py 확장: stt + thumbnail workers 등록, BATCH_SIZE(stt=1, thumbnail=3), next_stages 에 stt→[classify] 추가 (audio 는 extract 건너뜀). - app/Dockerfile: ffmpeg 추가 (썸네일 subprocess 용). API (의존 §1): - /api/audio/{id}/segments — AudioSegment ORDER BY start_s - /api/video/{id}/thumbnail — thumbnail_path FileResponse (쿼리 토큰) - /api/documents/{id}/file: media_types 에 audio/video mime 포함 (§2 커밋에 이미 포함). Starlette FileResponse 가 Range 자동. - upload_document: .mov/.mkv/.avi 웹 업로드 거부 (error_code unsupported_codec). NAS 드롭은 file_watcher 가 quarantine 수용. 프론트: - AudioPlayer.svelte: HTML5 audio + 전사 세그먼트 sticky 패널 + 줄 클릭 seek. activeIdx 하이라이트. - VideoPlayer.svelte: HTML5 video direct play + needs_conversion 안내 카드. poster 는 thumbnail endpoint. - /audio (목록 grid) + /audio/[id] (플레이어) - /video (썸네일 grid + 변환 필요 배지) + /video/[id] (플레이어) - Sidebar.svelte: Mic/Film 아이콘 + audio/video 네비 활성, count 배지 (§2 /stats/category-counts 재사용). 설정: - app/core/config.py: stt_endpoint + roon_library_path. DoD 배포 후 smoke: /ready cuda:true, 회의 mp3 transcribe, audio extract 없이 classify 진행(queue 회귀), /audio 재생, .mp4 재생, .mov 웹 400, .mov NAS quarantine, Sidebar 네비 + count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 06:47:36 +09:00
Hyungi Ahn	e861784c86	fix(ocr): align torch/transformers with native venv (0.17.1 호환 확인된 조합) 이전 base image (pytorch/pytorch:2.5.1-cuda12.4) 가 surya-ocr 0.17.1 설치 시 torch 2.11.0 (PyPI CPU wheel) 로 업그레이드되지만 torchvision 0.20.1+cu124 는 유지돼 ABI 불일치 (torchvision::nms does not exist) → OCR 전체 실패. native /opt/surya-ocr/venv 에서 검증된 조합으로 복제: - python:3.12-slim base - torch 2.11.0+cu126 / torchvision 0.26.0+cu126 (PyTorch cu126 index 고정) - transformers 4.57.6 (5.x 는 surya detection.processor import 에서 실패) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:59:15 +09:00
Hyungi Ahn	f8f72ceae2	fix(ocr): Surya 0.17 API + NFC/NFD path normalize - services/ocr/server.py: surya 0.17.x predictors 기반으로 재작성 (구 `from surya.ocr import run_ocr` 제거됨 → import error → 빈 텍스트 반환) - NFC(DB 경로) vs NFD(NFS 파일시스템) 한글 정규화 mismatch 보정 - surya-ocr 버전 0.17.1 고정 (0.6~1.0 범위는 breaking change 노출) - AIClient.ocr() NotImplementedError 제거 (호출처 0건, extract_worker 가 ocr-service HTTP 호출을 직접 사용) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:52:19 +09:00
Hyungi Ahn	7883ac67b3	feat(ocr): Surya OCR 마이크로서비스 추가 GPU 가속 OCR (Surya, Apache 2.0) 별도 컨테이너로 추가. 스캔 PDF/이미지 파일의 텍스트 추출 지원. - services/ocr: Dockerfile + server.py + requirements.txt - /health (liveness) + /ready (readiness, CUDA+모델 상태) - /ocr: 페이지 단위 스트리밍 처리 (메모리 피크 억제) - docker-compose: ocr-service + GPU reservation + ocr_models 볼륨 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:03:55 +09:00
Hyungi Ahn	2a240cb9e9	fix(kordoc): adaptive parse timeout + 동시 파싱 제한 kordoc의 30초 하드 타임아웃을 파일 크기 비례 adaptive(60~300초)로 변경. 대형 PDF/HWP가 파싱 타임아웃으로 영구 실패하던 문제 해결. - getParseTimeoutMs(): 10MB당 60초, 최소 60초, 최대 300초 - parseJobs Map 기반 동시 파싱 2건 제한 (유령 작업 누적 방지) - 상세 로그: START/DONE/ZOMBIE_DONE/REJECTED + ext/size/elapsed/active - clearTimeout으로 정상 완료 시 불필요한 타이머 콜백 정리 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:00:12 +09:00
Hyungi Ahn	2dfb05e653	fix: convert kordoc service to ESM (kordoc requires ESM import) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:38:34 +09:00
Hyungi Ahn	299fac3904	feat: implement Phase 1 data pipeline and migration - Implement kordoc /parse endpoint (HWP/HWPX/PDF via kordoc lib, text files direct read, images flagged for OCR) - Add queue consumer with APScheduler (1min interval, stage chaining extract→classify→embed, stale item recovery, retry logic) - Add extract worker (kordoc HTTP call + direct text read) - Add classify worker (Qwen3.5 AI classification with think-tag stripping and robust JSON extraction from AI responses) - Add embed worker (GPU server nomic-embed-text, graceful failure) - Add DEVONthink migration script with folder mapping for 16 DBs, dry-run mode, batch commits, and idempotent file_path UNIQUE - Enhance ai/client.py with strip_thinking() and parse_json_response() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:35:36 +09:00
Hyungi Ahn	131dbd7b7c	feat: scaffold v2 project structure with Docker, FastAPI, and config 동작하는 최소 코드 수준의 v2 스캐폴딩: - docker-compose.yml: postgres, fastapi, kordoc, frontend, caddy - app/: FastAPI 백엔드 (main, core, models, ai, prompts) - services/kordoc/: Node.js 문서 파싱 마이크로서비스 - gpu-server/: AI Gateway + GPU docker-compose - frontend/: SvelteKit 기본 구조 - migrations/: PostgreSQL 초기 스키마 (documents, tasks, processing_queue) - tests/: pytest conftest 기본 설정 - config.yaml, Caddyfile, credentials.env.example 갱신 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 10:20:15 +09:00

22 Commits