hyungi_document_server

Author	SHA1	Message	Date
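hyungi	01db4816fd	feat(workers): drain tolerance for consecutive deferrals: absorb network flaps (--defer-retries/--defer-wait) Observed origin: a ~10-minute flap on the Tailscale direct path (13:25~13:34) ended a 300-item run early at 32 items. The deferral semantics themselves were fine (no data damage); only run persistence is reinforced. Retries at 120s intervals for up to 5 consecutive deferrals; reaching the limit is judged as sleep and ends the run. The counter resets on success. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 13:42:10 +09:00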
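hyungi	e7c7a2091f	fix(workers): add router 502/504 to the deferral classification: upstream disconnects surface as 502 when going through the router Measured in llm_router.py: upstream connection failure / disconnect during generation = HTTPException 502 (4 places). This is the actual surface of a MacBook sleep disconnect, so classifying 503 alone misses deferrals → extended to 502/503/504. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 13:00:55 +09:00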
hyungi	88e5893041	feat(workers): MacBook M5 Max offload wiring: deep slot + deferral semantics + queue_drain CLI plan ds-macbook-offload-1 P2 (Soft Lock exception recorded in ds-macbook-offload-exec-20260611.md): - config ai.models.deep optional slot (qwen-macbook via router :8890; existing path when absent) - AIClient.call_deep + is_deferrable_error + call_deep_or_defer (zero automatic cloud/Mac mini fallback) - deep_summary_worker: goes through the MacBook when the deep slot is set (does not occupy the Mac mini mlx gate) + records the actual model - StageDeferred deferral semantics: 503/connect/read-timeout (sleep disconnect) = attempts not consumed + payload.deferred_until 30-minute backoff; doc writes are a single commit after completion + parsing (zero partial writes) - queue_consumer: deferred filter on claim + StageDeferred branch - workers.queue_drain: manual burst-drain CLI (summarize/deep_summary, SKIP LOCKED single-item claim, per-item commit, run ends on deferral, guard requiring the deep slot) - 20 tests + captured fixture of a real Qwen response via the router (13.2s live) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 12:55:16 +09:00
hyungi	cd06ef0403	feat(eid): Eid chat surface: /api/eid/chat SSE streaming + /chat page (P1) - compose: register the eid_chat surface (persona+rules, free prose) + rules_present() live check (D-6 fail-closed) - EidAIClient.call_stream: closed mode mapping (daily→mac-mini-default / deep→qwen-macbook), via router, MLX gate (FOREGROUND) + wall-clock 300s deadline, SSE line relay (model→mode substitution, usage removed), router 400 fail-loud, error_reason allowlist sanitize - POST /api/eid/chat: JWT, role=system rejected with 422, caps of 8000 chars / 40 turns / 32000 total, 503 error_reason (ask convention), no body logging - frontend /chat: Eid surface grammar (daily/deep; model and machine names not exposed), SSE parser (boundary buf, flush, [DONE]), error_reason UX, 8000-char pre-block + 422 contamination guard, localStorage history (removed on logout), nav registration - Caddyfile: an explicit encode match excludes text/event-stream from gzip buffering - tests: 32+ new (fixture: 26B/27B SSE captured via router), tests/eid 61 + ask regression 9 = 70 passed - adversarial review with 3 lenses, 18 findings, 13/13 applied. Deployment waits on the D26 gate (fix/hwp merge + Soft Lock) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 11:16:44 +09:00
hyungi	321d997123	fix(news): raise connection retries to 2: drops are random per connection (a single retry was also observed being hit consecutively) + repr for empty error logs Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:54:13 +09:00
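hyungi	b75307b89b	fix(news): retry connection-layer (TCP/TLS) errors once: intermittent first-handshake drops by MOEL security equipment (re-measured diagnosis) On the GPU host's line the first TLS connection to moel.go.kr is intermittently dropped (curl rc=35, immediate retry succeeds 5/5, does not occur on the MacBook, single A record) → feeds fetched once per cycle accumulate ConnectError(''), and the circuit for the legislative/administrative pre-announcement feed opens. Only ConnectError/ConnectTimeout are retried, once, after 1.5s; HTTP status errors are excluded. 3 regression tests (42 passed). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:43:05 +09:00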
hyungi	f3530e382d	fix(services): playwright-fetcher waits for the CF JS challenge to pass: aiche.org interstitial snapshot trap (found by the verification gate) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:23:58 +09:00
hyungi	8583465c58	feat(news): crawl-24x7 cycle 3: B-4 signals, C-4 engineering (ongoing), CSB sitemap, CCPS Beacon (migration 327) - B-4 fetch_method='signal-only': zero page fetches + summarize skipped (search index only, zero Mac mini load) + body not truncated (_entry_body keeps the 1.6K arXiv abstract). The digest excludes these naturally through the ai_summary NULL exclusion rule. Defensive guard against registry misconfiguration (page). - 9 seed sources (all URLs verified live on 2026-06-11): Bloomberg Markets/Technology (skip-video; mixed video measured), Economist Latest, Nikkei Asia (RDF: native in feedparser, no branch needed, fixture captured), ASME JPVT (site_1000037 mapping measured), 2 arXiv feeds, 2 IEEE Spectrum feeds (feed-full; the feed description was measured to be the full text, 7.9~14K chars). - csb_collector: sitemap lastmod diff (weekly, Mon 06:50): watermark (selector_override) + cap of 40 per run for gradual backfill + diff sanity 300 + report PDFs (/assets/, recommendation excluded) → extract pipeline. Initial bulk load = CLI --bulk. - api_standards_collector: parses links from the announcement list (measured: not a page diff; 10 detail URLs per page) → ingests only new detail pages (monthly, 5th 07:05). Initial backfill = CLI --bulk. - ccps_collector: aiche.org returns 403 to plain requests (measured, regardless of UA) → monthly Beacon PDF via a new playwright-fetcher anonymous context + /download (base64) that carries over referer cookies (monthly, 5th 07:20). When headless is blocked, CrawlBlocked → visible in health (Le Monde PARK precedent). - B-5 remainder: rdf/feed-reader-UA = measured and recorded as needing no code branch (Economist returns 200 to the Archiver UA). table-strip/gn-redirect have no applicable source yet, so they stay in the backlog. - 24 new tests (9 fixtures captured live; economist/ieee item-trimmed): 39 passed. - migration 327 is a single statement (beware number contention with the PKM track; this track claims 327 first). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 07:13:17 +09:00
hyungi	f4e5db9723	fix(news): bug that mistook 304 for a redirect: is_redirect → has_redirect_location httpx's Response.is_redirect is True for every 3xx (including 304 Not Modified), so a 304 from a conditional GET caused the same URL, which carries no location, to be re-requested 3 times and then fail with 'redirect exceeded 3 times' → a systematic bug that wiped out stable feeds sending ETag/Last-Modified (SEP/HSE/OSHA/philosophy RSS, etc.) from the 2nd cycle on. - Moved 304 handling ahead of the redirect loop. - Replaced the redirect check with has_redirect_location (= a true redirect carrying a location header). Fixed the same trap in both news_collector._fetch_rss and crawl_politeness.fetch_page. - The cycle 1 pilot (Kyunghyang) never received a 304, so the bug stayed latent and appeared at the first 304 from a stable feed. - 3 regression tests (304 non-redirect / true redirect / code-pattern audit). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 06:32:15 +09:00
hyungi	69db9bcb94	fix(news): gate that identifies anti-bot challenge pages: blocks DataDome corruption (B-3 measured) For Le Monde articles, the DataDome Client Challenge (316 chars) passes the 200-char body floor → risk of silent corruption where challenge HTML is promoted to article body. Added a challenge-marker gate to fetch_page_via_browser → CrawlBlocked (degrade = keep the RSS summary). This is headless detection, so retrying is pointless. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:04:11 +09:00
hyungi	61e5a416d0	fix(news): fetch_page parameter for allowed content types: collect the TWI sitemap (text/xml) (found by the verification gate) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:41:30 +09:00
hyungi	cdf4ee0ef6	fix(news): category mapping for Guardian sectionName 'World news' (self-review) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 16:37:22 +09:00
hyungi	1842f27d89	feat(news): crawl-24x7 cycle 2: B-2/B-3/C-1/C-2/C-3/C-5 (migrations 324-326) - Channel awareness: news_sources.source_channel (324, reuses the documents enum) → branches for document creation identity (_doc_identity), the embed/chunk 30-day gate (crawl = index everything), and the post-extract override (crawl→classify, preview skipped). - B-2 Guardian Open Platform: API dispatch (host branch; unknown host = explicit failure) + show-fields=bodyText full-text adapter. Fixture captured live + call-shape test. - B-3 subscription papers: playwright-fetcher isolated container (concurrency 1, one browser per request, storage_state ro mount) + politeness human-speed (30-60s) browser path + fulltext auth routing (content-based probe gate; relogin_requested consumption precedes the open-skip; body paywall-marker gate) + source_health probe column (325) + session capture script (for the MacBook). - C-2 KOSHA: 3 APIs verified live, fixtures captured (board/attach/guide): daily diff of accident cases + attached PDF/HWP → extract pipeline, GUIDE gradual backfill under a daily cap (logged; silent caps forbidden). The key is concatenated directly into the URL (avoids the re-encoding trap). Daily 06:40 KST. - C-3 static corpus: National Board 86 + TWI job-knowledge 153 via a bulk CLI (idempotent, politeness, crawl_raw preserved, same promoted-field convention as fulltext_worker). - C-1/C-5 seeds (326): all URLs verified live: UK HSE (feed-full) / Safety News (안전신문) / Ministry of Employment and Labor, 3 feeds (rss/*.do) / OSHA / EU-OSHA (candidate) / SEP / 1000-Word (feed-full) / Doing Philosophy / Aeon / Psyche (skip-video quirk). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:08:18 +09:00
hyungi	53a30449e2	fix(news): align the crawl_politeness logger with setup_logger: INFO wait logs become visible Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:47:18 +09:00
hyungi	ab668d7990	fix(news): strip CHAR(64) padding from crawl_raw file names + politeness wait log - The actual documents.file_hash column is character(64): a 32-char hash is space-padded, so the gz file name contained 32 spaces (1 case measured in production). Stripped in _raw_html_path. - One wait-log line for the silent sleep in _respect_domain_rate (verification gate, operational visibility). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:43:29 +09:00
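hyungi	dcf99b377e	fix(news): apply adversarial review: reconcile auto-correlation, persist watermark after validation, collection lock - fulltext_worker.reconcile_unresolved: the EXISTS subquery now uses aliased(ProcessingQueue). Auto-correlation removed every FROM, raising InvalidRequestError on each run (the safety net was dead code). Reproduced, and the fix confirmed, by compiling with SQLAlchemy 2.0.50. - news_collector._fetch_rss: moved ETag/Last-Modified/content-hash persistence to after bozo parse validation, which prevents a permanent 304-skip when a watermark is stored for a corrupt response. - news_collector.run: a module lock blocks a manual collect from running concurrently with the 6h schedule, sealing the race in which a uq_source_health_source_id violation from concurrent INSERTs in _get_or_create_health kills the whole cycle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:34:46 +09:00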
hyungi	7cd8cfde0a	feat(news): crawl-24x7 group A: registry expansion, conditional GET, fulltext promotion, politeness, source_health A-3 migrations 319-323 (9 news_sources columns + source_channel 'crawl' + process_stage 'fulltext' + source_health) A-1 conditional GET (ETag/Last-Modified resent verbatim) + content-hash change detection, A-4 politeness core (per-domain serialization + robots + honest UA), A-2+A-7 fulltext_worker (4-tier reuse, NAS crawl_raw gzip preservation, degrade path, 03:40 reconcile safety net), A-5 circuit breaker (3/10 thresholds, enabled untouched), A-6 second-stage dedup of portal reposts (title + 3 days, 12-char gate). Existing sources default to fulltext_policy='none' = no regression. plan crawl-24x7-1, exception recorded in crawl-24x7-exec1-20260610.md	2026-06-10 13:03:31 +09:00
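hyungi	acd595244a	fix(news): unify URL dedup normalization for store and lookup + tolerance of multiple matches Resolves the per-cycle MultipleResultsFound on BBC Technology (since 06-04). - Storing edit_url=raw while looking up the normalized form was asymmetric, which disabled URL dedup; on cross-posts (HN x BBC) 2 rows matched at once -> scalar_one_or_none raised. - _normalize_url: corrected from removing the whole query -> removing only tracking parameters (prevents collapsing 870 rows on query-identified sites such as hada.io/topic?id=; review gate). - Lookup uses .first() + edit_url IN (normalized, raw) to tolerate legacy rows. Both the RSS and NYT paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-09 22:26:22 +00:00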
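hyungi	34eb5c9411	refactor(workers)!: remove SMTP mail sending entirely Mail sending for digests, email-collection notifications, and law notifications is retired (user decision 2026-06-10). Rationale: the gate (if smtp_host and smtp_user) was always false before 06-07 (silent skip), and once credentials were enabled 100% of sends failed with 553 Sender rejected; not a single message was ever delivered. law_monitor keeps CalDAV VTODO as its single notification channel. Digest .md generation / 90-day archive and email IMAP collection are unchanged. The send_smtp_email string blacklist in eid dispatch intentionally remains (consistent with the strengthened code-layer revocation). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-09 22:26:22 +00:00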
hyungi	8e1645dfc9	fix(markdown): align news article md_status pending→skipped News articles are text-native (body = extracted_text), so the markdown stage is never enqueued (only summarize/embed/chunk), yet the default md_status of pending stuck permanently and 30,903 rows never converged → (1) polluted backlog metric (pending 30,930 when actual unconverted ≈ 0) (2) bloated md_status_pending partial index. Aligned to the terminal state skipped (not a conversion target). - news_collector.py: both the RSS and API paths create Documents with md_status=skipped + an explicit md_extraction_error reason (consistent from creation time). - documents/[id]/+page.svelte: the article view does not pass mdStatus to MarkdownDoc (null). The badge is driven only by mdStatus → even when skipped, the "Markdown 제외" (Markdown excluded) chip does not appear on 30K articles (articles are not markdown conversion targets, so the badge itself is meaningless). - Backfill UPDATE of the existing 30,903 rows (run separately): pending 30,930→27, and the partial index shrinks accordingly. Verification: 27 pending remain (eml/doc/xls/image/media long tail) / search unaffected (article extracted_text and chunks unchanged) / only md_status changed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 06:22:04 +00:00
hyungi	55216271a6	feat(markdown): persist hwp raster images to NAS + library backfill script Persists to the NAS the raster images that pyhwp (hwp5html) extracts into bindata/. Previously they were discarded with the conversion tempdir and lost silently, without a warning (diagrams, formulas) (adversarial review MEDIUM). - office_md.py: _run_hwp5html runs hwp5html once → (markdown, raster_images). New convert_hwp_to_md_and_images() = for marker_worker's image path. hwp5html does not anchor images in the body xhtml with <img> (same with --css/--html), so inline positions cannot be restored → the caller attaches them as a trailing gallery. OLE formulas/shapes are neither anchored nor raster, so they are excluded from persistence. - marker_worker._process_office: .hwp rasters are persisted to NAS via marker (PDF)'s _persist_images_to_nas + document_images UPSERT (_sync_document_images, cleans orphans on reconversion) + a trailing "## 첨부 이미지" (attached images) docimg: gallery in the md + quality.warnings hwp_images_appended. Images in docx/xlsx/pptx/hwpx are not processed (existing behavior kept). - scripts/backfill_hwp_library.py: one-off ingest of the .hwp files in a given PKM folder as category=library after content-hash dedup (absorbs Inbox duplicates + _1/copy files). Verification (E2E): 18 files in Knowledge/Engineering → 5 new after dedup (Industrial Safety Engineer exam, subjects 3~7) ingested, 5/5 success. The 3 rasters of subject 4 → NAS extracted_images/35778/img_001~003.jpeg exist + 3 document_images rows (engine=pyhwp) + md gallery docimg refs. No gallery is created for documents without images. Zero regression on the text/table path (4 existing docs reconverted successfully). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 05:10:45 +00:00
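hyungi	d0994a1bce	fix(markdown): replace libhwplo with pyhwp for hwp conversion + strip the xml prolog The libhwplo filter bundled with LibreOffice cannot read real Hancom HWP5 binaries (rc=0 + "source file could not be loaded"), so every HWP failed (0/4). Replaced with pyhwp (CLI hwp5html), a pure-Python converter dedicated to HWP5. - office_md.py: .hwp → _via_pyhwp_html (hwp5html→index.xhtml→markdownify). The <?xml?> declaration in hwp5html's xhtml leaked into the md body through markdownify's PI parsing, and its ~34 chars defeated the _MIN_BODY_CHARS(16) empty-output gate (an empty conversion reported as false success, violating the module invariant) → the prolog is stripped with re.sub before markdownify. - .hwpx is unsupported by pyhwp → LibreOffice fallback kept. - marker_worker.py: engine labels .hwp→pyhwp / .hwpx→libreoffice_hwp / else→markitdown. - requirements.txt: pyhwp + six (a runtime dependency pyhwp does not declare). Verification: 4 HWP5 files (welding WPS/PQR, Industrial Safety Engineer exam subjects 1 and 2, principles summary) 4/4 success, Korean text intact, tables preserved as GFM, zero xml artifacts. No regression on existing format paths (docx/xlsx/pptx, pdf, passthrough, hwpx) (confirmed by a 2-lens adversarial review). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 04:19:37 +00:00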
hyungi	aeb9290cbd	feat(documents): hier section char_start offset (Path B): md_content jump via builder offset plan ds-outline-anchor-b5 (g1~g6 code). Turns the 0% jump rate of windowed sections in core ASME/law documents into 100% deterministic jumps using a server-computed char_start (builder offset). - g1 migration 318: document_chunks.char_start INTEGER NULL (single statement, idempotent) - g2 builder: char_start emit = mirrors the FE line/offset model (split('\n') + UTF-16 code units + code-fence skip). window-child=NULL, split-parent=heading offset, preamble=NULL, CR not stripped, NFC=telemetry. node.text preserved (the line model is hash-neutral) → hash_stable docs preserved. 7 unit tests. - g3 persist+backfill hybrid: * persist INSERT char_start * update-char-start (g3-tU): non-destructive for hash_stable docs: 100% jump-target VERIFY (NEW-1) + position-aligned PK UPDATE (NEW-2); docs that fall short are DEMOTEd → join re-decompose (NEW-4) * --reprocess (g3-t2): md_content source (g0-t1) + jump-target-set completion marker (B1) + B_jumptarget>=1 (B3), --doc required else REFUSE. self-heal sweep (g3-t3). - g4 /sections: char_start in inner+outer SELECT + split-parent exposed (is_leaf OR %_split) - g5 FE: resolveAnchorMap (BE-first, NEW-5 jump-target-candidate-scoped fallback, C1 OR-exclude), per-render-site basis guard (C3), endsWith('_split') correction + collapseWindows absorbs split-parent (C2). 25 unit tests (incl. NEW-5/B4/C1/C2). - g6 hier_outline_quality_gate.py: read-only g-measure (verdict/B_jumptarget/hash_stable/dup/fence) Deployment (g7: --no-deps, snapshot, UPDATE-only 32 + re-decompose 230∪demote, accuracy gate) is a separate ops step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 10:12:26 +09:00
hyungi	daf6a0ade9	feat(documents): S1 dedup, office-md, storage scaffold (B/C/D/E) Remaining implementation of plan ds-s1-backend-1 (A and C-1 are in `16b0fe1`): - B duplicate check: services/dedup.py (OFF-list shared with law_monitor) + filled at upload (B-1) + GET /documents/duplicates (B-2) + async post-upload near-dup (B-3) + backfill_dedup.py (B-4) + nightly dedup_reconcile job (03:30 KST, idempotent recomputation) - C MD-first: marker_worker office/hwp branch _process_office (C-2) + md_status state-machine postcondition success\|failed (C-5) + backfill_nonpdf_markdown.py (C-4) + requirements markitdown - D storage: services/storage ABC + Range contract / LocalBackend / NasApiBackend 503 (D-1) + /file goes through the resolver, local behavior unchanged (D-2) - E operations: pre-change pg_dump + rollback_287.sql + apply runbook (E-3) + tests (E-1) The non-destructive invariant is kept (existing response shapes unchanged, md_status success→completed read-time mapping). One finding confirmed by adversarial review (stale duplicate_of when a canonical is promoted after a soft-delete) → reconciled via B-1 promotion normalization + nightly recomputation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 03:05:30 +00:00
hyungi	68e2d7ea04	feat(documents): S1-ADD 3 columns for dedup and original name + md_status success→completed mapping (A) + office→md PoC (C-1) plan ds-s1-backend-1 (converged at r5). Code staged only; migration not applied (restart on hold, E-2 Soft Lock exception window). A (minimum line that keeps app v1 decoding non-destructive): - A-1 migrations/287_documents_dedup_fields.sql: original_filename TEXT / duplicate_of BIGINT FK ON DELETE SET NULL / duplicate_count INTEGER NOT NULL DEFAULT 0. Single statement, PG16 fast-path, no BEGIN/COMMIT. Backfill not included (B-4). - A-2 app/models/document.py: 3 mapped_column entries in the layer-1 block (+ ForeignKey import). md_* already exist. - A-3 app/api/documents.py: DocumentResponse 3 fields (duplicate_count=0 non-opt) + DocumentDetailResponse field_validator (success→completed, mode=before): read-time, one-way DB→API, not applied on write (ORM). - A-4 tests/test_s1_dedup_shape.py: success→completed behavior + non-success passthrough + 3-field defaults/roundtrip + ds-app contract fixture decode (skip-if-absent). py_compile OK. ★ This is the backend half; full non-destructiveness is the AND with the S3 render test. C-1 PoC (worker not connected; C-2 connects the marker_worker branch): - app/workers/office_md.py: OOXML=markitdown (new dep, lazy) / hwp·hwpx=LibreOffice headless→HTML→markdownify (existing dep). Failure, empty output, timeout, or missing dep → raise OfficeMdError (success + empty md is forbidden = the converter contract of the C-5 postcondition). - scripts/poc_office_md.py: table-fidelity measurement harness. E-1 = run in a safe context pinned to the prod LibreOffice version (the hwpx filter is version dependent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 03:05:30 +00:00
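hyungi	5a19cde38c	fix(documents): make domain tree counts match the document-box list exclusions The tree (/documents/tree) excludes only deleted rows and counts news/law/memos, whereas the document-box list excludes source_channel news/law_monitor + file_type note by default → mismatch of 'the tree says N but clicking shows 0' (e.g. all 5 in Philosophy/Aesthetics are news+note, so clicking shows 0). Applying the same exclusions to the tree query makes the count match what is actually shown. Effect: counts normalized, e.g. Philosophy 12→2, General 189→84. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 09:57:47 +09:00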
hyungi	6a85087b83	feat(eid): Eid persona substrate W2~W4: DS compose, weakness diagnosis, code-layer egress revocation Document Server side build (W2~W4) of the 'Eid' persona substrate that runs through all local LLMs. Design = PKM eid-persona-substrate (converged r1~r3) / impl = eid-persona-impl. W2 (compose + surface wiring): - app/eid/compose.py: persona→rules→overlay→task as a single system string + static ROUTE_MAP (not runtime sniffing) + fail-loud when rules are absent, quiet when persona is absent, fail-loud on overflow. - The 3 free-prose surfaces (react_ask, study_subject_note, study_question_explanation): trimmed duplicate identity and generic policy + compose wiring (additive system parameter on AIClient). Domain calibration preserved. - STRICT JSON machinery (briefing_comparative, digest_topic) is frozen at persona-ZERO (invariant #3). - app/prompts/substrate/: persona (externally compiled artifact, vendored) + rules (generation-guard subset) + 5 overlays. W3 (migration + worker + study_diagnosis): - migrations 301~305: eid_* append-only ledgers (weaknesses / review drafts / retrospectives) + approval_requests (mutable queue) + 2 schedule-derived views. - app/workers/study_weakness.py: derives weaknesses by aggregating study_question_progress.pattern_state (zero LLM) + bounded tiers (watch/review/focus). Nightly cron. - study_diagnosis surface: translates the latest snapshot into coach language (the weakness judgment is code; the LLM only quotes block values). W4-1 (code-layer egress revocation): - app/eid/ai.py EidAIClient: Eid surfaces = call_primary (internal MLX) only. External LLM fallback paths are structurally sealed (call_fallback raises, automatic fallback removed, external endpoints blocked). Egress workers stay separate. 3 load-bearing corrections (forced by environment grounding, not design regressions): - rules = the whole operational ruleset → generation-guard subset (HTML artifact rules conflict with study tasks). - append-only = REVOKE → CREATE RULE DO INSTEAD NOTHING (REVOKE is ineffective for a single owner role + the migration validator rejects plpgsql BEGIN) + actor/source_* NOT NULL stamps. - Eid LLM sealing = path discipline → structured as EidAIClient. Verification: 30 pure eid unit tests pass + py_compile + migration-validator simulation + egress adversarial audit COMPLETE. Tests that depend on DB/LLM/httpx (append-only RULE, EidAIClient, E2E) run on staging (Docker). The W4-2 network belt is conditionally deferred (the code layer is sufficient as a first line; promoted if hard-gated after the P0-3② remote measurement). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 15:13:20 +09:00
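hyungi	c12c04a9b1	fix(study): review queue cold start: /due includes newly approved cards (first recall) B2 /due returned only cards with due_at<=now (those that have progress) → progress is created only by rate_card (=/rate), and /rate evaluates only /due cards → a circular gap in which newly approved cards could never enter the SR queue. The review track never filled. - Rewrote /due as an outerjoin: new (no progress = before first recall) OR scheduled due (due_at<=now, stage<4). Scheduled due first, new (due NULL) after. Consistent with the 'due after first recall' rule and the mockup ('today's review' includes stage0 new cards). - For a new card, the backend sets no due on '암' (known) (memorized → excluded from the queue, prevents queue explosion), so the label is aligned to correctLabel(null)='안 나옴' (will not appear) (the previous '+3일' was a false label). For queue stage0, '암' stays '+3일' (+3 days). Verification: py_compile OK. New card rated 암 → progress (due null, not re-asked) / 애매 (unsure) or 모름 (don't know) → due tomorrow / queue stage advance unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 11:45:07 +09:00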
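hyungi	861db96305	feat(study): card SR mobile study UI: two tracks, review / just-study (B3) UI for studying vetted cards on mobile. Review (SR) = front-side recall → reveal → 3-level self-rating (모름/애매/암: don't know / unsure / known) / just-study (cram) = quick flips in least-seen order + mark as seen (independent of SR). - New page /study/cards-study (+page.svelte): landing track selection, progress bar, results (session tally), empty/loading states, cram format filter, keyboard (Space reveal, review J/K/L, cram Enter). iPhone 15 PM first, sage tokens. - The '암' (correct) button has a dynamic per-stage label (+3/7/14 days, graduated); 모름/애매 = tomorrow. correctLabel mirrors sr_schedule REVIEW_INTERVAL_DAYS (label only; the backend is the source of truth for the arithmetic). - API: added review_stage to the /study-cards/due CardItem (filled only in the review queue, for the dynamic label). Extended _build_card_items(session,cards,stages); /due changed to select(card, progress.review_stage). - Entry points: hub '암기카드 학습' (flashcard study) card + updated upcoming list / '학습' (study) button in the vetting UI header. Verification: py_compile OK · 4-dimension adversarial review (runes, API contract, SR rules, UX) passed (0 confirmed actions, 2 findings were false positives). A local vite build is not possible (node_modules missing) → deployment is the compile gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 11:37:19 +09:00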
hyungi	0d274cc5fe	feat(study): card SR writer + two-track API (B2: review / just-study) Backend for studying vetted cards. Review (SR) = immediate automatic enqueue / just-study (cram) = records views, independent of SR. - migrations 299 (idx_card_progress_due partial) + 300 (study_memo_cards view_count/last_viewed_at). - StudyMemoCardProgress model (mirrors 294, UNIQUE user+card) + rate_card (get-or-create → sr_schedule.advance/first_due; immediate automatic enqueue: an unsure/don't-know rating sets a due right away, a known rating sets no due). - StudyMemoCard view_count/last_viewed_at + record_card_view helper (cram, independent of SR). - API: GET /study-cards/due (review queue, vetted cards only) · POST /{id}/rate (self-rating, read-time mapping) · GET /deck (cram, least-seen order) · POST /{id}/view (records a view). Verification: boot + 8 routes registered · 287~300 applied ephemerally (indexes and columns checked) · sr_schedule regression 7/7 (B1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 10:18:17 +09:00
hyungi	e1da984e08	refactor(study): extract SR arithmetic into shared sr_schedule.py (B1: foundation for card SR) Pure functions extracted so that question SR and card SR reference the same interval constants and arithmetic. No change in production behavior. - app/services/study/sr_schedule.py: REVIEW_INTERVAL_DAYS{1:3,2:7,3:14}/MASTERED=4/FIRST_DUE=1 + advance(stage,outcome,now)→(new_stage,new_due) \| None(skipped) + first_due(now). Entry gates (due_at IS NOT NULL / first due / skipped unchanged) stay at the call sites (finalize vs review-complete policy difference). - session_finalize.py: constants and advance branches → sr_schedule import + sr_advance() (re-export kept). - study_question_progress.py: DEFAULT_FIRST_DUE_DAYS → sr_schedule import. - Regression tests 7/7: advance 1·3·7·14, graduation, reset, skipped unchanged, constants + byte-equivalence with the old logic across all stage×outcome combinations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 10:11:38 +09:00
hyungi	e9a95934ef	feat(study): card vetting grouping: group manual (directly added) cards by material + expose source_kind So that directly added material cards (source_kind='manual', no source question) do not clump into a single null group in the vetting UI: one group per extra.material ("[자료] ...") + CardItem.source_kind exposed (frontend label '직접 추가 자료', directly added material). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 09:41:13 +09:00
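hyungi	b9f2ade55e	feat(study): flashcard vetting UI: backend card review API + SvelteKit /study/cards-review First vetting screen for viewing 577 cards (needs_review=true) and accepting/editing/discarding them (item 1 of the study flow's 'last square'). - Backend app/api/study_cards.py (prefix /api/study-cards): GET (grouped by source question, with evidence), needs-review/count, PATCH (approve: needs_review=false / on edit: recompute dedup_hash + mark vetted), DELETE (soft), approve-batch (per question; there is no approve-all). - Frontend /study/cards-review: responsive group list (question + cards) · per-card approve/edit (inline)/delete · per-question batch approve · format filter · sage tokens. Entry link + pending-count badge on the study hub. - Copy drift fix: the hub's '예정(Phase 2~)' (planned) section wrongly listed the already running quiz/SRS/stats → planned items corrected to card SRS, mobile, and alarms. Verification: backend boot + route registration OK (4 routes). The frontend build runs under vite at deploy time. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 08:49:11 +09:00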
hyungi	19f544fb5e	feat(study): study memorization notes Phase 1: edit/delete hooks + needs_review queue + alarm material (HR/A) HR/A on top of the extraction pipeline (287~298, separate commit). Zero new migrations (DDL reuses 295~298). - HR edit/delete hooks: PATCH body edit → derived study_memo_cards needs_review=auto (source_changed); soft-DELETE → source_deleted. flag_cards_for_source helper (temporary flag; final cleanup is the worker's supersede). - HR needs_review: PATCH set/clear (flagged_by='user' enforced by the server) + GET /study-questions/needs-review list and count (matches the partial-index predicate; registered before the dynamic {id} route to avoid an int-parsing conflict). - A alarm material: study_topics.focused_at 'studying' toggle + study_reminder cron (09/13/19 KST, due predicate reproduces the quiz_selection SQL, idempotent via time-slot truncation, zero LLM) + GET /api/study-reminders/latest (204 when none). - Tests: guards/normalization 18/18 (quantities appear verbatim in evidence, cue/cloze leakage, dedup, batch). Verification: app boot import + mapper OK · guards 18/18 PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 08:08:55 +09:00
hyungi	0a7402b327	feat(study): study memorization notes Phase 1: card_extract extraction pipeline (purely additive) study_memo_cards extraction pipeline + version-key poller + needs_review columns. Production SR code (session_finalize/quiz_selection) untouched. - migrations 287~298: study_memo_cards/_evidence/_jobs/_progress (dormant in P1), study_reminders, study_topics.focused_at, 3 needs_review columns on study_questions. dedup PARTIAL UNIQUE (deleted_at IS NULL). - Worker: in-process RAG gather → MLX {cards} → card guards (quantities appear verbatim in evidence, cue/cloze leakage, dedup) → supersede retires old versions → append. A separate consumer isolates it from the existing study_queue. - Poller study_card_enqueue: idempotent via version-key NOT EXISTS (source_version) + ai_explanation_generated_at NOT NULL guard + per-poll LIMIT (thundering herd). - Verification: 12 migrations applied OK on a real prod schema dump + dedup/supersede/active-unique functionality 7/7 PASS + normalization util 15/15. plan: PKM plans/2026-06-05-study-memo-card-p1-plan.html Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-06 21:33:12 +09:00
hyungi	f269e0df27	ops(news): visibility guard for chunk_worker news_source mapping failures _lookup_news_source returned (None) silently on a prefix mismatch → added a warn log. Symmetric with the loader's drop log; a new source or a recurrence of RSS category contamination becomes visible immediately. Zero behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 23:39:14 +00:00
hyungi	aa2d7814e3	feat(digest): date picker URL sync + article→document routing + country flags and Korean names - New GET /api/digest/dates (mirrors the briefing /briefing/dates pattern, read-only) - topic article title enrichment (one batched documents query + dedupe(set) + map-miss=null → frontend '(제목 없음)', i.e. no title) - /digest rewritten: ?date=&country= URL sync (sharing, back navigation), country tabs = inline SVG flag + Korean name, articles = /documents/{id} links (top 5 + expand) - Follow-up to Phase 4.5 (PR #22). Verification: py_compile, dates/enrich queries (275 resolved, 0 misses), frontend docker build PASS. Visual render verification awaits the preview gate Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 23:39:07 +00:00
hyungi	2f152911f7	feat(search): /ask corpus_variant + exact_knn (EVAL-ONLY) for passage-RAG diagnose PR-DocSrv-Hier-PassageRAG-Diagnose-1 c1. Swaps the chunk leg of /ask evidence retrieval for the measurement views (prehier/hier_sim_*) + exact_knn, to compare passage evidence units (hier sections vs legacy windows). Same pattern as /search, passed to run_search. Recorded as EVAL-ONLY; with the default (unspecified), /ask is byte- and behavior-identical to before (zero regression). Pattern validation → invalid values return 422. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 06:14:59 +00:00
hyungi	100aaa3b0c	feat(search): corpus_variant + exact_knn measurement dispatch (replace-diagnose c4+c5) PR-DocSrv-Hier-Replace-Diagnose-1 c4+c5. Non-destructive measurement hook for the hier vs prehier (legacy) go/no-go. - 3 measurement views (hier_measure_views.sql, additive/droppable): corpus_chunks_prehier (includes legacy + 375 null-source rows) / hier_sim_raw / hier_sim_clean (excludes childless-tiny<30; all-tiny docs fall back to legacy for consistency). - retrieval_service: _resolve_corpus_variant + CORPUS_VARIANT_MAP + 3 views added to _VALID_CHUNKS_TABLE + exact_knn (SET LOCAL enable_indexscan/bitmapscan=off, eval only). Only the chunk leg is affected (doc-level + fts/trgm run on documents and are unrelated). Zero regression on the baseline/None path. - search_pipeline.run_search + search.py: pass corpus_variant/exact_knn, unknown→400, simultaneous use with an embedding_backend cand forbidden (400). - run_eval: --corpus-variant + --exact-knn flags. - tests/test_corpus_variant.py 22 PASS (resolver/map/allowlist + SQL injection rejection). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 05:37:15 +00:00
hyungi	f7198d9d68	feat(search): expose hier section outline & summaries in document detail PR-DocSrv-Hier-Section-UI-1 Phase 1 (code + commit only; deployment follows once the Phase 2 backfill completes). - backend: GET /documents/{id}/sections: hier leaf outline + chunk_section_analysis summaries. Queries document_chunks directly (this is outline display, not retrieval, so the corpus_chunks view is bypassed intentionally; stated in the docstring). DISTINCT ON picks the latest analysis row. - frontend: SectionOutline.svelte (left outline, per-doc dynamic group/flat, window dedupe, inline summary/breadcrumb on click), headingPath.ts pure utility (+ node:test unit tests, 8 cases). [id]/+page.svelte 3-zone layout + right-side meta Tabs [정보\|AI\|관리] (Info\|AI\|Admin) resolve the card sprawl. - Documents without sections / 404 hide the outline (graceful). Jumping into the body is a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 00:22:34 +00:00
hyungi	cfadaaffd9	feat(search): hier section per-leaf analysis scaffold (Section-Summary-1 c1) chunk_section_analysis table (migration 286) + ORM model + pilot script. A section-level analysis axis separate from document_chunks (retrieval-hot). domain is inherited, section_type is a section-only role enum, status records skips, source_content_hash detects staleness. Script-only (scripts mount, no rebuild needed). Zero-LLM dry-run verification = doc 5225: 147 analyze + 17 skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:45:30 +00:00
hyungi	a7b16b63db	feat(search): doc-level atomic corpus replace + isolation test (Hier-Decomp-1 c5) replace_doc_corpus(dry_run): after verifying the G5 preconditions (doc-local embed 100% + parent integrity + leaf>0), performs an atomic swap in a single transaction (legacy in_corpus=false / hier leaf in_corpus=true, predicate=is_leaf AND embedding NOT NULL, node_type unused). No physical deletion. rollback_doc_corpus toggles it back. If the preconditions are unmet, nothing changes (legacy kept). tests/hier_decomp/test_corpus_isolation.py: asserts zero leakage of in_corpus=false leaves into corpus_chunks (regression guard for the double choke point of partial ivfflat + view). c5: dry-run on 3 pilots precond_ok (5140 158L→271leaf / 5186 381→199 / 5225 18→164), isolation test PASS. The actual replace is c6 (1-doc-first). plan: hierarchical-decomposition-tiered-nesting-marmot.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:14:36 +00:00
hyungi	fa82bd495b	feat(search): hier persist + partial ivfflat index on in_corpus (Hier-Decomp-1 c4) persist_hier_tree(): build_hier_tree → document_chunks insert. source_type=hier_section, in_corpus=false, bge-m3 embeddings only for is_leaf nodes. Idempotent (existing hier rows are deleted, then reinserted). chunk_index = per-doc (max+1) offset → avoids collisions with the existing (doc_id,chunk_index) unique constraint. asyncpg type inference for a NULL embedding parameter → double cast cast(cast(:emb AS text) AS vector). migration 284/285: ivfflat contamination fix. The full index also indexed in_corpus=false hier vectors → approximate search was contaminated by inactive vectors (even with the corpus_chunks filter, the approximate neighbor set shifted). Replaced with a partial index (WHERE in_corpus=true) → in_corpus=false rows are absent from the search index = a no-impact guarantee at the index level. c4 pilot (5140/5186/5225) G3: tree inserted, embed_coverage 1.0 (doc-local 100%), in_corpus_true=0, dangling_parent=0, dup 0. After the partial index the search baseline is IDENTICAL to the original (pre-hier) = verified zero impact from the 691 hier rows (effect of the contamination fix). replace is c5/c6. plan: hierarchical-decomposition-tiered-nesting-marmot.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:12:42 +00:00
hyungi	d982dce7d1	feat(search): rule hierarchy builder (Hier-Decomp-1 c3) Pure function build_hier_tree(text) → tree of segments bounded by headings (no DB access; inserted in c4). - Boundary rules: ATX markdown (#{1,6}) > Korean 제N장/절/조 > English Chapter/Section/Article. - segment = heading + body up to the next heading (disjoint, 100% coverage). parent/level = tree normalized by heading depth. - Oversized own-text (>HARD_MAX 5000) = split into non-overlapping windows (regardless of children); the parent is is_leaf=false (heading marker, excluded from the corpus). - Structure-only heading (has children + own body<30 chars) = is_leaf=false. is_leaf = target for inclusion in the replace corpus. dry-run G2 (no insert, 5 pilots + headingless): - 5140/5186/5225/5151/5124 md_content: coverage 0.9993~1.0, dup_hash 0, empty 0, dangling 0, bad_level 0, leaf_max<=4973(<5000). - 5152 headingless extracted_text (238k): 89 window leaves, coverage 1.0, dup 0, leaf_max 3000. Observation: tiny heading-only leaves (7~19 chars) remain (harmless, tuning candidate). plan: hierarchical-decomposition-tiered-nesting-marmot.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:05:06 +00:00
hyungi	f940f50c60	feat(search): route retrieval through corpus_chunks view (Hier-Decomp-1 c2) Rewires baseline chunk vector search from document_chunks → the corpus_chunks view (in_corpus=true). in_corpus=false rows (inactive hier leaves, etc.) are excluded automatically = search contamination blocked structurally (B choke point). - retrieval_service: baseline chunks_table=corpus_chunks, corpus_chunks allowed in _VALID_CHUNKS_TABLE, snapshot_clause condition includes corpus_chunks (eval snapshot preserved). The candidate (cand_*) path is unchanged. The documents side (FTS + doc embedding) is unchanged; doc rows are unrelated to the replace. - models/chunk: 5 new column mappings (parent_id/level/node_type/is_leaf/in_corpus). server_default leaves existing chunk_worker INSERTs unaffected (legacy = in_corpus true / is_leaf false). - subject_note_rag/explanation_rag: in_corpus=true filter on RAG chunk loading (prevents legacy duplicates for replaced docs). Gates: G4b (rewire invariance) before/after IDENTICAL (currently view==table, a no-op) / G4a (leakage) a synthetic in_corpus=false leaf yields 0 rows in corpus_chunks and is the raw top hit (dist 0.0) in document_chunks, proven in both directions. /health 200. plan: hierarchical-decomposition-tiered-nesting-marmot.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:58:28 +00:00
hyungi	0854c72c70	fix(search): sync doc md_status to failed on permanent markdown queue failure marker_worker sets doc.md_status=processing when conversion starts, but if the conversion dies from an exception (e.g. ReadTimeout on a large batch) without passing through _fail()/_set_skipped(), queue_consumer marks only the queue row as failed and doc.md_status stays stuck at processing permanently = orphan (queue failed, document processing). After the markdown consumer split, this orphan recurred during tail reprocessing (5149/5201), so this commit blocks the root cause. In the _process_stage except block, when a queue item fails permanently (attempts>=max), the stage is markdown, and doc.md_status=processing, sync it to failed. While retries remain (attempts<max), a pending queue row still exists, so the doc is not an orphan and is left untouched. Verification: synthetic permanent-failure path → md_status processing→failed sync PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:06:32 +00:00
hyungi	2edc80d4bb	fix(search): split markdown into dedicated queue consumer to prevent pipeline stall Removes the problem where split conversion of a large PDF (5210 ≈ 40 min measured) occupied the single consume_queue coroutine and stalled the whole pipeline (extract/classify/embed/chunk, etc.). - New consume_markdown_queue: markdown-only scheduler job (id=markdown_consumer) - consume_queue handles only MAIN_QUEUE_STAGES (markdown excluded) - per-stage logic shared through the _process_stage / _load_workers helpers - reset_stale_items(stages, threshold_minutes) parameterized: main=10min (markdown excluded), markdown=MARKDOWN_STALE_MINUTES (default 120). marker_worker records no heartbeat, so this closes the trap where a 40-minute conversion was mistaken for stale after 10 minutes - enqueue flow (classify -> embed,chunk,markdown) unchanged. Splitting out STT/deep_summary and GPU concurrency tuning are out of scope (follow-up). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 10:33:45 +00:00
hyungi	826f66f8f5	fix(search): correct large-doc manifest wording after commit 4 drop PR-DocSrv-LargeDoc-Split-Markdown-1 follow-up (plan brisk-paging-quokka.md). With commit 4 (marker_section→document_chunks) dropped, the wording in the split md_content/manifest, "authoritative search copy = document_chunks (source_type=marker_section)", no longer matches reality. Actual state = the search index is the existing document_chunks (extracted_text long_pdf window chunks), no marker_section chunks exist, and md_content is a Markdown rendering preview. - _build_large_md_content header: "Search index = existing document_chunks long_pdf/extracted_text window chunks. Below is a Markdown rendering preview." - _split_manifest: canonical_storage(marker_section) → search_index(legacy/extracted_text) - constant comments + _process_split docstring: reflect the commit 4 drop and the avoidance of double loading. Prevents debugging from being misled by a source_type that does not exist in the viewer. md_content of the 5 already-processed docs is not reprocessed immediately; it is refreshed on the next natural reprocess (user decision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 09:48:03 +00:00
hyungi	cf0d75fe84	fix(search): handle markdown/fileless docs without marker conversion PR-DocSrv-LargeDoc-Split-Markdown-1 commit 5 (plan brisk-paging-quokka.md). Documents that are already Markdown need no marker conversion → _process_markdown_passthrough loads the file contents (or extracted_text when there is no file) directly into md_content (success); skipped if empty. - _is_markdown_doc: file_format=md/markdown or a .md/.markdown extension - branch location = before file_path validation (to handle fileless md = file_path NULL) - engine=passthrough distinguishes it from marker-converted output. Existing bugs resolved: 43 fileless md docs failed with "no file_path" / .md files were skipped as an unsupported extension → in both cases md_content was never generated. Verification (isolated via docker cp): 13948 (.md + file_path) → success md_len=1805 (file) / 23409 (fileless, 931 chars) → success (extracted_text) / 20237 (fileless, 6 chars) → success. PDF path unaffected (_is_markdown_doc=False). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 08:02:30 +00:00
hyungi	7aaabe2c75	feat(search): split markdown processing for large PDFs (>threshold) PR-DocSrv-LargeDoc-Split-Markdown-1 commit 3 (plan brisk-paging-quokka.md). - page_count gauge branch: small (<=120p) = _process_single, whole document in one shot / large (>120p) = _process_split - MAX_PAGES=200 hard skip removed → large docs are converted sequentially in BATCH_PAGES=40 page-range windows - each batch calls /convert with start_page/end_page (1-based) + per-batch ref rewrite to avoid slug collisions + stitch - _persist_images_to_nas seq_offset → image_key (img_NNN) stays contiguous across batches - md_status success/partial/failed (all/some/no batches succeeded) + failed-batch manifest JSON - large md_content = head + manifest (LARGE_DOC_MD_CONTENT_HEAD_CHARS=50000), canonical=document_chunks (commit 4) - above MARKER_MAX_SPLIT_PAGES=5000 = skipped_too_large safe state. Verification: G1 small-doc regression doc6675 identical (success,6292,14), single path / G2 doc5180 453p→12 batches success, manifest + 207 img (img_001~207 contiguous) / G4 stuck 0, restart 0, each batch <300s. Section chunk loading (G3) = commit 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 07:39:49 +00:00
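
The page-range windowing described in 7aaabe2c75 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the helper name `page_batches` is hypothetical, and it assumes `end_page` is inclusive (the commit message only states that the range is 1-based). `BATCH_PAGES=40` is the value quoted in the commit.

```python
from typing import Iterator, Tuple

BATCH_PAGES = 40  # value quoted in the commit message


def page_batches(page_count: int,
                 batch_pages: int = BATCH_PAGES) -> Iterator[Tuple[int, int]]:
    """Yield 1-based (start_page, end_page) windows covering 1..page_count.

    Hypothetical helper; end_page is assumed to be inclusive.
    """
    if page_count < 0 or batch_pages < 1:
        raise ValueError("page_count must be >= 0 and batch_pages >= 1")
    start = 1
    while start <= page_count:
        # The last window is clipped to the document length.
        end = min(start + batch_pages - 1, page_count)
        yield start, end
        start = end + 1


if __name__ == "__main__":
    batches = list(page_batches(453))
    print(len(batches))             # 12
    print(batches[0], batches[-1])  # (1, 40) (441, 453)
```

For a 453-page document this gives 12 windows, which agrees with the "453p→12 batches" figure reported for doc5180 in the G2 verification.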

334 Commits