hyungi_document_server

Author	SHA1	Message	Date
hyungi	51a7c96b56	feat(clause-kb): paginate over-CAP clause bodies (~11K tok/page)	2026-06-29 23:20:16 +00:00
hyungi	eb83d41ba5	feat(clause-kb): book API (clause table of contents/backlinks) + /book/[id] organic book reader + persist script	2026-06-29 23:13:34 +00:00
hyungi	94b172e314	ops(ci): boot_smoke schema assertion max_migration 361→378 (current migration head) Since the last audit (361), migrations have advanced to 378 (including this round's publish_outbox attempts/failed), so update the hard-coded expected value in the boot_smoke schema gate. The purge/cand/uq expectations are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 13:30:53 +09:00
hyungi	832ea72784	fix(publish): after_id paging loop in backfill scripts (prevents dropped overflow) backfill_publish_* made a single call (after_id=0, limit=PAGE), so rows beyond PAGE were dropped with only a warning. The docstring already specified page iteration, but the scripts did not implement it. Change the function return to (count, last_id) and have the 3 scripts process everything in a last_id-based while loop. PAGE=5000 bounded tx. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 13:22:36 +09:00
hyungi	5b5353c751	fix(publish): import all models in backfill scripts (complete the standalone mapper registry) The app imports every model via its routers, but the standalone backfill scripts imported only some, so SQLAlchemy mapper string relationships (StudyTopic.sessions->StudySession, etc.) failed to resolve and raised InvalidRequestError. Import every models/* module with pkgutil to complete the registry (all are importable in the container = the same set loaded at app startup). Verified by running the 3 backfills: loaded topics 1 · cards 65 · progress 22. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 22:54:40 +00:00
hyungi	08c5213168	feat(publish): S-4 publish pub_card_progress: card SR state read model (study→viewer) Publish the card SR progress rows held by DS (kind=study_card_progress) as a read model. Input for the viewer C-4 review queue/unseen set-difference. plan study-viewer-port S-4. - projection: KIND_CARD_PROGRESS + project_card_progress(card_id·topic_id·last_outcome·last_reviewed_at·due_at·review_stage). ★ALL rows (including the due_at NULL sentinel = 암-on-new and terminal); due-only publishing is forbidden (a missing sentinel makes the viewer misclassify cards as unseen). - enqueue: enqueue_card_progress_publish + backfill_publish_card_progress (no filter). - hook: right after rate_card in /study-cards/{id}/rate (same tx, flag-gated). Single write site. SR computation = DS (sr_schedule unchanged), publishing = result only. - No progress tombstone on card deletion = DS SR preserved (restored on re-approval); orphans are dropped locally by viewer C-4. - scripts/backfill_publish_card_progress.py. py_compile PASS · project_card_progress unit-verified (sentinel due_at=None preserved). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 16:00:10 +09:00
hyungi	af5640ef49	feat(publish): S-2 publish pub_card: reviewed memorization cards (study→viewer) Publish only reviewed (needs_review=false), non-deleted study_memo_card rows (kind=study_card, matching the viewer pubstudy.ts getCards contract). plan study-viewer-port S-2. - projection: KIND_CARD + project_card(format·cue·fact·cloze_text·source_question_id·source_generated_at). - enqueue: enqueue_card_publish = a single publish/tombstone decision based on card state (avoids having to remember per-path guards) + backfill_publish_cards. - authoring hooks (gated by study_publish_enabled): approve-batch (reviewed→publish) · update (edit = re-project / back to pending review = tombstone) · delete (tombstone). - tombstone on paths that lose publish eligibility (zero stale leftovers in the viewer): worker supersede (re-extraction retire) · flag_cards_for_source (source question corrected/deleted). Both fns pre-capture and return only the ids that 'were published' (needs_review=false), avoiding spurious tombstones for unpublished cards. - scripts/backfill_publish_cards.py. py_compile PASS · project_card payload unit-verified (matches the getCards contract). Worker and /published/feed are kind-generic, unchanged. When deployed to a flag-on environment, card publishing starts just like topics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 15:58:16 +09:00
hyungi	63457e6afc	feat(publish): S-1 publish pub_topics: projection + authoring hooks + backfill (study→viewer) Carry topic (study_topic) metadata on the publish layer so the viewer can build quizzes per topic/exam round (currently impossible because topic names are not published). plan study-viewer-port S-1. - publish_projection: KIND_TOPIC + project_topic(topic_id·name·exam_round_size). Exam rounds are not published = the viewer derives them from exam_name/exam_round of pub_content(study_question) (no extra publishing needed). topic_id = the same DS identifier as project_question.topic_id, so it is the viewer's question→topic correlation key (pub_id is opaque and is not a correlation key). - publish_enqueue: enqueue_topic_publish + backfill_publish_topics (bounded page, deleted_at IS NULL). Idempotent = worker dedup on (payload_hash, deleted). - study_topics authoring hooks (all gated by study_publish_enabled): create (flush→enqueue→commit) / update (re-project; with an unchanged payload the dedup does not bump rev = churn 0) / delete (tombstone; raw DELETE forbidden, goes through the worker). - scripts/backfill_publish_topics.py: one-time outbox load of existing topics (overflow guard). Worker and /published/feed are kind-generic (unchanged, measured). Topic publishing starts when deployed to a flag-on environment → viewer acceptance in S-3 (generic upsert · kind-filtered read) is a prerequisite, and that gate has PASSed. Running the backfill and the deploy-order cutover are behind the deploy gate (soft lock), so they are not part of this slice. py_compile PASS · project_topic payload unit-verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-25 13:48:08 +09:00
hyungi	63be005c6f	fix(security): 5 security hygiene items: library admin gate · edit_url SSRF · security headers · 8080 bind · hard-coded password removal M3 library.py: categories POST/PATCH/DELETE + facets POST changed from get_current_user→require_admin (shared taxonomy CRUD restricted from 17 principals→admin only, consistent with the news/digest pattern). M1 documents.py: validate_feed_url guard on edit_url in the update_document PATCH: blocks latent SSRF from a later fetch (fulltext_worker) of internal/metadata addresses (closes the unguarded API layer, same shape as news.py). Caddyfile: security headers (nosniff · X-Frame SAMEORIGIN · Referrer-Policy · -Server). HSTS is the edge's responsibility. compose: caddy 8080:80 0.0.0.0→127.0.0.1 (blocks LAN bypass; real ingress = home-caddy→caddy:80 over the docker network). scripts: hard-coded dead DB password → os.environ (missed by the first audit, whose check was limited to .env). Separately (DB): disabled 12 test-% accounts (shared-pool principals 17→5; random hashes, so no password exposure, just hygiene). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-20 05:48:02 +00:00
hyungi	381fcfc675	ops(ci): full app boot smoke (boot_smoke.py): GPU-isolated deploy-blocker gate Runs the real lifespan path (init_db + all worker imports + all add_job calls) in a prod-image container with an ephemeral PG to detect router/worker import errors and job registration errors. Only the 3 side effects NAS/scheduler.start/prewarm are neutralized (no contact with prod/AI). Measured PASS on GPU: routes=173·jobs=34·schema 361·health ok. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 15:49:28 +09:00
hyungi	3ff1d7c65d	fix(migrations): 3 R1 baseline runtime bugs: init_db asyncpg path (R1 fix) ★Found and fixed during real init_db() runtime verification (the asyncpg path that the psql migration_smoke cannot catch): 1. The baseline dump contained CREATE TABLE schema_migrations → init_db pre-CREATEs it with IF NOT EXISTS, then the baseline's re-CREATE collides. Re-dumped with --exclude-table=schema_migrations (init_db owns it). 2. The baseline is multi-statement, but exec_driver_sql (asyncpg prepared) does not allow multi-statement ('cannot insert multiple commands into a prepared statement'). Loaded with raw asyncpg simple-protocol execute() (same connection = inside the transaction). 3. Migrations 360 (10 DROPs) and 361 (DELETE+CREATE) were multi-statement → init_db failed to apply them. 360 = single comma-separated DROP, 361 = single CREATE UNIQUE INDEX (prod has 0 duplicates and fresh has an empty table, so the dedup DELETE is unnecessary). ★Verification: real init_db run twice with scripts/ci/initdb_runtime_test.py: 1st (fresh: baseline 262 stamped + 359/360/361 applied; documents·purge_col·cand_drop·attempt_unique all confirmed), 2nd (idempotent skip) PASS. The psql migration_smoke also still PASSes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 14:59:47 +09:00
hyungi	0d3c841577	feat(migrations): schema baseline snapshot: fresh-DB/DR boot fix (R1) As R0 demonstrated, replaying all of migrations/ breaks on accumulated non-replayable migrations such as 011 (view active_documents depends on documents.embedding, no DROP COLUMN CASCADE) and 326 (enum-same-txn), which made init_db boot impossible on new/DR environments. Resolved with a standard squash baseline: - migrations/_baseline/0358_schema_baseline.sql: prod schema snapshot (pg_dump --schema-only --no-owner --no-privileges, with psql meta-commands and search_path='' cleaned up = compatible with asyncpg exec_driver_sql). - init_db._load_baseline_if_fresh: when the documents table is absent (fresh), load the baseline + stamp schema_migrations 1..358 → afterwards only post-baseline migrations (359/360) are applied. ★Existing DBs (documents present) are skipped = no prod impact (additive). If the baseline is absent, fall back to the existing replay path (backward compatible). - migration_smoke: verifies the baseline path. ★Measured: previously FAIL (011 abort) → now both FRESH and INCREMENTAL PASS (pg16.14). Regenerate the baseline whenever the cutoff (_BASELINE_CUTOFF=358) is updated. Verification: py_compile + migration_smoke PASS. ★This changes the boot path, so a staging boot check is required before deploy. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 14:16:21 +09:00
hyungi	d8ad097a3a	ops(migrations): fresh-DB/DR replay and enum smoke gate (R0) Mirrors init_db's single-transaction apply path (engine.begin) to verify that all of migrations/ can be applied in one transaction on an empty DB and on a DR (pre-320 → catch-up) upgrade. Pinned to pg16 (pgvector/pgvector:pg16); the ephemeral container is started and cleaned up automatically. Currently both scenarios FAIL at 011_embedding_1024: view active_documents depends on documents.embedding (no DROP COLUMN CASCADE). This is before the enum (326) point. Fresh replay had never been verified, and a lot of accumulated non-replayable cruft was confirmed. The gate criterion is PASS after the fix in R1 (schema baseline snapshot). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 13:11:55 +09:00
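hyungi	ebbcaf86d8	feat(observability): expose off-queue background work (backfill) on the processing machine board processing_queue is dedicated to pipeline stages, so off-queue admin-script work such as hier_overnight_backfill did not show up on the dashboard board, and another session unknowingly recreated fastapi and cut off an in-flight re-decomposition (incident on 2026-06-14). This closes the blind spot. - migrations/357_background_jobs.sql: background_jobs table (kind/label/state/processed/total/heartbeat). Separate from worker_jobs (user_id required, worker-pool only). - services/background_jobs.py: start/heartbeat/finish helpers: autonomous transactions (immediate commit → real-time visibility) + best-effort (an observability failure does not break the main job). - hier_overnight_backfill: instrumented at job start / heartbeat about every 10 sections / finish. - queue_overview: adds background_jobs to the /api/queue/overview response (running + completed within the last 6h, stale = heartbeat presumed lost). A SAVEPOINT keeps the board itself unaffected if the table is missing or errors. - ProcessingFlowBoard: "Background jobs" panel (progress/elapsed/state, warning on a stale heartbeat). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-14 12:27:18 +09:00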
hyungi	9a7e231dcc	fix(safety): verify_statute_chain sys.path: auto-detect the /app root (workers import) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 12:44:58 +09:00
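hyungi	1646617a31	feat(safety): B-1 PR③: statute chain verification script with 3 predicates (read-only diagnostic) plan safety-library-1 B-1 PR③. Doubles as the E-1 statute gate tool (safe to run repeatedly): - ① Existence: exactly 1 primary current per watch family + ≤1 per annex series - ② Exposure uniqueness: 1 exposure per family that has a primary current (absorbed into ③a) - ③ Orphan net: normalized-equivalence mapping: 0 missed flips (legacy exposed for a current family) and 0 unmapped (mapping holes) - repealed families are exempt from ①②. Exit code 0/1 (for the observation gate) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 12:44:25 +09:00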
hyungi	0c8fb41366	fix(safety): backfill text() misreads a colon as a bind: switch to exec_driver_sql text() interpreted the colon in the regex '(?:' as a bind param (same pitfall as the migration runner). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 06:49:58 +09:00
hyungi	e5ddd0e4d6	feat(safety): A-3 backfill script: retroactively apply classification axes to the existing corpus (corrected predicates) plan safety-library-1 A-3. Reflects prod measurements: - KGS frontmatter = 'code' key confirmed (117/118, kgs_code 0) → path predicate - 243 legacy law docs: extract_meta is empty, so the promulgation date is extracted from '(YYYYMMDD)' in the title - GUIDE ofancYmd measured as 'YYYY-MM-DD' - KOSHA body = source_id JOIN (kind='case' absent; R2 blocker correction kept as-is) - dry-run = transaction ROLLBACK approach (exact rowcount + verification table, 0 changes) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 06:48:30 +09:00
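hyungi	d3aa640f65	feat(documents): hier `analyze` subcommand: section-analysis self-heal independent of re-decomposition (g3-t3 gap) The char_start completion marker of re-decompose is 'has jump-target char_start', so docs whose analyze was cut short by a container recreate/deadline (char_start present but some leaves unanalyzed) could not be re-selected → a gap where the rail summary never converges. `analyze` converges through independent selection based on LEAF_SQL (has unanalyzed leaves) (idempotent, can be limited with --doc, independent of jump). The sweep log is also updated to point to the `analyze` command. (Makes permanent the case from the 2026-06-10 backfill where 5 docs · 53 leaves cut off by a recreate were handled manually.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-11 11:16:44 +09:00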
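hyungi	e10ccc9169	fix(documents): g-measure junk detection: remove all-caps over-detection + state that verdict=coarse screen The all-uppercase heuristic over-detected 130 legitimate technical-document headings (GENERAL REQUIREMENTS/WELDING) → windowed/clean docs were falsely demoted to A_better. Now only company suffixes (INC./LLC…) count, gated to the cover area (first 4 nodes) + not stored. The docstring records that the verdict is a coarse screen (for auditing) and that the actual enforcement decision = deterministic partition + adversarial workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-11 11:16:44 +09:00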
hyungi	1842f27d89	feat(news): crawl-24x7 cycle 2: B-2/B-3/C-1/C-2/C-3/C-5 (migrations 324-326) - Channel awareness: news_sources.source_channel (324, reuses the documents enum) → branches for document creation identity (_doc_identity), the embed/chunk 30-day gate (crawl = index everything), and the post-extract follow-up override (crawl→classify, skip preview). - B-2 Guardian Open Platform: API dispatch (host-based branching, unknown host = explicit failure) + full-text adapter with show-fields=bodyText. Fixture captured from live + call-shape test. - B-3 subscription outlets: isolated playwright-fetcher container (concurrency 1, one browser per request, storage_state ro mount) + politeness human-speed (30-60s) browser path + fulltext auth routing (content-based probe gate; consuming relogin_requested comes before the open-skip; body paywall-marker gate) + source_health probe column (325) + session capture script (for MacBook). - C-2 KOSHA: 3 APIs verified live with fixtures captured (board/attach/guide): daily diff of accident cases + attached PDF/HWP → extract pipeline, gradual GUIDE backfill under a daily cap (no silent cap; logged). The key is concatenated directly into the URL (avoids the re-encoding pitfall). daily 06:40 KST. - C-3 static corpus: batch CLI for National Board 86 + TWI job-knowledge 153 (idempotent, politeness, crawl_raw preserved, same promoted-field convention as fulltext_worker). - C-1/C-5 seeds (326): all URLs verified live: UK HSE (feed-full)/Safety Newspaper (안전신문)/Ministry of Employment and Labor ×3 (rss/*.do)/OSHA/EU-OSHA (candidate)/SEP/1000-Word (feed-full)/Doing Philosophy/Aeon/Psyche (skip-video quirk). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:08:18 +09:00
hyungi	55216271a6	feat(markdown): persist hwp raster images to NAS + library backfill script Persist to NAS the raster images that pyhwp (hwp5html) extracts into bindata/. Previously they were discarded along with the conversion tempdir, a silent loss (diagrams, equations) with no warning (adversarial review MEDIUM). - office_md.py: _run_hwp5html runs hwp5html once → (markdown, raster_images). New convert_hwp_to_md_and_images() = for the marker_worker image path. hwp5html does not anchor images as <img> in the body xhtml (same with --css/--html), so inline positions cannot be restored → the caller attaches them as a gallery at the end. OLE equations/shapes are neither anchored nor raster, so they are excluded from persistence. - marker_worker._process_office: persists .hwp rasters to NAS via marker's (PDF) _persist_images_to_nas + document_images UPSERT (_sync_document_images, cleans up orphans on reconversion) + a `## 첨부 이미지` (attached images) docimg: gallery at the end of the md + quality.warnings hwp_images_appended. docx/xlsx/pptx/hwpx images are not processed (existing behavior kept). - scripts/backfill_hwp_library.py: one-off ingest with category=library of the .hwp files in the specified PKM folder after content-hash dedup (absorbs Inbox duplicates + _1/카피본 copies). Verification (E2E): Knowledge/Engineering 18 files → 5 new after dedup (Industrial Safety Engineer exam subjects 3~7) ingested, 5/5 success. Subject 4: 3 rasters → NAS extracted_images/35778/img_001~003.jpeg exist + 3 document_images rows (engine=pyhwp) + md gallery docimg refs. No gallery is generated for documents without images. 0 regressions on the text/table path (4 existing docs reconverted with success). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 05:10:45 +00:00
hyungi	448195637b	fix(documents): correct the g-measure verdict to compare jump-target against jump-target The keep-better verdict of hier_outline_quality_gate compared build jump-targets (n_b, excluding window-children) against all stored leaves (n_a, including window-children) → removes the bias where windowed docs were falsely demoted to A_better because n_a≫n_b. Stored now also counts only jump-targets ((non-window leaf OR %_split) + title). After the fix, hash_stable 31 (≈MEASURE2 32, fence-flip 1) · dup_title 8 · in_corpus 3 (5140/5186/5225) are all UPDATE-only = consistent with MEASURE2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 11:54:01 +09:00
hyungi	aeb9290cbd	feat(documents): hier section char_start offset (Path B): md_content jump builder offset plan ds-outline-anchor-b5 (g1~g6 code). Turns the 0% jump rate of key ASME/statute windowed sections into 100% deterministic jumps via server-computed char_start (builder offset). - g1 migration 318: document_chunks.char_start INTEGER NULL (single statement, idempotent) - g2 builder: char_start emit = mirrors the FE line/offset model (split('\n') + UTF-16 code units + code-fence skip). window-child=NULL, split-parent=heading offset, preamble=NULL, CR not stripped, NFC=telemetry. node.text preserved (line model is hash-neutral) → hash_stable docs preserved. 7 unit tests. - g3 persist+backfill hybrid: * persist INSERT char_start * update-char-start (g3-tU): non-destructive for hash_stable docs: 100% jump-target VERIFY (NEW-1) + position-aligned PK UPDATE (NEW-2), docs that fall short are DEMOTEd → join re-decompose (NEW-4) * --reprocess (g3-t2): md_content source (g0-t1) + jump-target-set completion marker (B1) + B_jumptarget>=1 (B3), --doc required, else REFUSE. self-heal sweep (g3-t3). - g4 /sections: char_start in inner+outer SELECT + split-parent exposed (is_leaf OR %_split) - g5 FE: resolveAnchorMap (BE-first, NEW-5 jump-target-candidate-scoped fallback, C1 OR-exclude), per-render-site basis guard (C3), endsWith('_split') fix + collapseWindows absorbs split-parent (C2). 25 unit tests (including NEW-5/B4/C1/C2). - g6 hier_outline_quality_gate.py: read-only g-measure (verdict/B_jumptarget/hash_stable/dup/fence) Deployment (g7: --no-deps, snapshot, UPDATE-only 32 + re-decompose 230∪demote, accuracy gate) is a separate ops step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 10:12:26 +09:00
hyungi	c8d8df6b2d	fix(migrations): s1 dedup 287->317 renumber (avoids a collision with main 287=study_memo_cards)	2026-06-08 03:07:53 +00:00
hyungi	daf6a0ade9	feat(documents): S1 dedup·office-md·storage scaffold (B/C/D/E) Remaining implementation of plan ds-s1-backend-1 (A·C-1 are in `16b0fe1`): - B duplicate detection: services/dedup.py (OFF-list, shared with law_monitor) + populated on upload (B-1) + GET /documents/duplicates (B-2) + async post-upload near-dup (B-3) + backfill_dedup.py (B-4) + nightly dedup_reconcile job (03:30 KST idempotent recompute) - C MD-first: marker_worker office/hwp branch _process_office (C-2) + md_status state-machine postcondition success|failed (C-5) + backfill_nonpdf_markdown.py (C-4) + requirements markitdown - D storage: services/storage ABC+Range contract / LocalBackend / NasApiBackend 503 (D-1) + /file goes through the resolver, local behavior unchanged (D-2) - E operations: pre-change pg_dump + rollback_287.sql + apply runbook (E-3) + tests (E-1) Non-destructive invariants kept (existing response shapes unchanged, md_status success→completed read-time mapping). 1 finding confirmed by adversarial review (stale duplicate_of when a soft-deleted canonical is promoted) → made consistent via B-1 promotion normalization + nightly recompute. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 03:05:30 +00:00
hyungi	68e2d7ea04	feat(documents): S1-ADD dedup·original-filename 3 columns + md_status success→completed mapping (A) + office→md PoC (C-1) plan ds-s1-backend-1 (converged at r5). Code staged only: migration not applied (restart on hold, E-2 Soft Lock exception window). A (minimum line that does not break app v1 decoding): - A-1 migrations/287_documents_dedup_fields.sql: original_filename TEXT / duplicate_of BIGINT FK ON DELETE SET NULL / duplicate_count INTEGER NOT NULL DEFAULT 0. Single statement · PG16 fast-path · no BEGIN/COMMIT. backfill not included (B-4). - A-2 app/models/document.py: 3 mapped_column in the layer-1 block (+ ForeignKey import). md_* already exist. - A-3 app/api/documents.py: 3 fields on DocumentResponse (duplicate_count=0 non-opt) + DocumentDetailResponse field_validator (success→completed, mode=before): read-time DB→API one-way, not applied on write (ORM). - A-4 tests/test_s1_dedup_shape.py: success→completed behavior + non-success passthrough + 3-field defaults/roundtrip + ds-app contract fixture decode (skip-if-absent). py_compile OK. ★ This is the backend half: full non-breakage is the AND of this and the S3 render tests. C-1 PoC (worker not wired: the marker_worker branch is connected in C-2): - app/workers/office_md.py: OOXML=markitdown (new dep, lazy) / hwp·hwpx=LibreOffice headless→HTML→markdownify (existing dep). Failure · empty output · timeout · missing dep → raise OfficeMdError (success + empty md forbidden = the converter contract of the C-5 postcondition). - scripts/poc_office_md.py: harness for measuring table fidelity. E-1 = run in a safe context with the prod LibreOffice version pinned (the hwpx filter is version-dependent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-08 03:05:30 +00:00
hyungi	fc9e0f1d8f	feat(search): hier backfill --skip-analysis + --doc gate-bypass flags PR-DocSrv-Hier-Replace-Diagnose-1 c2. For correcting eval coverage of small structured documents (statutes, etc.): an explicit --doc list bypasses the DOC_MIN_CHARS=4000 gate, and --skip-analysis skips section analysis (Mac mini) and performs only decomposition + embedding. Prepares for retrieval go/no-go measurement. additive, no effect on in_corpus. The NOT EXISTS hier idempotency guard is kept. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 05:21:00 +00:00
hyungi	ec174fc1e7	ops(hier): default backfill scope to all-except-news Default scope = exclude only the news domain and include everything else (>4000 chars, not yet decomposed). --domains overrides with an allowlist. 50 new candidates (general 29 + programming 13 + engineering 8). Remains additive (in_corpus=false). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 22:51:13 +00:00
hyungi	c2f9dca62d	ops(hier): add section analysis backfill runner Overnight backfill runner for hier decomposition (additive, in_corpus=false) + section analysis (Mac mini gemma-26B BACKGROUND gate). time-box deadline + per-doc commit + idempotent selection (NOT EXISTS). Reuses the section_summary_pilot constants (single PROMPT_VERSION). no silent fallback. Verification: Engineering+Industrial_Safety 245 docs / 6066 section summaries / fail 0 (2026-05-24~25). Container TZ=UTC → take care converting the deadline to KST. Shutdown requires killing the PID inside the container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 22:47:06 +00:00
hyungi	cfadaaffd9	feat(search): hier section per-leaf analysis scaffold (Section-Summary-1 c1) chunk_section_analysis table (migration 286) + ORM model + pilot script. A section-level analysis axis separated from document_chunks (retrieval-hot). domain is inherited, section_type is a section-only role enum, status records skips, source_content_hash detects staleness. script-only (scripts mount, no rebuild needed). LLM 0 dry-run verification = 5225 147 analyze + 17 skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 13:45:30 +00:00
Hyungi Ahn	f60d6e52fc	feat(worker-pool): Registry-1B Pull activation (auth + worker_jobs + 5 endpoints) Completes the worker-pool-policy §B 1B scope. On top of the 1A scaffold (mig 270~274 + 503 stub): - mig 275/276: worker_jobs (status CHECK + user_id=owner) + pending partial index - create_laptop_worker_bot_token + require_worker_user dependency (same shape as voice-memo) - real implementation of the 5 endpoints /internal/worker/{register,heartbeat,claim,result,drain} - /claim FOR UPDATE SKIP LOCKED + 204 with an empty body - /result ownership check (worker_id match, 404) + retry on failed (attempts/max) - on explicit failure, request.result is ignored (DB result stays NULL) - 22 test cases in 7 files The 5 invariants of policy §B.2 are preserved: 0 changes to the voice-memo wrapper, drain advisory, result raw JSONB, ProcessingQueue unchanged, 0 changes to production automatic branching. Consumers (recap context + /jobs/recap + payload 100KB guard) = Registry-1C scope. stale recovery / laptop client / canonical promote = Notebook-Pilot-1 scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 08:54:07 +09:00
hyungi	74876b674c	feat(auth): JWT iat + users.password_changed_at invalidation (PR-Docsrv-JWT-Invalidation-1) The PR-Infra-Sec-1H Phase 0 audit confirmed that DS had no JWT invalidation policy. Password rotation did not invalidate old 365d JWTs (voice-memo-bot, etc.), which hit a hard gate STOP → split out as a prerequisite PR. - migration 269: users.password_changed_at timestamptz NULL (legacy compatible) - create_access_token / create_refresh_token: add iat (int seconds) to the payload - verify_password_changed_at helper: 401 when int(password_changed_at.timestamp()) > int(iat) - get_current_user + refresh_token route: call the verify helper - change_password / setup signup / seed_admin INSERT+UPDATE: update password_changed_at NULL = verification skipped (0 production impact right after the migration). iat verification becomes active only after the first password change. Secures the path to pass the G-token-old hard gate of Sec-1H.	2026-05-17 06:20:46 +00:00
Hyungi Ahn	73734d5585	fix(news): replace the backfill INTERVAL bind with make_interval(days=>:days) asyncpg rejects the implicit int → text conversion in :days || ' days'. Using make_interval allows binding the int as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:40:11 +09:00
Hyungi Ahn	78b8b52a86	fix(news): backfill script sys.path container compatibility (parent.parent / 'app' or parent.parent) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:39:47 +09:00
Hyungi Ahn	08cf676c26	fix(news): add chunk stage enqueue for news documents + 7-day backfill script Root cause of document_chunks.country being 99.9% NULL over the 7-day distribution = news_collector enqueued only summarize + embed and not chunk, so chunk_worker had never run on news documents. The missing summarize key in queue_consumer.next_stages is why the follow-up was never wired. To avoid side effects on non-news summarize flows, one chunk enqueue line each was added explicitly to both the RSS and API paths of news_collector instead of to next_stages. Same policy as embed, inside the days_old <= 30 guard. scripts/news_chunk_country_backfill.py: small per-doc batches, skip failed docs, progress every 50. Bypasses the queue and calls chunk_worker.process directly to control timing. Gate (PR closure): A) chunked_doc_pct > 95%: share of news docs from the last 7 days that have chunks B) country null_pct < 5%: share of news chunks from the last 7 days with NULL country plan: ~/.claude/plans/7-whimsical-crab.md (PR-News-Prep-Layer-1) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:35:53 +09:00
Hyungi Ahn	a6b8dae18e	fix(gpu-health): container_ip() extracts only the document_server network IP ollama belongs to 3 networks (home-gateway-network / document_server / ollama_default), so the range loop concatenated all the IPs. Now explicit via (index .NetworkSettings.Networks "hyungi_document_server_default").IPAddress. The other 4 GPU services are likewise on a single network, so this stays compatible. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 10:02:54 +09:00
Hyungi Ahn	8f4413a38c	fix(gpu-health): unify the tool the scripts call to host curl + container IP curl is not installed in the OCR/STT containers (slim python image). The docker exec curl standard failed in practice with an OCI exec error. Switched to host curl + docker bridge IP (172.20.0.x): this adds no host publish and checks inside the docker network, so the security surface is the same. Avoids the branching that would arise because only reranker has curl while OCR/marker/STT have only python. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:51:59 +09:00
Hyungi Ahn	98ee7dffe2	ops(gpu-health): standardize GPU service health/smoke + synthetic VRAM peak guard PR-GPU-Health-1. An operational-readiness standardization PR (not a model performance improvement). - add OCR /smoke endpoint (160x60 OK PNG in-memory, 200/503 branches, not used by the Docker healthcheck) - add marker /health endpoint (same signature as stt/ocr) - add reranker docker-compose healthcheck (TEI :80/health) - scripts/gpu_service_smoke.sh: standard docker exec check (OCR/STT expose-only) - scripts/gpu_vram_fixture.sh: Mode A sequential + Mode B light overlap + --stress option - tests/load/fixtures/: synthetic ocr_ok.png / sine_30s.wav / lorem_1p.pdf OCR empty-response false negative: root cause = ports not mapped. Decision: ocr-service / stt-service stay expose-only; operational checks use curl inside docker exec as the standard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 09:42:07 +09:00
hyungi	25ee10ac34	feat(scripts): Phase 2 markdown backfill — script + README - scripts/phase2_backfill.py: 5 subcommands - inventory: pending PDFs dry-run CSV with skip forecast - select-canary: stratified 40 sample (seed 20260503) - enqueue: one-shot from sample CSV (--no-dry-run gate) - nightly-enqueue: cron-friendly with disable flag / marker /ready / active-queue threshold (oldest_age stuck guard) / DB pool guards - post-report: final state CSV + 1D baseline comparison MD - evals/markdown/README.md: Phase 2 section appended - plan: ~/.claude/plans/iridescent-gathering-clover.md - depends on Phase 1B handwritten skip `7d0fca2` (marker_worker side guard)	2026-05-10 05:47:20 +00:00
Hyungi Ahn	f2a5c729b7	fix(scripts): marker reprocess SQL: resolve the named-param collision with CAST(:payload AS jsonb) The `::` postfix cast in `:payload::jsonb` collided with the `:` named-param prefix of SQLAlchemy text(), causing an asyncpg syntax error. Found while reprocessing sample doc 3757. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 14:27:17 +09:00
Hyungi Ahn	68fa86ea52	feat(markdown): persist extracted images with auth routes Markdown Canonical Phase 1B.5: persist the images marker was extracting to NAS and wire up DB metadata + an authenticated route + the frontend swap. Key changes: - marker-service /convert response includes a list of base64 images (stays stateless, no NAS write permission) - marker_worker persists to NAS `/documents/extracted_images/{doc_id}/` + UPSERT + DELETE of orphan rows + normalizes md_content refs to the stable `docimg:img_NNN` scheme - authenticated route /api/documents/{id}/images/{key}/raw (Cache-Control private + ETag = content_hash) - frontend MarkdownDoc swaps the docimg ref inside the placeholder card for a real <img> Principles: - image binary = NAS, metadata = Postgres (same pattern as the study section) - image_key is deterministic from the sequence → reconversion is idempotent - can be rolled back with env MARKDOWN_IMAGE_PERSIST=false (the placeholder card fallback remains naturally) The 28 existing marker success documents are not touched in this PR: after deploy + 1 new upload + verification on 5 samples, do a targeted reprocess with scripts/marker_reprocess_existing_success.py. plan: ~/.claude/plans/piped-humming-crystal.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 14:05:41 +09:00
Hyungi Ahn	0362f52130	fix(scripts): filter so Phase 1D enqueue does not reprocess existing_success The Round 2 sample included 5 existing_success docs (anchor doc 4809 + 4 calibration), but a bug made cmd_enqueue ignore sample_source and enqueue all 30. Result: - the 5 existing docs were reprocessed by marker (~25 minutes of marker time wasted) - md_content overwritten with same-quality output → baseline lost - the "before" state of the anchor (doc 4809) disappeared, damaging the comparison anchor for later rounds Fix: - default = enqueue only sample_source == "controlled_backfill" (25 docs) - add --include-existing flag (for when a later Marker tuning round needs to reprocess the anchor) - print states the mode + shows the excluded ids Fixed before the scheduled one-off nightly sweep (23:00 KST). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 16:27:31 +09:00
Hyungi Ahn	b09687d41d	feat(scripts): Phase 1D Round 2: controlled backfill stratification Augments the existing phase1d_pilot.py (simple ai_domain × file_size 3-bucket) with the 4 axes + sample_source split + forced_include from plan ~/.claude/plans/stratified-mingling-otter.md. Limitation of Round 1 (ai_domain × file_size 3-bucket): it reflected only the natural distribution of pending PDFs → known weak spots (handwriting/scans/mixed Korean-Chinese-Japanese OCR) did not make it into the sample. In the 1C visual check, doc 4809 (Note_240805_용접교육 필기) actually showed that pattern, and leaving it to natural selection risks the same case being missed in the next round too. Round 2 design: - 4-axis stratification: doc_type × file_size_band × text_density_band × handwritten_hint - sample_source ∈ {existing_success(5), controlled_backfill(25)} - forced_include doc 4809: known bad anchor. Allows a 1:1 comparison with the reconversion of the same document after the next tuning/alternative. - text_density = LENGTH(extracted_text) / (file_size / 1024) chars/KB, the cleanest single proxy. Verified at both extremes: 0.17 (handwritten 4809) ↔ 94 (born-digital 3759). - script_mix proxy: Hangul/CJK/Hiragana/Katakana/Latin Unicode block ratio → korean_dominant / mixed_korean_cjk / mixed_korean_latin / cjk_dominant / latin_dominant / unknown. - page_count_estimate: existing_success uses md_extraction_quality.metrics.source_page_count. controlled_backfill is NULL (marker reads it again with PyMuPDF anyway). - seed fixed at SAMPLE_SEED=20260502, ensuring reproducibility. Sample distribution (measured 2026-05-02): bucket_label: born_digital=12, mixed=5, existing_calibration=4, handwritten=3, scan_likely=3, large=2, existing_anchor=1 doc_type: Academic_Paper=7, study_note=6, Standard=5, Note=4, Reference=3, Manual=3, Drawing=1, Report=1 file_size_band: M=14, S=12, L=4 text_density_band: born-digital=15, scan-likely=9, mixed=6 handwritten_hint: lo=26, hi=4 (13x over-sample versus 1.1% of the population) forced anchor doc 4809 = density 0.17 (the very document from the user's visual check) New subcommand: eval_template: pilot_1d_eval.csv skeleton (rubric 5 axes 1~5 + overall_pass + notes). The user fills in scores while comparing with the MarkdownDoc + PDF toggle. The existing cmd_enqueue (snapshot/backup/dedup) + cmd_report (quality metrics) are kept. Deliverables: scripts/phase1d_pilot.py: 4 axes + sample_source + forced_include + eval_template subcommand. CSV+JSON dual output. evals/markdown/README.md: rubric + decision matrix + workflow guide. evals/markdown/pilot_1d_sample.csv: 30 rows × 15 cols (seeded result, reproducibility preserved). evals/markdown/pilot_1d_eval.csv: empty skeleton (filled in after user evaluation). Execution boundary: Steps 1~3 (selection / template / dry-run) = completed by this PR. Step 4 (--yes enqueue, actually feeding 30 docs into the markdown queue) = run separately, with user approval of timing, within the one-off nightly sweep window (23:00~03:00 KST). marker-service BATCH_SIZE=1, 30 docs at an average of 5 min/doc ≈ 2.5h. Verify: ran select in the fastapi container on the GPU server → 30-doc sample CSV generated. Confirmed the eval_template subcommand works. Confirmed with an enqueue dry-run that it prints 30 doc_ids + snapshot and then takes the user-cancel branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 16:15:09 +09:00
Hyungi Ahn	6b52d57bac	feat(study): Phase 4-A explanation_md length cap + stronger prompt In production data, explanations marked ready were 793/838/866 chars, large compared with the recommended 200~400. After the first production run, strengthen the prompt + cap before saving to keep the results screen readable and control token usage. Prompt (study_explanation_envelope.txt): - explanation_md: states recommended 300~600 chars, maximum 900 chars - only the core concept + basis for the correct answer + the 1~2 most confusing wrong answers; do not explain every wrong answer - minimize line breaks inside explanation_md (combined with the parse_json fix: fewer invalid escapes) - go easy on LaTeX math: where possible write the \\circ/\\text/\\, macros as plain text ('0°C', 'C') - emphasizes that the output is a single raw JSON object only: no code fences/thinking/meta Worker (study_explanation_worker.py): - new helper _cap_explanation_md(text, max_chars=1200) · passthrough at 1200 chars or fewer · when over, search the last 200 chars for a \\n\\n / \\n / '. ' / '다.' / '요.' boundary · cut at the boundary + '…' (avoids cutting mid-word) · if no boundary is found, plain cut + '…' - cap applied before save. ai_explanation_status='ready' is kept (being capped does not mean failed) - operational-analysis metadata in the payload: explanation_len_original / _saved / capped flag Verification: - tests/test_explanation_cap.py (6 cases) · short passthrough / exact at limit / paragraph boundary / sentence boundary · no boundary fallback / empty input - scripts/phase4_health.sql sections 8/9 added · ai_explanation length p50/p95/max (study_questions.ready) · cap trigger frequency (explanation_capped/_original/_saved in job.payload) cap 1200 = more headroom than 800 (4-B summary_md): certification-exam explanations that bundle formula + wrong answers + concept are tight at 800. Consider adjusting to 800~1000 after running in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 08:33:18 +09:00
Hyungi Ahn	8074be6b6d	feat(study): Phase 4-D production observation + confidence calibration The first validation of Phase 4-B v1 found cases where the model assigned confidence='high' even for topics with insufficient material. This is more overconfident than the definition (high = pattern is clear from material + other ai_explanation) and is a UX trust risk. Adds automatic cap correction + production observation SQL. confidence calibration (services/study/session_summary_guard): - new calibrate_confidence(c, ctx_docs_count, ready_explanation_count) · ctx_docs_count == 0 AND ready_explanation_count == 0 → cap at 'low' · ctx_docs_count == 0 (only ready present) → cap at 'medium' · ctx_docs_count >= 1 → model value as-is - when the model gave a value more conservative than the cap (model 'low' + cap 'medium'), it is preserved: a more conservative value is never raised worker application (study_session_analysis_worker): - ctx_docs_count = len(ctx_docs) - ready_explanation_count = sum(1 for a in prompt_attempts if a.get('ai_explanation')) - call calibrate_confidence → stored in study_quiz_session_analysis.confidence - operational-analysis metadata kept in job.payload: · ctx_docs_count / ready_explanation_count · model_confidence_raw (model response) vs calibrated_confidence (after cap) · prompt_attempts / valid_attempts_total / summary_len → SQL query 4 measures how often the cap triggers scripts/phase4_health.sql (new operational check SQL, 7 sections): 1. 4-A study_question_jobs status × error_code distribution 2. 4-B study_quiz_session_jobs status × error_code distribution 3. 4-B confidence distribution (calibrated) 4. 4-B model_confidence_raw vs calibrated difference (cap trigger frequency) 5. 4-A/4-B processing latency over the last 7 days p50/p95/max/avg 6. 4-A/4-B skipped reason distribution 7. 4-B guard_fail / parse_fail / llm_timeout ratio ship gate (unit tests): - test_calibrate_confidence_no_evidence_caps_to_low (3 cases) - test_calibrate_confidence_only_explanations_caps_to_medium (3 cases) - test_calibrate_confidence_with_documents_passthrough (3 cases) - test_calibrate_confidence_normalizes_invalid_first (2 cases) Plan: ~/.claude/plans/nifty-sparking-spindle.md (Phase 4-B v1 follow-up) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 07:33:57 +09:00
Hyungi Ahn	7cab78e490	ops(canonical): backup + targets + md_status snapshot before Phase 1D enqueue Leaves 3 traces right before enqueue starts: (1) a timestamped copy of /tmp/phase1d_pilot.json (in case of a re-run) (2) a one-line printout of the 30 target document_ids (3) a snapshot of the documents.md_status distribution saved as JSON Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:00:33 +09:00
Hyungi Ahn	3e831a2dc7	fix(canonical): Phase 1D script sys.path: /app/scripts/.. is the PYTHONPATH root The fastapi container has WORKDIR=/app with the code unpacked directly and no app/ directory. The ../app pattern from backfill_category.py becomes /app/app (nonexistent) inside the container and raises ModuleNotFoundError. Put the .. of the script's own directory on sys.path to expose the /app root. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 09:50:23 +09:00
Hyungi Ahn	f98cf2e505	ops(canonical): Phase 1D marker pilot one-shot script (select/enqueue/report) Stratified pilot limited to 30 docs. Measure baseline markdown quality, then decide on the full Phase 2 backfill. Not a permanent worker path. Target WHERE: deleted_at IS NULL AND file_format='pdf' AND md_status='pending' AND category='document' AND document_type NOT IN SKIP_DOC_TYPES (consistent with marker_worker) Stratification: ai_domain × file_size_bucket (small<500KB / medium<5MB / large) documents has no page_count column (marker_worker measures it dynamically with PyMuPDF) → file_size is used as a length proxy. Mixing small/large file_size within a cell makes differences between short and long documents observable. Subcommands: select: dry-run of 30 docs + save JSON (/tmp/phase1d_pilot.json) enqueue: enqueue to the markdown queue (skip on uq_queue_active conflict) report: md_status / average elapsed / top 5 failures / heading anchor candidates / KaTeX candidates / success ratio per file_size bucket / UI review URLs Report note: markdown_image_count is 0 as expected because server.py currently discards _images. It activates automatically once _images is output in Phase 1B.5. Run: docker compose exec fastapi python /app/scripts/phase1d_pilot.py select docker compose exec fastapi python /app/scripts/phase1d_pilot.py enqueue --yes docker compose exec fastapi python /app/scripts/phase1d_pilot.py report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 09:49:17 +09:00
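Hyungi Ahn	5404343a1a	fix(study): HC-5 block math spacing: auto-fix that guarantees blank lines before and after KaTeX $$...$$ Problem: when a $$ ... $$ block math in choice/explanation text has no blank lines around it, the markdown parser groups it into the same paragraph as the label/text and KaTeX rendering fails → shown raw. Production result (21 exam rounds = 2,100 questions): - all 317 HC-5 detections auto-fixed. Re-check of every round: 0. - additional fix: q1579 (2023 round 1 q81) bimetal ASCII diagram wrapped in a fence. Algorithm: - detect own-line $$...$$ (opens and closes within one line, length 4+). - if the preceding/following line is not empty, insert a blank line (idempotent). - inline $ ... $ is unaffected. - 0 semantic change (blank-line insertion only; body text/math preserved). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 08:29:39 +09:00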
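
The algorithm listed in commit 5404343a1a (HC-5 block math spacing) could be sketched roughly as below. This is an illustration only, not the repository's code: the function name ensure_block_math_spacing is hypothetical, and treating "own-line" as "the stripped line starts and ends with $$" is an assumption.

```python
def ensure_block_math_spacing(text: str) -> str:
    """Insert blank lines around own-line $$...$$ blocks (hypothetical helper).

    Assumption: a block is a single line that, once stripped, is at least
    4 characters long and both starts and ends with "$$".
    """
    lines = text.split("\n")
    out: list[str] = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        is_block = (
            len(stripped) >= 4
            and stripped.startswith("$$")
            and stripped.endswith("$$")
        )
        # Blank line before the block if the previous emitted line has content.
        if is_block and out and out[-1].strip() != "":
            out.append("")
        out.append(line)
        # Blank line after the block if the next source line has content.
        if is_block and i + 1 < len(lines) and lines[i + 1].strip() != "":
            out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Answer:\n$$E = mc^2$$\nwhere $m$ is mass."
    print(ensure_block_math_spacing(sample))
```

Running the function on its own output adds nothing further, and lines containing only inline $ ... $ math are left untouched, which matches the idempotency and no-semantic-change points in the commit message.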


87 Commits
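
The confidence cap rules described in commit 8074be6b6d could look roughly like the sketch below. The function name and parameters come from the commit message; the body is a guess at the logic, not the repository's code. In particular, mapping an invalid model value to 'low' is an assumption (the commit only says invalid values are normalized first).

```python
# Ordering used to compare confidence levels (assumed three-level scale).
_ORDER = {"low": 0, "medium": 1, "high": 2}


def calibrate_confidence(c, ctx_docs_count: int, ready_explanation_count: int) -> str:
    """Cap the model-reported confidence by the amount of available evidence."""
    value = c.strip().lower() if isinstance(c, str) else ""
    if value not in _ORDER:
        value = "low"  # assumption: invalid input normalizes to the most conservative level

    # With at least one context document the model value passes through.
    if ctx_docs_count >= 1:
        return value

    # No documents: cap at 'medium' if ready explanations exist, else 'low'.
    cap = "medium" if ready_explanation_count > 0 else "low"

    # Only ever lower the value; a more conservative model value is kept.
    return value if _ORDER[value] <= _ORDER[cap] else cap


if __name__ == "__main__":
    print(calibrate_confidence("high", 0, 0))
    print(calibrate_confidence("high", 0, 3))
    print(calibrate_confidence("low", 0, 3))
    print(calibrate_confidence("high", 2, 0))
```

Comparing against an explicit ordering keeps the "never raise a more conservative value" rule in one place instead of spreading it across branches.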