hyungi_document_server

Author	SHA1	Message	Date
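hyungi	690b22fe58	fix(hardening): remove collect-lock TOCTOU (R9) + tier_backfill f-string allowlist (R12) - news.collect: the locked() check ran in the handler but the actual acquire happened inside a separate task, so another request could slip in between and spawn a duplicate collection task (TOCTOU). Made atomic by acquiring synchronously in the handler and releasing in the task's finally block. - tier_backfill._enqueue_domain: filter_clause is interpolated directly into SQL but had no allowlist guard (asymmetric with the canonical _VALID_DOCS_TABLE in retrieval_service). Added a final allowlist gate sourced from DOMAIN_PRIORITY; the value is currently a module constant, so there is no injection risk today, but the gate blocks it immediately if it ever becomes external input. Verification: py_compile passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-16 14:07:07 +09:00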
Hyungi Ahn	95bcdb851b	fix(ops): exclude empty extracted_text from backfill queries to prevent infinite retries. After 3 days in operation, docs 4811 and 5181 had extracted_text='' (empty string), but the query only checked IS NOT NULL, so they were enqueued → classify_worker's `not doc.extracted_text` truthiness check raised ValueError → max_attempts(3) reached → status=failed. The next backfill cycle enqueued them again; this repeated 12 times and accumulated 24 failed rows. Fix: add LENGTH(extracted_text) > 0 to the SQL in both tier_backfill.py and backfill_tier.py, so documents with an empty string are excluded at enqueue time. Cleanup SQL for the existing 24 failed rows (to be run manually by the user): DELETE FROM processing_queue WHERE stage='classify' AND status='failed' AND error_message LIKE '%extracted_text%'; Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 08:25:12 +09:00
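Hyungi Ahn	c6335c9a1e	fix(classify): restore law_monitor skip branch + exclude law from tier_backfill. During the PR-B refactor, the source_channel='law_monitor' skip branch at the entry of process() (from `e88640d`) was lost, so every new law split at 07:00 each day was invoking all of 26B legacy classify (8s) + 26B legacy summarize (10s) + 4B triage (1.5s). The skip branch is restored at the entry of process() to match the premise of the law-separation PR (stateless-churning-raccoon): "laws have an external source of truth, are immutable, and are re-collected automatically → a different lifecycle". It sets only the minimal fields (ai_domain='법령', ai_tags=['법령'], importance='medium'; '법령' means "law") and returns. queue_consumer's NEXT_STAGES['classify']=['embed','chunk'] chains automatically, so search is unaffected. Value analysis of AI outputs for the law domain: - ai_summary: risk of hallucinated legal interpretation (potential accident liability for ASME/safety engineers) - ai_tldr/bullets: the title already exposes the same information, so they are redundant - ai_inconsistencies: these are officially consistent documents, so 100% false positives → negative value relative to cost (~14 minutes of 26B occupancy per month), so skipping is justified. tier_backfill.py is modified as well: - removed the ('law', source_channel='law_monitor') entry from DOMAIN_PRIORITY - added source_channel != 'law_monitor' to the safety filter (blocks the case where law docs already processed by backfill are caught by the existing ai_domain LIKE 'Industrial_Safety%' match) - Reason: enqueueing docs that will be skipped causes a nightly enqueue→skip→NULL→enqueue infinite loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 07:35:27 +09:00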
Hyungi Ahn	a95294ff42	feat(ops): nightly auto tier backfill scheduler (resolves PR-B legacy). Automatically processes 6720 legacy documents at night with tier triage + deep_summary. app/workers/tier_backfill.py (new): - APScheduler trigger every 30 minutes; it only actually enqueues during KST 00:00~06:00. - Re-enqueues 25 documents at a time into the classify queue in priority order safety > law > manual. - Skips the run when 40 or more items are pending in the classify queue, to protect MLX from overload. - drive_sync / memo / news are excluded (out of plan scope or low value). - off-switch: settings.ai.tier_backfill.enabled = false stops everything. app/main.py lifespan: - scheduler.add_job(tier_backfill_run, interval=30min, id='tier_backfill'). - AsyncIOScheduler is already configured with timezone='Asia/Seoul', consistent with zoneinfo('Asia/Seoul') inside tier_backfill. Expected numbers: 6 night hours × 2 runs/hour × 25 docs = 150 docs/night. 6720 / 150 = about 45 days to work through the entire legacy set. MLX load control is the main concern; this is a redundant safeguard alongside the R2 backlog guard. If overload is detected in operation: adding `ai.tier_backfill.enabled: false` to config.yaml stops it immediately (no restart needed; the scheduler checks on every run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 15:28:28 +09:00
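
The gating rules described in a95294ff42 (night window, backlog guard, off-switch) can be sketched roughly as follows. This is an illustration only, not the code in app/workers/tier_backfill.py: the names `should_enqueue` and `nightly_capacity` are hypothetical, the window is assumed to be half-open (06:00 excluded), and a fixed UTC+9 offset stands in for the zoneinfo('Asia/Seoul') lookup mentioned in the message.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+9 offset stands in for zoneinfo('Asia/Seoul').
KST = timezone(timedelta(hours=9))

NIGHT_START_HOUR = 0   # 00:00 KST, per the commit message
NIGHT_END_HOUR = 6     # 06:00 KST, treated as exclusive here (assumption)
BATCH_SIZE = 25        # documents re-enqueued per run
BACKLOG_LIMIT = 40     # skip when this many or more classify jobs are pending


def should_enqueue(now: datetime, classify_backlog: int, enabled: bool = True) -> bool:
    """Return True only if every guard from the commit message passes."""
    if not enabled:                      # off-switch, checked on every run
        return False
    local = now.astimezone(KST)
    if not (NIGHT_START_HOUR <= local.hour < NIGHT_END_HOUR):
        return False                     # outside the night window
    return classify_backlog < BACKLOG_LIMIT


def nightly_capacity(hours: int = 6, runs_per_hour: int = 2, batch: int = BATCH_SIZE) -> int:
    """Upper bound on documents enqueued per night for the given parameters."""
    return hours * runs_per_hour * batch
```

With the parameters as stated (6 hours, 2 runs per hour, 25 documents per run), `nightly_capacity()` evaluates to 300, so the 150 docs/night and roughly 45 days quoted in the message look like a conservative estimate, for example if about half of the runs are skipped by the backlog guard.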

4 Commits
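
The collect-lock fix in 690b22fe58 (acquire in the handler, release in the task's finally block) can be sketched as below. This is a minimal illustration under assumptions, not the project's news.collect handler: `handle_collect`, `_collect`, `collect_lock`, and `_tasks` are hypothetical names, and the sketch relies on an uncontended `asyncio.Lock.acquire()` completing without yielding to the event loop, so no other request can run between the `locked()` check and the acquire.

```python
import asyncio

collect_lock = asyncio.Lock()          # hypothetical module-level lock
_tasks: set = set()                    # keeps references to background tasks


async def _collect(results: list) -> None:
    """Stand-in for the collection job; releases the lock it was handed."""
    try:
        await asyncio.sleep(0)         # placeholder for real collection work
        results.append("collected")
    finally:
        collect_lock.release()         # always release, even on failure


async def handle_collect(results: list) -> str:
    """Request handler: check and acquire before spawning the task."""
    if collect_lock.locked():
        return "already running"
    # No suspension point between the check above and this acquire, so the
    # lock is already held by the time any other handler can run.
    await collect_lock.acquire()
    task = asyncio.create_task(_collect(results))
    _tasks.add(task)
    task.add_done_callback(_tasks.discard)
    return "started"


async def main():
    results: list = []
    first = await handle_collect(results)
    second = await handle_collect(results)   # lock still held by first job
    await asyncio.gather(*list(_tasks))      # let the first job finish
    third = await handle_collect(results)
    await asyncio.gather(*list(_tasks))
    return first, second, third, results
```

Running `asyncio.run(main())` shows the second request being rejected while the first job holds the lock. In the pre-fix pattern the acquire would sit inside `_collect`, so two handlers could both pass the `locked()` check before either task started.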