fix(classify): restore law_monitor skip branch + exclude law from tier_backfill

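During the PR-B refactor, the source_channel='law_monitor' skip branch that e88640d added at the entry of process() was lost. As a result, every new law split collected at 07:00 each day triggered all three calls: 26B legacy classify (8s) + 26B legacy summarize (10s) + 4B triage (1.5s).

Restore the skip branch at the entry of process() so it matches the premise of the law-separation PR (stateless-churning-raccoon): "laws have an external source of truth, are immutable, and are re-collected automatically → a different lifecycle". Only the minimal fields (ai_domain='법령', ai_tags=['법령'], importance='medium') are set before returning. Because queue_consumer's NEXT_STAGES['classify']=['embed','chunk'] chains the next stages automatically, search is unaffected.

Value analysis of AI outputs for the law domain:
- ai_summary: risk of hallucinated legal interpretation (potential accident liability for ASME/safety engineers)
- ai_tldr/bullets: the title already exposes the same information, so they are redundant
- ai_inconsistencies: these are official, internally consistent documents, so every hit is a false positive
→ Value is negative relative to the cost (~14 min/month of 26B occupancy), so skipping is justified.

tier_backfill.py is updated as well:
- Remove the ('law', source_channel='law_monitor') entry from DOMAIN_PRIORITY
- Add source_channel != 'law_monitor' to the safety filter (blocks the case where law docs already processed by backfill are caught by the existing ai_domain LIKE 'Industrial_Safety%' match)
- Reason: enqueuing docs that will be skipped creates an infinite nightly loop of enqueue→skip→NULL→enqueue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>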
2026-04-27 07:35:27 +09:00
parent 8427ac886c
commit c6335c9a1e
2 changed files with 25 additions and 4 deletions
@@ -309,10 +309,28 @@ async def process(document_id: int, session: AsyncSession) -> None:
    1) Legacy: classify() → ai_domain/document_type/ai_tags/ai_confidence/ai_suggestion
    2) Legacy: summarize() → ai_summary
    3) PR-B B-1: summary_triage (4B) → ai_tldr/ai_bullets/ai_analysis_tier='triage'
+
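+    Exception (source_channel='law_monitor'):
+      Laws have an external source of truth (law.go.kr), are immutable, and are re-collected automatically.
+      AI classification adds no value and risks hallucinated interpretation of the text. Skip the 26B legacy calls and the 4B triage entirely.
+      Set only the minimal fields, then return → queue_consumer chains embed/chunk automatically.
+      See: feedback_category_vs_ai_domain_axis.md, plan stateless-churning-raccoon.md.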
    """
    doc = await session.get(Document, document_id)
    if not doc:
        raise ValueError(f"Document ID {document_id} not found")
+
+    if doc.source_channel == "law_monitor":
+        if not doc.ai_domain:
+            doc.ai_domain = "법령"
+        if not doc.ai_tags:
+            doc.ai_tags = ["법령"]
+        if not doc.importance:
+            doc.importance = "medium"
+        await session.commit()
+        logger.info(f"doc {document_id}: law_monitor → classify skip")
+        return
+
    if not doc.extracted_text:
        raise ValueError(f"Document ID {document_id}: extracted_text is empty")

@@ -4,8 +4,11 @@ plan: ~/.claude/plans/swirling-swimming-liskov.md - long-term backfill operation.

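 Triggered every 30 minutes (actually enqueues only during KST 00:00~06:00):
  1. Re-enqueue 25 NULL documents per priority domain into the classify queue
-  2. Priority: safety > law > manual
-     (drive_sync / memo / news are decided separately; excluded from this scheduler)
+  2. Priority: safety > manual
+     (drive_sync / memo / news / law_monitor are excluded from this scheduler)
+     - news/memo: domain is already fixed, so classify adds no value (legacy decision)
+     - law_monitor: classify_worker skips it on entry (plan stateless-churning-raccoon.md).
+       Enqueuing from backfill would only repeat the skip, so it is excluded from the start.
  3. Skip when the classify queue is already large (protects MLX from overload)

 Reason: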
@@ -43,9 +46,9 @@ BATCH_SIZE = 25
 QUEUE_SKIP_THRESHOLD = 40

 # Priority domains (candidates of the first entry are consumed first)
+# law_monitor excluded: classify_worker skips it on entry, which prevents an infinite backfill loop.
 DOMAIN_PRIORITY: list[tuple[str, str]] = [
-    ("safety", "ai_domain LIKE 'Industrial_Safety%'"),
-    ("law",    "source_channel = 'law_monitor'"),
+    ("safety", "ai_domain LIKE 'Industrial_Safety%' AND source_channel != 'law_monitor'"),
    ("manual", "source_channel = 'manual'"),
 ]
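
The enqueue→skip→NULL→enqueue loop described in the commit message can be reproduced with a small, self-contained simulation. This is only a sketch under stated assumptions, not the project's code: the `Doc` class, `classify`, `nightly_backfill`, and `simulate` are hypothetical stand-ins, and `ai_analysis_tier` is assumed to be the NULL column the backfill scans (the diff does not show which column it checks).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: int
    source_channel: str
    ai_domain: Optional[str] = None
    # Assumption: this is the column the backfill treats as "NULL = needs classify".
    ai_analysis_tier: Optional[str] = None


def classify(doc: Doc) -> None:
    """Toy stand-in for the classify worker's process()."""
    if doc.source_channel == "law_monitor":
        # Skip branch: only minimal fields are set, so the tier stays NULL.
        doc.ai_domain = doc.ai_domain or "법령"
        return
    doc.ai_domain = doc.ai_domain or "General"
    doc.ai_analysis_tier = "triage"


def nightly_backfill(docs: list[Doc], exclude_law: bool) -> list[int]:
    """Enqueue docs whose tier is still NULL, process them, return the enqueued IDs."""
    picked = [
        d for d in docs
        if d.ai_analysis_tier is None
        and not (exclude_law and d.source_channel == "law_monitor")
    ]
    for d in picked:
        classify(d)
    return [d.doc_id for d in picked]


def simulate(nights: int, exclude_law: bool) -> list[list[int]]:
    """Run the backfill for several nights over one law doc and one manual doc."""
    docs = [Doc(1, "law_monitor"), Doc(2, "manual")]
    return [nightly_backfill(docs, exclude_law) for _ in range(nights)]


if __name__ == "__main__":
    print(simulate(3, exclude_law=False))  # law doc 1 is re-enqueued every night
    print(simulate(3, exclude_law=True))   # law doc 1 is never enqueued
```

Without the exclusion, the law document is picked again on every run because the skip branch never fills the scanned column. With the exclusion, the queue drains after the first night.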