feat(ingest): enable Phase 1 ingest for the devonagent track
Web pages discovered by DEVONagent/DEVONthink are dropped into NAS Web/ → file_watcher
ingest → extract with a 4-tier fallback (trafilatura/sibling-md/readability/bs4)
→ embed + chunk. classify/preview/markdown are SKIPPED.
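The 4-tier fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in extract_worker: the `Extractor` type, `extract_with_fallback`, and `sibling_md_extractor` are hypothetical names, and the real tiers (trafilatura, readability+markdownify, bs4) are passed in as plain callables so the sketch stays dependency-free. Only the tier order and the 200-character threshold come from the commit message.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: an extractor takes raw HTML and returns text/markdown, or None.
Extractor = Callable[[str], Optional[str]]

# Threshold taken from the commit message (sibling .md >= 200 chars).
MIN_SIBLING_MD_CHARS = 200


def extract_with_fallback(
    html: str,
    tiers: Sequence[Tuple[str, Extractor]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (engine_name, extractor) in order; return the first non-empty result.

    The returned engine name is what a field like md_extraction_engine could record.
    """
    for engine, extractor in tiers:
        try:
            text = extractor(html)
        except Exception:
            # Assumption: a failing tier falls through to the next one.
            continue
        if text and text.strip():
            return engine, text
    return None, None


def sibling_md_extractor(sibling_md: Optional[str]) -> Extractor:
    """Wrap an already-read sibling .md body; accept it only when long enough."""
    def _extract(_html: str) -> Optional[str]:
        if sibling_md is not None and len(sibling_md.strip()) >= MIN_SIBLING_MD_CHARS:
            return sibling_md
        return None
    return _extract
```

Recording the winning engine name alongside the text is what would make the "engine distribution" closure metric measurable later.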
- source_channel='devonagent' (activates the value left dormant in migration 001)
- file_watcher: unified SCAN_TARGETS + Web/ rglob + canonical_url dedup +
missing-sidecar policy (do not skip; set the web_meta.sidecar_missing=true flag)
- extract_worker: HTML+devonagent branch + md_extraction_engine distinguishes the 4 tiers
(trafilatura → sibling .md ≥200 chars → readability+markdownify → bs4_text)
- queue_consumer: in enqueue_next_stage, only the extract stage gets a
source_channel-aware override (devonagent → [embed, chunk])
- classify_worker: devonagent safety skip (mirrors the law_monitor pattern;
ai_domain='Web', ai_tags=['Web/{host}'])
- requirements: add trafilatura/readability-lxml/markdownify
- docs: devonthink-web-bridge.md install guide + first-wins policy documented
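One way to picture the file_watcher rules listed above (canonical_url dedup, first-wins, and flagging rather than skipping files without a sidecar) is the sketch below. Everything in it is an assumption for illustration: the function names, the candidate dict shape, the idea that the sidecar supplies the source URL, and the normalization rules (lowercase host, drop fragment, strip trailing slash) are not taken from the actual code.

```python
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop fragment, strip trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def plan_ingest(candidates: Iterable[dict]) -> List[dict]:
    """First-wins dedup on canonical_url; files without a sidecar URL are flagged, not skipped.

    Each candidate is a hypothetical dict: {"path": str, "url": Optional[str]},
    where "url" is None when no sidecar was found.
    """
    seen: Dict[str, str] = {}  # canonical_url -> path of the first file that claimed it
    accepted: List[dict] = []
    for cand in candidates:
        url: Optional[str] = cand.get("url")
        web_meta: dict = {"sidecar_missing": url is None}
        if url is not None:
            canonical = canonicalize_url(url)
            if canonical in seen:
                # first-wins: the earlier file keeps the slot, later duplicates are dropped
                continue
            seen[canonical] = cand["path"]
            web_meta["canonical_url"] = canonical
        accepted.append(
            {"path": cand["path"], "source_channel": "devonagent", "web_meta": web_meta}
        )
    return accepted
```

In this sketch a file with no sidecar cannot be deduplicated at all, which is one plausible reason to surface `sidecar_missing` as a flag for later review.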
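Phase 1 closure criterion = quality of the raw material (searchable + noise rate + dedup + engine distribution).
Downstream uses (ai_tldr/digest/PKM retrospective) will be decided in a separate PR after 1-2 weeks OR 30-50 documents of observation.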
Plan: ~/.claude/plans/db-snuggly-petal.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -103,13 +103,36 @@ async def enqueue_next_stage(document_id: int, current_stage: str):
    §3 additions:
    stt → [classify] (audio skips extract; stt fills extracted_text)
    thumbnail → [] (video is a leaf: no classify/embed)

    Web/Blog ingest (devonagent track), plan db-snuggly-petal.md:
    When extract completes for a doc with source_channel='devonagent',
    classify/preview/markdown are all SKIPPED → only [embed, chunk] are enqueued.
    AI processing (ai_tldr/ai_bullets, etc.) goes in a separate PR (Mac mini derived-worker).
    """
    # source_channel-aware override (extract stage only). If source_channel is missing, use _default.
    extract_override_by_channel = {
        "devonagent": ["embed", "chunk"],
    }

    next_stages = {
        "extract": ["classify", "preview"],
        "classify": ["embed", "chunk", "markdown"],
        "stt": ["classify"],
    }
    stages = next_stages.get(current_stage, [])

    # Only for the extract stage: look up doc.source_channel and apply the override
    if current_stage == "extract":
        from models.document import Document
        async with async_session() as lookup_session:
            doc = await lookup_session.get(Document, document_id)
            sc = doc.source_channel if doc else None
        if sc in extract_override_by_channel:
            stages = extract_override_by_channel[sc]
        else:
            stages = next_stages.get(current_stage, [])
    else:
        stages = next_stages.get(current_stage, [])

    if not stages:
        return
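
The routing decision in this hunk can be isolated from the database lookup, which makes it easy to check by hand. The sketch below is a hypothetical pure-function restatement, assuming the caller has already fetched `source_channel`; `resolve_next_stages` and the upper-case constant names are illustrative and do not appear in the commit.

```python
from typing import Dict, List, Optional

# Default stage graph, copied from the hunk above.
NEXT_STAGES: Dict[str, List[str]] = {
    "extract": ["classify", "preview"],
    "classify": ["embed", "chunk", "markdown"],
    "stt": ["classify"],
}

# Per-channel override, applied only when the finished stage is "extract".
EXTRACT_OVERRIDE_BY_CHANNEL: Dict[str, List[str]] = {
    "devonagent": ["embed", "chunk"],
}


def resolve_next_stages(current_stage: str, source_channel: Optional[str]) -> List[str]:
    """Return the stages to enqueue after current_stage for a document's channel."""
    if current_stage == "extract" and source_channel in EXTRACT_OVERRIDE_BY_CHANNEL:
        return list(EXTRACT_OVERRIDE_BY_CHANNEL[source_channel])
    # Unknown channels, a missing channel, and non-extract stages use the default graph.
    return list(NEXT_STAGES.get(current_stage, []))
```

Keeping the override scoped to the extract stage means a devonagent document that somehow reaches classify would still follow the default graph, which is why the commit also adds a separate safety skip in classify_worker.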