feat(ingest): enable Phase 1 ingest for the devonagent track

Covers the path for web pages discovered by DEVONagent/DEVONthink: NAS Web/ drop → file_watcher ingest → extract with a 4-tier fallback (trafilatura / sibling .md / readability / bs4) → embed + chunk. classify, preview, and markdown are SKIPPED.

- source_channel='devonagent' (activates the value left dormant in migration 001)
- file_watcher: SCAN_TARGETS integration + Web/ rglob + canonical_url dedup + missing-sidecar policy (no skip; sets the web_meta.sidecar_missing=true flag)
- extract_worker: HTML + devonagent branch; md_extraction_engine distinguishes the 4 tiers (trafilatura → sibling .md ≥200 chars → readability+markdownify → bs4_text)
- queue_consumer: source_channel-aware override in enqueue_next_stage, applied to the extract stage only (devonagent → [embed, chunk])
- classify_worker: devonagent safety skip (mirrors the law_monitor pattern; ai_domain='Web', ai_tags=['Web/{host}'])
- requirements: add trafilatura, readability-lxml, markdownify
- docs: devonthink-web-bridge.md install guide + explicit first-wins policy

Phase 1 closure criterion = material quality (searchability, noise rate, dedup, engine distribution). Downstream uses (ai_tldr / digest / PKM retrospective) will be decided in a separate PR after 1-2 weeks OR 30-50 documents of observation.

Plan: ~/.claude/plans/db-snuggly-petal.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 21:17:08 +09:00
parent 118f32f9b1
commit 0cbba0ceeb
6 changed files with 601 additions and 8 deletions
@@ -17,3 +17,7 @@ python-multipart>=0.0.9
 jinja2>=3.1.0
 feedparser>=6.0.0
 pymupdf>=1.24.0
+# Web/Blog ingest (devonagent track): 4-tier fallback for HTML body cleanup
+trafilatura>=1.12.0
+readability-lxml>=0.8.1
+markdownify>=0.13.1
@@ -373,6 +373,22 @@ async def process(document_id: int, session: AsyncSession) -> None:
        logger.info(f"doc {document_id}: law_monitor → classify skip")
        return

+    # Web/Blog ingest (devonagent track), plan db-snuggly-petal.md
+    # The queue_consumer override already skips classify; this is a safety net for
+    # bypass paths (e.g. a manual enqueue). LLM processing such as ai_tldr/ai_bullets
+    # is left to a separate PR (Mac mini derived-worker).
+    if doc.source_channel == "devonagent":
+        from urllib.parse import urlparse
+        if not doc.ai_domain:
+            doc.ai_domain = "Web"
+        if not doc.ai_tags:
+            host = (urlparse(doc.edit_url or "").hostname or "web").lower()
+            doc.ai_tags = [f"Web/{host}"]
+        if not doc.importance:
+            doc.importance = "medium"
+        await session.commit()
+        logger.info(f"doc {document_id}: devonagent → classify skip")
+        return
+
    if not doc.extracted_text:
        raise ValueError(f"document ID {document_id}: extracted_text is empty")

@@ -1,5 +1,6 @@
-"""Text extraction worker: kordoc / PyMuPDF / Surya OCR / LibreOffice / direct read"""
+"""Text extraction worker: kordoc / PyMuPDF / Surya OCR / LibreOffice / direct read / web HTML"""

+import hashlib
 import re
 import subprocess
 from datetime import datetime, timezone
@@ -101,6 +102,137 @@ async def _call_ocr(file_path: Path, is_image: bool, max_pages: int = 200) -> st
    return None


+# ─── Web/Blog ingest (devonagent track): HTML → markdown 4-tier ────────────
+
+_WEB_MIN_BODY_LEN = 200  # minimum body length (chars); shorter results fall through to the next tier
+
+
+def _extract_web_with_trafilatura(html: str) -> tuple[str, str | None]:
+    """Extract the body as markdown with trafilatura. Returns (body, engine_version); ("", None) on failure."""
+    try:
+        import trafilatura
+    except ImportError:
+        logger.warning("[web] trafilatura not installed; trying next fallback")
+        return "", None
+    try:
+        body = trafilatura.extract(
+            html,
+            output_format="markdown",
+            include_comments=False,
+            include_tables=True,
+            with_metadata=True,
+            deduplicate=True,
+            favor_precision=True,
+        )
+        return (body or "", getattr(trafilatura, "__version__", "unknown"))
+    except Exception as e:
+        logger.warning(f"[web] trafilatura failed: {e}")
+        return "", None
+
+
+def _extract_web_with_readability(html: str) -> tuple[str, str | None]:
+    """Extract the body with readability-lxml, then convert it to markdown with markdownify."""
+    try:
+        from readability import Document as ReadabilityDocument
+        from markdownify import markdownify
+    except ImportError:
+        logger.warning("[web] readability/markdownify not installed; trying next fallback")
+        return "", None
+    try:
+        rd = ReadabilityDocument(html)
+        body_html = rd.summary() or ""
+        if not body_html:
+            return "", None
+        body_md = markdownify(body_html, heading_style="ATX")
+        return (body_md or "", "readability+markdownify")
+    except Exception as e:
+        logger.warning(f"[web] readability failed: {e}")
+        return "", None
+
+
+def _extract_web_with_bs4(html: str) -> tuple[str, str | None]:
+    """Last-resort fallback: remove script/style tags with BeautifulSoup, then get_text."""
+    try:
+        from bs4 import BeautifulSoup
+    except ImportError:
+        logger.warning("[web] beautifulsoup4 not installed; returning empty body")
+        return "", None
+    try:
+        soup = BeautifulSoup(html, "lxml")
+        for tag in soup(["script", "style", "noscript", "nav", "footer", "aside"]):
+            tag.decompose()
+        text = soup.get_text(" ", strip=True)
+        return (text or "", "bs4_text")
+    except Exception as e:
+        logger.warning(f"[web] bs4 failed: {e}")
+        return "", None
+
+
+async def _extract_web_html(doc: Document, html_path: Path) -> None:
+    """devonagent HTML → markdown 4-tier fallback. Fills all md_* columns."""
+    html_bytes = html_path.read_bytes()
+    html_text = html_bytes.decode("utf-8", errors="replace")
+    src_hash = hashlib.sha256(html_bytes).hexdigest()
+
+    # 1) trafilatura
+    body, engine_ver = _extract_web_with_trafilatura(html_text)
+    engine = "trafilatura" if body and len(body) >= _WEB_MIN_BODY_LEN else None
+
+    # 2) sibling .md (DEVONthink rendered)
+    if not engine:
+        md_path = html_path.with_suffix(".md")
+        if md_path.is_file():
+            try:
+                md_body = md_path.read_text(encoding="utf-8", errors="replace")
+                if md_body and len(md_body) >= _WEB_MIN_BODY_LEN:
+                    body = md_body
+                    engine = "devonthink_export"
+                    engine_ver = "smart_rule"
+            except Exception as e:
+                logger.warning(f"[web] failed to read sibling .md {md_path}: {e}")
+
+    # 3) readability + markdownify
+    if not engine:
+        body2, ver2 = _extract_web_with_readability(html_text)
+        if body2 and len(body2) >= _WEB_MIN_BODY_LEN:
+            body = body2
+            engine = "readability"
+            engine_ver = ver2
+
+    # 4) bs4 get_text (last-resort fallback)
+    if not engine:
+        body3, ver3 = _extract_web_with_bs4(html_text)
+        if body3:
+            body = body3
+            engine = "bs4_text"
+            engine_ver = ver3
+        else:
+            body = ""
+            engine = "empty"
+            engine_ver = None
+
+    clean_body = (body or "").replace("\x00", "")
+    now = datetime.now(timezone.utc)
+
+    doc.extracted_text = clean_body
+    doc.extracted_at = now
+    doc.extractor_version = f"web@{engine}"
+    doc.md_content = clean_body
+    doc.md_status = "ready" if clean_body else "failed"
+    doc.md_extraction_engine = engine
+    doc.md_extraction_engine_version = engine_ver
+    doc.md_format_version = "1.0"
+    doc.md_generated_at = now
+    doc.md_source_hash = src_hash
+    doc.md_content_hash = hashlib.sha256(clean_body.encode("utf-8")).hexdigest()
+    doc.content_origin = "extracted"
+
+    # web_meta inside extract_meta is left exactly as file_watcher wrote it (sourced from the sidecar)
+    logger.info(
+        f"[web/{engine}] {doc.file_path} ({len(clean_body)} chars, engine_ver={engine_ver})"
+    )
+
+
 # ─── Main processing ───

 async def process(document_id: int, session: AsyncSession) -> None:
@@ -112,6 +244,19 @@ async def process(document_id: int, session: AsyncSession) -> None:
    fmt = doc.file_format.lower()
    full_path = Path(settings.nas_mount_path) / doc.file_path

+    # ─── Web/Blog ingest (devonagent track): 4-tier fallback for HTML body cleanup ───
+    # plan: ~/.claude/plans/db-snuggly-petal.md
+    # 1) trafilatura (markdown body)
+    # 2) sibling .md (DEVONthink rendered, >= 200 chars)
+    # 3) readability-lxml + markdownify
+    # 4) BeautifulSoup get_text
+    # md_extraction_engine records which path produced the text, for quality monitoring
+    if fmt == "html" and doc.source_channel == "devonagent":
+        if not full_path.exists():
+            raise FileNotFoundError(f"file not found: {full_path}")
+        await _extract_web_html(doc, full_path)
+        return
+
    # ─── Text files: direct read ───
    if fmt in TEXT_FORMATS:
        if not full_path.exists():
@@ -1,4 +1,4 @@
-"""File watcher worker: scans Inbox/Recordings/Videos and auto-registers new/changed files.
+"""File watcher worker: scans PKM (Inbox/Recordings/Videos) + Web (devonagent) and auto-registers files.

 §3 extensions:
  - Scan targets: PKM/Inbox (documents) + PKM/Recordings (audio) + PKM/Videos (video)
@@ -8,9 +8,19 @@
  - Roon music paths (prefix match) are skipped: settings.roon_library_path
  - Pipeline routing: audio → stage='stt', video direct-play → stage='thumbnail',
    video quarantine → no stage (not processed; the UI reports it as not playable)
+
+Web/Blog ingest (devonagent track, plan db-snuggly-petal.md):
+  - Scan targets: NAS/Web/{domain}/{YYYY-MM-DD}/{slug}.{html,md,json}
+  - The DEVONthink Smart Rule exports three files; only .html enters here (the sidecars are metadata sources)
+  - source_channel='devonagent', dedup = file_hash = sha256(canonical_url)
+  - first-wins policy: re-saving the same canonical_url is not ingested
+  - Missing sidecar (.json): ingest instead of skipping, with web_meta.sidecar_missing=true
 """

+import hashlib
+import json
 from pathlib import Path
+from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

 from sqlalchemy import select

@@ -34,7 +44,14 @@ VIDEO_QUARANTINE_EXTS = {".mov", ".mkv", ".avi"}     # needs conversion, stored only
 # library (externally authored study material) folder: only document extensions such as md/pdf/docx are accepted
 LIBRARY_DOC_EXTS = {".md", ".pdf", ".docx", ".doc", ".txt", ".rtf", ".html", ".odt"}

-# Scan targets: (subpath, expected category); None means general documents (category unspecified)
+# Web ingest: tracking parameters stripped during canonical URL normalization
+TRACKING_PARAMS = {
+    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
+    "fbclid", "gclid", "msclkid", "ref", "ref_src", "ref_url", "mc_cid", "mc_eid",
+}
+
+# Scan targets: (PKM-relative subpath, expected category); None means general documents (category unspecified)
+# Every PKM scan uses source_channel='drive_sync'. The Web track is handled separately (inside watch_inbox).
 SCAN_TARGETS: list[tuple[str, str | None]] = [
    ("Inbox", None),
    ("Recordings", "audio"),
@@ -95,10 +112,109 @@ def _route_media(path: Path, expected_category: str | None) -> tuple[str | None,
    return (None, False, "extract")


+# ─── Web/Blog ingest (devonagent track) helpers ──────────────────────────────────
+
+def _canonicalize_url(url: str) -> str:
+    """Normalize a URL: strip UTM/fbclid parameters, the fragment, and any trailing slash. The real dedup key.
+
+    Always applied before computing file_hash so that utm variants (`?utm_source=foo`)
+    and fragment variants (`#section`) of the same article converge on one row.
+    """
+    if not url:
+        return ""
+    try:
+        p = urlparse(url.strip())
+        clean_qs = [
+            (k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
+            if k.lower() not in TRACKING_PARAMS
+        ]
+        clean_qs.sort()
+        path = p.path.rstrip("/") or "/"
+        netloc = p.netloc.lower()
+        return urlunparse((p.scheme.lower(), netloc, path, "", urlencode(clean_qs), ""))
+    except Exception:
+        return url.strip()
+
+
+def _load_web_sidecar(html_path: Path) -> dict | None:
+    """Read the sibling .json sidecar. Returns None if it is absent or fails to parse."""
+    json_path = html_path.with_suffix(".json")
+    if not json_path.is_file():
+        return None
+    try:
+        return json.loads(json_path.read_text(encoding="utf-8", errors="replace"))
+    except Exception as e:
+        logger.warning(f"[devonagent] sidecar parse failed {json_path}: {e}")
+        return None
+
+
+async def _ingest_web_file(session, file_path: Path, rel_path: str) -> tuple[int, int]:
+    """devonagent track: register one .html as a documents row + extract enqueue.
+
+    - .md/.json are sidecars, so the caller skips them (they never reach this function)
+    - With a sidecar (.json): dedup on canonical_url, richer web_meta
+    - Without a sidecar: still ingested, with web_meta.sidecar_missing=true (avoids silent omission)
+    - first-wins: re-saving the same canonical_url does not ingest the change
+    """
+    sidecar = _load_web_sidecar(file_path)
+    if sidecar and sidecar.get("url"):
+        raw_url = str(sidecar["url"])
+        canonical_url = _canonicalize_url(raw_url)
+        fhash = hashlib.sha256(canonical_url.encode("utf-8")).hexdigest()
+        title = str(sidecar.get("title") or file_path.stem)
+        web_meta = {
+            "raw_url": raw_url,
+            "devonthink_uuid": sidecar.get("devonthink_uuid"),
+            "pub_date": sidecar.get("pub_date"),
+            "author": sidecar.get("author"),
+            "source_agent": sidecar.get("source_agent"),
+        }
+        edit_url = canonical_url
+    else:
+        canonical_url = None
+        fhash = hashlib.sha256(f"NO_URL:{rel_path}".encode("utf-8")).hexdigest()
+        title = file_path.stem
+        web_meta = {"sidecar_missing": True}
+        edit_url = None
+
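+    # devonagent dedup: file_path OR file_hash (URL identity first; absorbs path re-slugs)
+    result = await session.execute(
+        select(Document).where(
+            (Document.file_path == rel_path) | (Document.file_hash == fhash)
+        )
+    )
+    # The OR can match two different rows (one by path, one by hash), so take the
+    # first match rather than scalar_one_or_none(), which raises on multiple rows.
+    existing = result.scalars().first()
+    if existing is not None:
+        # first-wins: changes are not ingested (Phase 1 policy; updates go in a separate PR)
+        return (0, 0)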
+
+    doc = Document(
+        file_path=rel_path,
+        file_hash=fhash,
+        file_format="html",
+        file_size=file_path.stat().st_size,
+        file_type="immutable",
+        title=title,
+        source_channel="devonagent",
+        category="document",
+        data_origin="external",
+        import_source="devonthink",
+        edit_url=edit_url,
+        extract_meta={"web_meta": web_meta},
+    )
+    session.add(doc)
+    await session.flush()
+    await enqueue_stage(session, doc.id, "extract")
+    return (1, 0)
+
+
 async def watch_inbox():
-    """Scan PKM subdirectories, register new/changed files in the DB, and feed them into the pipeline."""
-    pkm_root = Path(settings.nas_mount_path) / "PKM"
-    if not pkm_root.exists():
+    """Scan PKM subdirectories + Web/, register new/changed files in the DB, and feed them into the pipeline."""
+    nas_root = Path(settings.nas_mount_path)
+    pkm_root = nas_root / "PKM"
+    web_root = nas_root / "Web"
+
+    if not pkm_root.exists() and not web_root.exists():
        return

    new_count = 0
@@ -111,6 +227,16 @@ async def watch_inbox():
        targets.append((extra_path, "library"))

    async with async_session() as session:
+        # ─── Web/ track (devonagent): only .html dropped by the DEVONthink Smart Rule enters ───
+        if web_root.exists():
+            for file_path in web_root.rglob("*.html"):
+                if not file_path.is_file() or should_skip(file_path):
+                    continue
+                rel_path = str(file_path.relative_to(nas_root))
+                added, _ = await _ingest_web_file(session, file_path, rel_path)
+                new_count += added
+
+        # ─── PKM track (existing drive_sync) ─────────────────────────────────────────
        for sub, expected_category in targets:
            scan_root = pkm_root / sub
            if not scan_root.exists():
@@ -129,7 +255,7 @@ async def watch_inbox():
                if category is None and next_stage is None:
                    continue

-                rel_path = str(file_path.relative_to(Path(settings.nas_mount_path)))
+                rel_path = str(file_path.relative_to(nas_root))
                fhash = file_hash(file_path)

                result = await session.execute(
@@ -103,13 +103,36 @@ async def enqueue_next_stage(document_id: int, current_stage: str):
    §3 additions:
      stt → [classify]  (audio skips extract; stt fills extracted_text)
      thumbnail → [] (video is a leaf: no classify/embed)
+
+    Web/Blog ingest (devonagent track), plan db-snuggly-petal.md:
+      When extract completes for a doc with source_channel='devonagent',
+      classify/preview/markdown are all SKIPPED → only [embed, chunk] are enqueued.
+      AI processing (ai_tldr/ai_bullets, etc.) goes in a separate PR (Mac mini derived-worker).
    """
+    # source_channel-aware override (extract stage only). Without a matching source_channel, next_stages applies.
+    extract_override_by_channel = {
+        "devonagent": ["embed", "chunk"],
+    }
+
    next_stages = {
        "extract": ["classify", "preview"],
        "classify": ["embed", "chunk", "markdown"],
        "stt": ["classify"],
    }
-    stages = next_stages.get(current_stage, [])
+
+    # Only for extract: look up doc.source_channel and apply the override
+    if current_stage == "extract":
+        from models.document import Document
+        async with async_session() as lookup_session:
+            doc = await lookup_session.get(Document, document_id)
+            sc = doc.source_channel if doc else None
+        if sc in extract_override_by_channel:
+            stages = extract_override_by_channel[sc]
+        else:
+            stages = next_stages.get(current_stage, [])
+    else:
+        stages = next_stages.get(current_stage, [])
+
    if not stages:
        return

@@ -0,0 +1,279 @@
+# DEVONthink → Document Server Web Bridge (devonagent track)
+
+A manual installation guide for sending web pages discovered and saved by DEVONagent /
+DEVONthink to Document Server as searchable material. Plan: `~/.claude/plans/db-snuggly-petal.md`.
+
+## Flow
+
+```
+DEVONagent (smart agent, operated by the user)
+        ↓
+DEVONthink Inbox / tagged group (web/ingest)
+        ↓  Smart Rule (AppleScript)
+NAS /volume4/Document_Server/Web/{domain}/{YYYY-MM-DD}/{slug}.{html,md,json}
+        ↓  NFS → GPU file_watcher (every 5 minutes)
+documents row (source_channel='devonagent') + extract → embed → chunk
+        ↓
+/api/search + bge-reranker-v2-m3: searchable
+```
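The last hop in the flow above (extract → embed → chunk, with classify/preview/markdown skipped) can be sketched as a plain lookup. This is a minimal standalone sketch, not the worker itself: the table contents follow the queue_consumer change in this commit, while the names `DEFAULT_NEXT_STAGES`, `EXTRACT_OVERRIDE_BY_CHANNEL`, and `stages_after` are hypothetical and the database lookup is replaced by a plain argument.

```python
from __future__ import annotations

# Hypothetical stand-ins for the stage tables in enqueue_next_stage.
DEFAULT_NEXT_STAGES: dict[str, list[str]] = {
    "extract": ["classify", "preview"],
    "classify": ["embed", "chunk", "markdown"],
    "stt": ["classify"],
}

# Per-channel override, consulted only when the finished stage is "extract".
EXTRACT_OVERRIDE_BY_CHANNEL: dict[str, list[str]] = {
    "devonagent": ["embed", "chunk"],
}


def stages_after(current_stage: str, source_channel: str | None) -> list[str]:
    """Return the stages to enqueue once current_stage has finished."""
    if current_stage == "extract" and source_channel in EXTRACT_OVERRIDE_BY_CHANNEL:
        return list(EXTRACT_OVERRIDE_BY_CHANNEL[source_channel])
    return list(DEFAULT_NEXT_STAGES.get(current_stage, []))


if __name__ == "__main__":
    print(stages_after("extract", "devonagent"))
    print(stages_after("extract", "drive_sync"))
```

Keeping the override in its own table means another channel could be routed differently by adding one entry, without touching the default map.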
+
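+## Policy (Phase 1)
+
+- **Keep only the first ingest (first-wins)**: each `canonical_url` produces a documents row only once.
+  Saving the same article again in DEVONthink does **not refresh the content**. UTM parameter variants
+  (`?utm_source=foo`) and fragments (`#section`) are normalized and converge on a single row.
+  Versioning of updates is deferred to a later PR (`PR-Web-Update-Policy`).
+- **No AI processing**: this stage stops at "searchable material". ai_tldr / ai_bullets / automatic
+  category tagging will be decided in a separate PR (Mac mini derived-worker).
+- **Missing sidecar (.json)**: the file is ingested rather than skipped, and flagged with
+  `extract_meta.web_meta.sidecar_missing=true`. Without URL information its value as search
+  evidence drops, but that is better than a silent omission.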
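To make the first-wins convergence concrete, here is a minimal standalone sketch of the normalization and hashing described above. It follows the same steps as `_canonicalize_url` in this commit (drop tracking parameters, sort the remaining query, strip the fragment and trailing slash, lowercase scheme and host); the function names `canonicalize_url` and `url_identity` are hypothetical, and the real helper additionally falls back to the stripped input if parsing fails.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {
    "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
    "fbclid", "gclid", "msclkid", "ref", "ref_src", "ref_url", "mc_cid", "mc_eid",
}


def canonicalize_url(url: str) -> str:
    """Normalize a URL so tracking and fragment variants map to one string."""
    if not url:
        return ""
    p = urlparse(url.strip())
    # Keep only non-tracking query parameters, in a stable order.
    query = sorted(
        (k, v) for k, v in parse_qsl(p.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = p.path.rstrip("/") or "/"
    # The fragment slot is left empty on purpose.
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path, "", urlencode(query), ""))


def url_identity(url: str) -> str:
    """sha256 of the canonical URL, used as the dedup key (file_hash)."""
    return hashlib.sha256(canonicalize_url(url).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "https://Example.com/sample-post/?utm_source=newsletter#main"
    b = "https://example.com/sample-post"
    print(canonicalize_url(a))
    print(url_identity(a) == url_identity(b))
```

Both variants hash to the same value, so the second save is recognized as an existing row and is not ingested.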
+
+## NAS path rules
+
+```
+/volume4/Document_Server/Web/
+    ├── example.com/
+    │   ├── 2026-05-15/
+    │   │   ├── sample-post.html        # body HTML
+    │   │   ├── sample-post.md          # DEVONthink rendered markdown (for fallback)
+    │   │   └── sample-post.json        # metadata sidecar
+    │   └── 2026-05-14/
+    │       └── another-post.html
+    └── ...
+```
+
+- Domain: lowercase `urlparse(url).hostname`
+- Date: `YYYY-MM-DD` of the `creation date` (KST or UTC; pick one and stay consistent)
+- slug: converted to a filename-safe form (alphanumerics, hyphens, and underscores only)
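The three rules above can be sketched in Python for checking what path a given page should land at. This is an illustrative sketch only: the actual export is done by the AppleScript Smart Rule in this guide, the names `slugify` and `web_drop_path` are hypothetical, and the character handling (keep ASCII alphanumerics, `-`, `_`; map space, `.`, `/` to `-`; drop everything else; cap at 80 characters) is an assumption modeled on that script.

```python
from datetime import date
from pathlib import PurePosixPath
from urllib.parse import urlparse


def slugify(title: str, max_len: int = 80) -> str:
    """Filename-safe slug: ASCII alphanumerics, hyphen, and underscore only."""
    out = []
    for ch in title:
        if ch.isascii() and (ch.isalnum() or ch in "-_"):
            out.append(ch)
        elif ch in " ./":
            out.append("-")
        # Any other character (including non-ASCII) is dropped.
    slug = "".join(out)[:max_len]
    return slug or "untitled"


def web_drop_path(url: str, saved_on: date, title: str, ext: str = "html") -> PurePosixPath:
    """Relative path under Web/: {domain}/{YYYY-MM-DD}/{slug}.{ext}."""
    host = (urlparse(url).hostname or "unknown").lower()
    return PurePosixPath(host) / saved_on.isoformat() / f"{slugify(title)}.{ext}"


if __name__ == "__main__":
    print(web_drop_path("https://Example.com/sample-post", date(2026, 5, 15), "Sample Post"))
```

Note that under these assumptions a title made up entirely of non-ASCII characters collapses to `untitled`, so two such pages saved from the same host on the same day would collide on the file name.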
+
+## Sidecar JSON schema
+
+```json
+{
+  "title": "Sample Blog Post Title",
+  "url": "https://example.com/sample-post?utm_source=newsletter#main",
+  "author": "Author Name",
+  "pub_date": "2026-05-15T09:00:00Z",
+  "devonthink_uuid": "DEADBEEF-1234-5678-90AB-CDEF12345678",
+  "source_agent": "web-ingest"
+}
+```
+
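+- `url` is **required**: a sidecar without `url` is handled as `sidecar_missing`. `title` is expected too, but when it is absent the file name stem is used instead
+- `pub_date`: ISO 8601 in UTC recommended (for Korean time, add an explicit +09:00)
+- `source_agent`: which smart agent collected the page (optional metadata for analysis)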
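A small sketch of reading a sidecar under this schema, assuming (as the watcher code in this commit does) that a sidecar without `url` is treated as missing and that an absent `title` falls back to the file name stem. The `Sidecar` dataclass and `parse_sidecar` function are hypothetical names for illustration, not part of the codebase.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Sidecar:
    url: str
    title: str
    author: str | None = None
    pub_date: str | None = None
    devonthink_uuid: str | None = None
    source_agent: str | None = None


def parse_sidecar(raw: str, fallback_title: str) -> Sidecar | None:
    """Parse sidecar JSON text; None means 'handle as sidecar_missing'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not data.get("url"):
        return None
    return Sidecar(
        url=str(data["url"]),
        title=str(data.get("title") or fallback_title),
        author=data.get("author"),
        pub_date=data.get("pub_date"),
        devonthink_uuid=data.get("devonthink_uuid"),
        source_agent=data.get("source_agent"),
    )


if __name__ == "__main__":
    text = '{"title": "Sample Blog Post Title", "url": "https://example.com/sample-post"}'
    print(parse_sidecar(text, fallback_title="sample-post"))
```

Returning `None` for both unparseable JSON and a missing `url` keeps the caller's decision binary: either there is a URL identity to dedup on, or the document is flagged.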
+
+## Installing the DEVONthink Smart Rule
+
+### 1. Create the Smart Rule
+
+DEVONthink 3 menu → `Tools` → `Smart Rules` → `+` (new rule).
+
+- **Name**: `Web → NAS for GPU ingest`
+- **Trigger**:
+  - `On Adding Item to` (Inbox): automatic processing of the Inbox
+  - or `On Tagging Item`: fires when the `web/ingest` tag is added (if you prefer manual curation)
+- **Conditions** (optional):
+  - `Kind` is `WebArchive` or `HTML` or `Markdown`
+  - `URL` is not empty
+
+### 2. Action: `Execute Script`
+
+Paste the following AppleScript into the `Action Scripts` area. The NAS path
+`/Volumes/Document_Server` is assumed to be an SMB/AFP volume mounted by macOS.
+(For a different mount point, change only `kBaseDir`.)
+
+```applescript
+-- DEVONthink Smart Rule: Web → NAS for GPU ingest
+-- Plan: ~/.claude/plans/db-snuggly-petal.md
+
+property kBaseDir : "/Volumes/Document_Server/Web"
+
+on slugify(theText)
+    set theResult to ""
+    repeat with c in theText
+        set ch to c as string
+        set asciiVal to (id of ch)
+        if (asciiVal ≥ 48 and asciiVal ≤ 57) or ¬
+           (asciiVal ≥ 65 and asciiVal ≤ 90) or ¬
+           (asciiVal ≥ 97 and asciiVal ≤ 122) or ¬
+           ch is "-" or ch is "_" then
+            set theResult to theResult & ch
+        else if ch is " " or ch is "." or ch is "/" then
+            set theResult to theResult & "-"
+        end if
+    end repeat
+    if theResult is "" then set theResult to "untitled"
+    if (length of theResult) > 80 then ¬
+        set theResult to text 1 thru 80 of theResult
+    return theResult
+end slugify
+
+on hostnameFromURL(theURL)
+    try
+        set delim to "://"
+        set AppleScript's text item delimiters to delim
+        set tail to text item 2 of theURL
+        set AppleScript's text item delimiters to "/"
+        set host to text item 1 of tail
+        set AppleScript's text item delimiters to ""
+        -- strip port + lowercase
+        set AppleScript's text item delimiters to ":"
+        set host to text item 1 of host
+        set AppleScript's text item delimiters to ""
+        return do shell script "echo " & quoted form of host & " | tr 'A-Z' 'a-z'"
+    on error
+        return "unknown"
+    end try
+end hostnameFromURL
+
+on isoDate(theDate)
+    set y to year of theDate as string
+    set m to month of theDate as integer
+    set d to day of theDate as integer
+    if m < 10 then set m to "0" & m
+    if d < 10 then set d to "0" & d
+    return y & "-" & m & "-" & d
+end isoDate
+
+on performSmartRule(theRecords)
+    tell application id "DNtp"
+        repeat with theRecord in theRecords
+            try
+                set theURL to URL of theRecord
+                if theURL is missing value or theURL is "" then
+                    log message "Web→NAS: no URL, skip: " & (name of theRecord)
+                    -- continue
+                else
+                    set theName to name of theRecord
+                    set theUUID to uuid of theRecord
+                    set theAuthor to ""
+                    try
+                        set theAuthor to (meta data of theRecord)'s |author|
+                    end try
+                    set theDate to (creation date of theRecord)
+                    set dateStr to my isoDate(theDate)
+                    set host to my hostnameFromURL(theURL)
+                    set slug to my slugify(theName)
+
+                    set targetDir to kBaseDir & "/" & host & "/" & dateStr
+                    do shell script "mkdir -p " & quoted form of targetDir
+
+                    set htmlPath to targetDir & "/" & slug & ".html"
+                    set mdPath   to targetDir & "/" & slug & ".md"
+                    set jsonPath to targetDir & "/" & slug & ".json"
+
+                    -- 1) HTML export
+                    try
+                        export record theRecord to htmlPath as HTML
+                    on error errMsg
+                        log message "Web→NAS HTML export failed (" & theName & "): " & errMsg
+                    end try
+
+                    -- 2) Markdown export (DEVONthink rendered; used as a fallback when trafilatura falls short)
+                    try
+                        export record theRecord to mdPath as markdown
+                    end try
+
+                    -- 3) JSON sidecar
+                    set pubISO to do shell script ¬
+                        "date -u +%Y-%m-%dT%H:%M:%SZ -r " & ¬
+                        (do shell script "stat -f %m " & quoted form of htmlPath)
+                    set jsonText to "{" & ¬
+                        "\"title\":" & my jsonEsc(theName) & "," & ¬
+                        "\"url\":" & my jsonEsc(theURL) & "," & ¬
+                        "\"author\":" & my jsonEsc(theAuthor) & "," & ¬
+                        "\"pub_date\":\"" & pubISO & "\"," & ¬
+                        "\"devonthink_uuid\":\"" & theUUID & "\"," & ¬
+                        "\"source_agent\":\"smart-rule:web-ingest\"" & ¬
+                        "}"
+                    do shell script "cat > " & quoted form of jsonPath & ¬
+                        " <<'EOF'" & linefeed & jsonText & linefeed & "EOF"
+
+                    log message "Web→NAS: " & theName & " → " & host & "/" & dateStr
+                end if
+            on error errMsg
+                log message "Web→NAS processing failed: " & errMsg
+            end try
+        end repeat
+    end tell
+end performSmartRule
+
+on jsonEsc(theText)
+    if theText is missing value then return "\"\""
+    set s to theText as string
+    -- minimal escaping: backslashes and double quotes
+    set AppleScript's text item delimiters to "\\"
+    set parts to text items of s
+    set AppleScript's text item delimiters to "\\\\"
+    set s to parts as string
+    set AppleScript's text item delimiters to "\""
+    set parts to text items of s
+    set AppleScript's text item delimiters to "\\\""
+    set s to parts as string
+    set AppleScript's text item delimiters to ""
+    return "\"" & s & "\""
+end jsonEsc
+```
+
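+**Note**: the script above is a starting point. Check the following before relying on it.
+
+- Whether the `kBaseDir` path matches the actual NAS mount
+- `creation date` may not be the article's real publication date (it is when DEVONthink saved it);
+  use `meta data → date` if needed
+- Whether JSON escaping breaks on Korean or special characters; replacing it with
+  `do shell script "python3 -c ..."` is safer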
+
+### 3. Verify it works
+
+1. Save a web page to the Inbox in DEVONthink (shortcut `^⌥⌘)` or Clip to DEVONthink)
+2. The Smart Rule fires automatically (or right-click → `Apply Rule`)
+3. Confirm that the three files `/Volumes/Document_Server/Web/{host}/{date}/{slug}.{html,md,json}` were created
+4. The GPU file_watcher ingests them within at most 5 minutes. Check with SQL:
+   ```sql
+   SELECT id, title, edit_url, md_extraction_engine, md_status
+   FROM documents WHERE source_channel='devonagent'
+   ORDER BY created_at DESC LIMIT 5;
+   ```
+
+## file_watcher behavior summary
+
+- Collects only `.html` files under `nas_mount_path / "Web"` via rglob every 5 minutes
+- For each `.html`, reads the sibling `.json` and derives the canonical URL
+- `file_hash = sha256(canonical_url)` → dedup by URL identity
+- Creates a documents row and registers `processing_queue.stage='extract'`
+- extract_worker's 4-tier fallback fills md_content
+- Docs with `source_channel='devonagent'` SKIP `classify`/`preview`/`markdown` →
+  only `embed` + `chunk` are enqueued
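The tier selection behind the 4-tier fallback mentioned above can be sketched independently of the extraction libraries. In this sketch the extractors are injected as plain callables, so it runs without trafilatura, readability-lxml, or bs4 installed; `pick_body` and `Extractor` are hypothetical names, and the 200-character threshold and the `empty` engine label mirror `_WEB_MIN_BODY_LEN` and the extract_worker code in this commit.

```python
from __future__ import annotations

from typing import Callable

MIN_BODY_LEN = 200  # a tier "wins" only if its body is at least this long

Extractor = Callable[[str], str]


def _safe(extract: Extractor, html: str) -> str:
    """Run one extractor; any exception or None counts as an empty body."""
    try:
        return extract(html) or ""
    except Exception:
        return ""


def pick_body(
    html: str,
    tiers: list[tuple[str, Extractor]],
    last_resort: tuple[str, Extractor],
) -> tuple[str, str]:
    """Return (engine_name, body) from the first tier that clears the threshold.

    The last resort is accepted with any non-empty body; if even that is
    empty, the engine is reported as "empty".
    """
    for name, extract in tiers:
        body = _safe(extract, html)
        if len(body) >= MIN_BODY_LEN:
            return name, body
    name, extract = last_resort
    body = _safe(extract, html)
    return (name, body) if body else ("empty", "")


if __name__ == "__main__":
    tiers = [
        ("trafilatura", lambda h: ""),              # e.g. a JS-rendered page
        ("devonthink_export", lambda h: "y" * 300),  # sibling .md is long enough
    ]
    engine, body = pick_body("<html></html>", tiers, ("bs4_text", lambda h: "text"))
    print(engine, len(body))
```

Recording the winning engine name alongside the body is what makes the engine-distribution query in the verification section possible.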
+
+## Verification (after rollout)
+
+```sql
+-- Domain distribution (which sites contribute the most)
+SELECT split_part(edit_url, '/', 3) host, count(*) cnt
+FROM documents WHERE source_channel='devonagent' AND edit_url IS NOT NULL
+GROUP BY host ORDER BY cnt DESC;
+
+-- Extraction engine distribution (monitor the bs4_text share)
+SELECT md_extraction_engine, count(*) cnt,
+       ROUND(100.0 * count(*) / sum(count(*)) OVER (), 1) pct
+FROM documents WHERE source_channel='devonagent'
+GROUP BY md_extraction_engine ORDER BY cnt DESC;
+
+-- Documents missing a sidecar (makes silent omissions visible)
+SELECT id, title, file_path
+FROM documents
+WHERE source_channel='devonagent'
+  AND extract_meta->'web_meta'->>'sidecar_missing' = 'true';
+```
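For use in a script or notebook, the engine-distribution query above has a straightforward Python counterpart. This is only a sketch over an in-memory list of `md_extraction_engine` values; the function name `engine_share` and the `unknown` label for NULL values are assumptions, and how the values are fetched from the database is left out.

```python
from __future__ import annotations

from collections import Counter


def engine_share(engines: list[str | None]) -> dict[str, float]:
    """Percentage of documents per extraction engine, most common first."""
    counts = Counter(e or "unknown" for e in engines)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100.0 * n / total, 1) for name, n in counts.most_common()}


if __name__ == "__main__":
    sample = ["trafilatura", "trafilatura", "trafilatura", "bs4_text"]
    print(engine_share(sample))
```

A rising `bs4_text` or `empty` share is the signal to look at, since those tiers carry the most navigation noise or no body at all.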
+
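+## Known limitations (Phase 1)
+
+- **JS-rendered pages**: on sites that render the body client-side (SPA / React / Vue), the HTML
+  contains no body text, so trafilatura returns an empty result. If the DEVONthink WebArchive
+  export captures the rendered output this is fine; otherwise even the bs4_text fallback is thin.
+  A Playwright container is left to a separate PR.
+- **Login/paywalled content**: the body is available only when DEVONthink captured it with a logged-in session.
+- **canonical_url policy**: a reprint of the same article (Medium → the author's own blog) is ingested as a separate row.
+  URL identity is the only dedup criterion.
+- **Keep only the first ingest**: later edits to an article are not picked up. The policy will be decided in a separate PR.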