Compare commits

16 Commits

Author SHA1 Message Date
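hyungi 988631fdb6 feat(documents): add section outline rail, jump, and scroll-spy to the 3-pane center reader
Extends the outline rail/jump, previously available only in the [id] full view, to the center
reader of the main /documents 3-pane (the surface the user mainly uses). Reuses the path A anchor infrastructure as-is.

- Fetch /documents/{id}/sections (loadSections, guarded by doc.id) → SectionOutline rail on the left
  (showRail = displayable sections exist + markdown-ish body). The 31% noise from window chunks with
  empty titles is excluded from display by the outlineSections filter (cleanup only, corpus untouched).
- anchorMap = buildAnchorMap(mdRenderText, sections), based on the text each branch actually renders.
  anchorMap is passed to MarkdownDoc (markdown/pdf/hwp/article) → <span id=sec-N> splice.
- jumpTo = scrollIntoView on #sec-{id} inside scrollEl. scroll-spy = a scroll listener on scrollEl;
  the last .md-anchor that has passed the top becomes activeKey (highlighted in SectionOutline). Cleaned up via $effect.
- The body is wrapped in a [rail | scrollEl] flex layout (documents without sections show no rail, same as before).
  The pdf branch drops its own overflow and uses scrollEl as the single scroller (iframe h-[80vh]).

Jumps match id↔id, so duplicate titles and non-ATX headings resolve correctly; a section without an anchor is disabled (fallback). FE only, no BE changes.
vite build + node test 10/10 + lint:tokens (0 new) PASS.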

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 21:26:08 +09:00
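
The scroll-spy rule in the commit above (the last anchor that has passed the top of the scroll container becomes the active key) can be sketched as a pure function. This is a Python illustration only; the real code is Svelte/TypeScript, and `pick_active_key`, the `(key, top)` pair shape, and the `threshold` parameter are hypothetical names chosen for this sketch.

```python
from typing import Optional, Sequence, Tuple


def pick_active_key(
    anchors: Sequence[Tuple[str, float]],
    threshold: float = 0.0,
) -> Optional[str]:
    """Return the key of the last anchor whose top is at or above `threshold`.

    `anchors` holds (key, top) pairs in document order (assumed shape), where
    `top` is the anchor's offset from the top of the scroll container and
    becomes negative once the anchor has scrolled past it.
    """
    active = None
    for key, top in anchors:
        if top <= threshold:
            active = key  # this anchor has passed the top; keep the latest one
        else:
            break  # anchors are in document order, so the rest are still below
    return active
```

When no anchor has passed the top yet the function returns `None`, which corresponds to no outline item being highlighted.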
hyungi 5c065e6bec feat(documents): wire up outline jump: anchor splice + id↔id jump + scroll-spy ([id])
Wires complaint ② (outline → body jump) deterministically (path A). Detail page ([id], which has the outline rail).

- MarkdownDoc: adds an anchorMap prop → before rendering, splices
  <span id="sec-{chunkId}" class="md-anchor"> into md_content at each offset (in descending order) as jump targets. DOMPurify lets span+id+class through.
- SectionOutline: onJump(chunkId)/activeKey props. A click toggles the accordion and calls onJump (jump).
  The item matching activeKey is highlighted with a left accent border (scroll-spy).
- [id]: anchorMap=buildAnchorMap(md_content, sections) (when canShowMarkdown) → passed to MarkdownDoc.
  jumpToSection = scrollIntoView on #sec-id. scroll-spy (window scroll; the last anchor that has passed the 120px top offset).
  onJump/activeKey wired to both SectionOutline instances (xl rail and details).

Direct id↔id matching, so duplicate titles (표-1, Part UW: 814 cases) and non-ATX headings (제N조) resolve correctly. A section without an anchor has its jump
disabled (accordion fallback). node test 10/10, vite build + lint:tokens (0 new) PASS.
Next = outline rail for the 3-pane (DocumentViewer) (commit 3, layout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 20:17:07 +09:00
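
The reason the splice walks the offsets in descending order can be shown with a small sketch: inserting at a later offset first never shifts an earlier offset that is still pending. This is a Python illustration under assumptions, not the MarkdownDoc code itself; `splice_anchors`, the `{chunk_id: offset}` shape of the map, and the empty `<span>` body are hypothetical.

```python
from typing import Dict


def splice_anchors(md: str, anchor_map: Dict[int, int]) -> str:
    """Insert a jump-target <span> at each offset listed in `anchor_map`.

    `anchor_map` maps a chunk id to a character offset in `md` (assumed shape).
    Offsets are handled from largest to smallest so that an insertion never
    moves an offset that has not been processed yet.
    """
    out = md
    ordered = sorted(anchor_map.items(), key=lambda kv: kv[1], reverse=True)
    for chunk_id, offset in ordered:
        tag = f'<span id="sec-{chunk_id}" class="md-anchor"></span>'
        out = out[:offset] + tag + out[offset:]
    return out
```

Processing in ascending order would require adding the length of every previously inserted tag to each remaining offset; the descending walk avoids that bookkeeping.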
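hyungi e1a047c2c2 feat(documents): outline-jump anchorMap utility (forward cursor, triple defense)
Computes deterministic anchor coordinates for complaint ② (outline → body jump) (path A, FE-only).
Gate measurements showed that textContent matching yields 63% duplicates and, with non-ATX headings, 5%, plus silent wrong jumps → instead, the heading-line
offset of each section is located in md_content to produce the injection coordinates for <a id="sec-{chunk_id}">.

★ Triple defense against false early matches (from adversarial review):
- Line-start (whole-line) matching → a mid-body cross-reference ("see Part UW") is excluded because the whole line
  does not equal the title (the key hole the forward cursor alone could not close).
- Whole-line match + truncation handling (builder [:200]) → prevents '제1조' from mismatching '제1조의2'.
- Monotonic cursor + code-fence avoidance → backward or in-fence matches are rejected = no anchor (jump disabled, never a wrong jump).

Window/section_split fragments and empty titles are skipped. node test 10/10 PASS (cross-reference first, monotonic duplicates,
prefix, plain-text 제N조, fence, window, miss, heading_path fallback). Pure function, vite build PASS.
Next commit = MarkdownDoc splice + SectionOutline jump + DocumentViewer rail/scroll-spy.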

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 20:11:00 +09:00
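
The three defenses listed in the commit above (whole-line matching, truncation-aware comparison, monotonic cursor with code-fence avoidance) can be sketched as follows. This is a simplified Python illustration, not the actual TypeScript utility: `build_anchor_map`, the `(chunk_id, title)` input shape, and the ATX-prefix regex are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

ATX_PREFIX = re.compile(r"^#{1,6}\s+")
FENCE = "`" * 3  # code-fence marker


def build_anchor_map(
    md: str, sections: List[Tuple[int, str]], max_title: int = 200
) -> Dict[int, int]:
    """Map chunk id -> offset of the matching heading line in `md`.

    `sections` is a list of (chunk_id, title) in document order (assumed shape).
    """
    # Index every line outside code fences as (offset, text without ATX marks).
    lines: List[Tuple[int, str]] = []
    offset = 0
    in_fence = False
    for raw in md.splitlines(keepends=True):
        stripped = raw.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence:
            lines.append((offset, ATX_PREFIX.sub("", stripped)))
        offset += len(raw)

    anchors: Dict[int, int] = {}
    cursor = 0  # index into `lines`; it only moves forward (monotonic)
    for chunk_id, raw_title in sections:
        truncated = len(raw_title) >= max_title  # title was cut by the builder
        title = raw_title.strip()
        if not title:
            continue  # empty titles get no anchor
        for i in range(cursor, len(lines)):
            line_offset, text = lines[i]
            # Whole-line equality; a prefix match is allowed only for truncated titles.
            if text == title or (truncated and text.startswith(title)):
                anchors[chunk_id] = line_offset
                cursor = i + 1
                break
        # No match: the section stays without an anchor (jump disabled).
    return anchors
```

A section that cannot be matched is simply absent from the result, which mirrors the stated policy of disabling the jump rather than risking a wrong one.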
hyungi 2c77b3b0e7 Merge pull request 'feat(documents): unify the 3-pane center reader as markdown-first (DocumentViewer)' (#30) from feat/documents-viewer-unify into main
Reviewed-on: #30
2026-06-08 15:55:18 +09:00
hyungi 360871e9cf feat(documents): unify the 3-pane center reader as markdown-first (DocumentViewer)
Removes the split in which the center reader (DocumentViewer) of the main /documents 3-pane ignored md_content
and rendered only PDF = raw iframe and md/txt = plain marked (extracted_text).
The canonical markdown produced by the "convert everything to MD" effort is now visible directly in the main view without opening the full view (complaint ①).

- New viewerType.ts: single source of classification (to be shared with the detail page, prevents drift).
  csv/json/xml/html→text (<pre>, avoids comma clumping), office→preview-pdf, hwp→hwp-markdown.
- DocumentViewer: removes its own getViewerType/renderMd (body) → viewerType.ts + MarkdownDoc.
  - pdf: when canShowMarkdown (isMdSuccess + md_content), MarkdownDoc by default + a [Markdown|PDF original]
    toggle + MarkdownStatusBadge; otherwise the PDF iframe. The lastDocId guard is keyed on fullDoc.id (prop).
  - markdown (md/txt): MarkdownDoc (extracted_text = the single field for display and editing), editing retained.
  - hwp-markdown/article: MarkdownDoc (anchors/KaTeX/images). Only the edit preview keeps plain marked.
  - article/preview-pdf/image/text/cad/synology/unsupported branches preserved (no regressions) + new synology branch.

isMdSuccess handles API md_status='completed' (S1 validator live). FE only, no BE/schema changes.
vite build + lint:tokens (0 new violations) PASS. Follow-ups: outline rail and safe jump (commit 2), [id] alignment (commit 3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 15:44:46 +09:00
hyungi 0f37fe6492 Merge pull request 'fix(ui): accept both md_status 'success' and 'completed' (ahead of the S1 API remap)' (#29) from fix/md-status-completed-compat into main
Reviewed-on: #29
2026-06-08 15:27:45 +09:00
hyungi 4042d9ec61 fix(ui): accept both md_status 'success' and 'completed' (ahead of the S1 API remap)
The S1 backend (already merged to main; app/api/documents.py field_validator
_db_success_to_completed) remaps DB 'success' to API 'completed' at serialization time.
However, three frontend sites check only raw 'success' → a silent regression once the S1 backend is deployed:
  - documents/[id]/+page.svelte canShowMarkdown: a completed PDF is shown as raw PDF
    instead of markdown-first
  - documents/+page.svelte inspector chip gate: the chip disappears for success documents
  - MarkdownStatusBadge: 'completed'→default→null (the success chip disappears)

DB↔API enum divergence guard: both terms must be treated as success to stay safe both before
(API='success') and after (API='completed') the S1 deployment. Converged into single-source helpers.

- New lib/utils/mdStatus.ts: isMdSuccess / isMdStatusVisible (no scattered raw comparisons)
- [id] canShowMarkdown → isMdSuccess()
- documents inspector gate → isMdStatusVisible()
- MarkdownStatusBadge: adds case 'completed' as a synonym of 'success'

FE only, no backend/schema/migration changes. vite build + lint:tokens (0 new violations) PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 14:48:38 +09:00
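
The single-source helper idea in the commit above can be sketched in a few lines. The real helper lives in lib/utils/mdStatus.ts (TypeScript); the Python below is only an illustration, and `MD_SUCCESS_STATUSES`, `is_md_success`, and `can_show_markdown` are hypothetical names that mirror the described behavior (success status plus non-blank md_content).

```python
from typing import Optional

# Both vocabularies count as success: 'success' (DB, pre-S1 API)
# and 'completed' (API after the S1 remap).
MD_SUCCESS_STATUSES = frozenset({"success", "completed"})


def is_md_success(status: Optional[str]) -> bool:
    """True when the status means the markdown conversion succeeded."""
    return status in MD_SUCCESS_STATUSES


def can_show_markdown(status: Optional[str], md_content: Optional[str]) -> bool:
    """Markdown-first rendering needs a success status and non-blank content."""
    return is_md_success(status) and bool(md_content and md_content.strip())
```

Keeping the comparison in one place means a later vocabulary change touches a single set rather than every call site.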
hyungi c2d2a0aa4d Merge pull request 'fix(ui): inspector md status chip enum bug (success always yellow) + article suppress' (#28) from fix/md-status-chip into main
Reviewed-on: #28
2026-06-08 14:41:31 +09:00
hyungi c8d8df6b2d fix(migrations): s1 dedup 287->317 renumber (avoids collision with main 287=study_memo_cards) 2026-06-08 03:07:53 +00:00
hyungi daf6a0ade9 feat(documents): S1 dedup, office-md, storage scaffold (B/C/D/E)
Remaining implementation of plan ds-s1-backend-1 (A and C-1 are in 16b0fe1):
- B dedup: services/dedup.py (OFF-list law_monitor shared) + fill on upload (B-1)
  + GET /documents/duplicates (B-2) + async post-upload near-dup (B-3)
  + backfill_dedup.py (B-4) + nightly dedup_reconcile job (03:30 KST, idempotent recomputation)
- C MD-first: marker_worker office/hwp branch _process_office (C-2) + md_status
  state machine postcondition success|failed (C-5) + backfill_nonpdf_markdown.py (C-4)
  + requirements markitdown
- D storage: services/storage ABC + Range contract / LocalBackend / NasApiBackend 503
  (D-1) + /file goes through the resolver, local behavior unchanged (D-2)
- E operations: pre-change pg_dump + rollback_287.sql + apply runbook (E-3) + tests (E-1)

Non-breaking invariant maintained (existing response shapes unchanged, md_status success→completed mapped at read time).
One finding confirmed by adversarial review (stale duplicate_of when a member is promoted to canonical after a soft-delete) → resolved by B-1
promotion normalization + nightly recomputation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 03:05:30 +00:00
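
The canonical/member rule behind the nightly recomputation (canonical = smallest id with duplicate_count = group size minus one; every other member points at it with count 0) can be shown as a pure function. This is a minimal sketch under assumptions: `desired_dedup_state` is a hypothetical name, and the input is taken to be lists of live (not soft-deleted) document ids that share a file_hash.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def desired_dedup_state(
    groups: Iterable[List[int]],
) -> Dict[int, Tuple[Optional[int], int]]:
    """Return {doc_id: (duplicate_of, duplicate_count)} for each duplicate group."""
    desired: Dict[int, Tuple[Optional[int], int]] = {}
    for members in groups:
        if len(members) < 2:
            continue  # a single row is not a duplicate group
        canonical = min(members)  # oldest row wins
        desired[canonical] = (None, len(members) - 1)
        for doc_id in members:
            if doc_id != canonical:
                desired[doc_id] = (canonical, 0)
    return desired
```

With live members [10, 11, 12], row 10 is canonical with a count of 2. Once row 10 is soft-deleted the live group becomes [11, 12]; recomputing promotes 11 and repoints 12, which is the kind of drift the nightly job is described as repairing.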
hyungi 68e2d7ea04 feat(documents): S1-ADD dedup and original-filename columns (3) + md_status success→completed mapping (A) + office→md PoC (C-1)
plan ds-s1-backend-1 (converged at r5). Code staged only: migration not applied (restart on hold, E-2 Soft Lock exception window).

A (minimum line for non-breaking app v1 decoding):
- A-1 migrations/287_documents_dedup_fields.sql: original_filename TEXT / duplicate_of BIGINT FK ON DELETE SET NULL
  / duplicate_count INTEGER NOT NULL DEFAULT 0. Single statement, PG16 fast path, no BEGIN/COMMIT. Backfill not included (B-4).
- A-2 app/models/document.py: 3 mapped_column entries in the layer-1 block (+ ForeignKey import). md_* already exist.
- A-3 app/api/documents.py: 3 fields on DocumentResponse (duplicate_count=0, non-optional) + DocumentDetailResponse
  field_validator (success→completed, mode=before): read-time DB→API one-way mapping, not applied on write (ORM).
- A-4 tests/test_s1_dedup_shape.py: success→completed behavior + non-success passthrough + defaults/roundtrip of the 3 fields
  + decoding of the ds-app contract fixture (skip-if-absent). py_compile OK. ★ This is the backend half: full non-breaking assurance requires this AND the S3 render tests.

C-1 PoC (not wired to the worker; the marker_worker branch is wired in C-2):
- app/workers/office_md.py: OOXML = markitdown (new dep, lazy) / hwp, hwpx = LibreOffice headless → HTML → markdownify (existing dep).
  Failure, empty output, timeout, or a missing dep → raises OfficeMdError (success with empty md is forbidden = the converter-side contract of the C-5 postcondition).
- scripts/poc_office_md.py: table-fidelity measurement harness. E-1 = run in a safe context pinned to the prod LibreOffice version (the hwpx filter is version-dependent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 03:05:30 +00:00
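hyungi 5a19cde38c fix(documents): align domain tree counts with the document list exclusions
The tree (/documents/tree) excludes only deleted rows and counts news, laws, and memos, whereas the document list
excludes source_channel news/law_monitor and file_type note by default → a mismatch of 'the tree says N but clicking shows
0' (e.g. all 5 Philosophy/Aesthetics documents are news+note, so clicking shows 0). The same exclusions are applied
to the tree query so the count equals what is actually shown. Effect: counts normalized, e.g. Philosophy 12→2, General 189→84.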

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:57:47 +09:00
hyungi 7cc38e8a4a fix(ds-app): correct the category-counts contract: recapture the synthesized shape from live measurement
Diagnosis of a decode failure (Key 'total' not found) on the first real login after going live:
the server's /documents/stats/category-counts returns a raw dict with no Pydantic response model
({counts:{category:n}, library_pending_suggestions}); the initial
contract extraction ('extracted from the real Pydantic models') had synthesized a shape for this endpoint
(total/by_domain/review_pending/pipeline_failed = nonexistent).

- CategoryCounts model = measured shape + derived total accessor (sum of counts)
- 2 fixture copies (contract/fixtures + DSKit Resources) = recaptured as CAPTURED_LIVE
- DashboardView skeleton aligned (category distribution + Korean labels; the full redesign is FU-E)
- Correction note on the corresponding row of CONTRACT.md

Full comparison of live shapes across all endpoints (after token creation, 11 curl calls + shape_diff):
no real drift other than stats; documents/tree, search, memos, digest, auth_me, detail, and
content match. Missing original_filename/duplicate_* = S1 not deployed (harmless because optional,
resolved on deployment) / md_frontmatter and memo_task_state = data differences in the open JSONValue shape
(harmless) / duplicates 422 = S1 route not deployed (expected).

Verification: swift test 82/82 + shape_diff (shape identical) + xcodebuild PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 00:55:59 +00:00
hyungi f1dc2e1a8d feat(ds-app): live wiring to the main server (GPU DS): switch the app default from the offline scaffold to live
- AppModel: AuthPhase state machine (checking/loggedOut/ready) + live() factory
  (LiveDSClient + realRouter, ask token = TokenProvider single source) + bootstrap
  (login-free return via the refresh cookie, single-shot, retry restored on cancellation) + login (TOTP
  newline/whitespace normalization) + demotion to loggedOut when the session expires during use + resync of the
  download ?token= copy after a 401 rotation (guarded funnel)
- New LoginView (functional shell, shows the server host, surfaces the server's detail message)
- RootView: auth gate + errorText bottom banner (makes no-silent-fallback visible)
- DSApp: default .live (publicTLS=document.hyungi.net/api), DSAPP_FIXTURE=1 /
  DSAPP_DS_URL env switches (parse failure = fail-loud, no silent fallback in prod)
- LiveDSClient.currentAccessToken(): for the realRouter ask-token closure
- 10 new AppFeatureTests (auth state machine, single-shot, transport reasons, totp)

Verification: swift test 82/82 green + xcodebuild .app BUILD SUCCEEDED + live
negative paths (/auth/login 401, /auth/refresh 401; both paths reach the main server).
Incorporates a 3-lens adversarial review (re-entry guard / transport distinction / env fail-loud / token
copy sync / expiry demotion). Sources/AI untouched (signature freeze respected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 00:55:59 +00:00
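hyungi 9ffbdc0c23 fix(ui): remove mobile horizontal overflow (min-w-0/minmax/flex-wrap/break)
Flex/grid children holding truncate or long text lacked min-w-0 → they could not shrink on narrow screens,
causing horizontal page scroll and text running off screen (the dashboard recent-activity timeline being the typical case).
- dashboard: timeline grid 1fr→minmax(0,1fr) + min-w-0 on cells / domain labels and pinned items flex-1 min-w-0 (+break-words)
- inbox: min-w-0 on list titles
- ask: flex-wrap on the search bar + min-w-0 on the input + min-w-0 max-w on the select
- library: tree nodes and breadcrumbs min-w-0/truncate/flex-wrap
- events: meta row min-w-0 + project_tag break-all
- memos: overflow-wrap:anywhere on body/code/links + horizontal-scroll guard for tables
Audited 11 pages → fixed 6; a per-page adversarial rescan removed the remaining antipatterns as well. No desktop regressions, 0 token/emoji violations.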

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:41:57 +09:00
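hyungi b6c5c133bc feat(ui): fill desktop width on data-dense pages (responsive fluid, ~1680/1240 caps)
Fixes content being bunched in the center at roughly 1024 to 1400px on desktop, leaving large gaps on both sides.
Dense/grid/dashboard-type pages expand to max-w-[1680px]; single-column list pages expand to max-w-[1240px] (side padding kept, structure preserved).
- dashboard: max-w-5xl→1680, right rail 320→360px
- digest: .app max-width 1180→1680
- ask, library, audio, video: →1680 / inbox, events: →1240 (events responsive padding reinforced)
Reading/form pages (memos, settings, events detail, study reading), newspaper-style (news), and 3-pane (documents) keep their narrow width.
Audited 18 pages → fixed 8; per-page adversarial verification (tokens/emoji/responsive/overflow/structure) all PASS.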

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 08:56:14 +09:00
53 changed files with 2683 additions and 221 deletions
+185 -10
@@ -21,8 +21,8 @@ from fastapi import (
UploadFile,
status,
)
from fastapi.responses import FileResponse
from pydantic import BaseModel
from fastapi.responses import FileResponse, StreamingResponse
from pydantic import BaseModel, field_validator
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from starlette.requests import ClientDisconnect
@@ -30,12 +30,19 @@ from starlette.requests import ClientDisconnect
from ai.client import AIClient, _load_prompt, parse_json_response
from core.auth import get_current_user
from core.config import settings
from core.database import get_session
from core.database import async_session, get_session
from core.utils import file_hash
from models.document import Document
from models.document_image import DocumentImage
from models.queue import ProcessingQueue, enqueue_stage
from models.user import User
from services.dedup import (
DUPLICATE_GROUPS_SQL,
DEDUP_OFF_CHANNELS,
find_canonical_for_hash,
find_near_duplicates,
)
from services.storage import StorageNotConfigured, get_storage_backend
from services.document_telemetry import record_analyze_event, sanitize_source
from services.prompt_versions import ANALYZE_PROMPT_VERSION, resolve_primary_model
from services.search.llm_gate import Priority, acquire_mlx_gate
@@ -62,6 +69,53 @@ def _upload_error(status_code: int, error_code: str, message: str) -> HTTPExcept
)
async def _near_dup_scan_bg(doc_id: int) -> None:
"""B-3: post-upload near_duplicate scan (BackgroundTask). Own session, best-effort.
Right after upload doc.embedding may not exist yet (embed stage not finished), so often only
trigram candidates are recorded; this is non-gating. No exception affects the upload result (201).
Persistence is deferred (on-the-fly): for now it only logs. Exposing near-dups in /duplicates is phase 2.
"""
try:
async with async_session() as bg_session:
findings = await find_near_duplicates(bg_session, doc_id)
if findings:
top = findings[0]
logger.info(
"[dedup] near_dup_scan doc=%s candidates=%d top=%s(cosine=%s)",
doc_id, len(findings), top["doc_id"], top.get("cosine"),
)
except Exception:
logger.warning("[dedup] near_dup_scan failed doc=%s", doc_id, exc_info=True)
def _parse_byte_range(range_header: str | None, size: int) -> tuple[int | None, int | None]:
"""Parse an HTTP Range header (`bytes=start-end`) → (start, end), inclusive. Returns (None, None) if absent or invalid.
For D-2 Range pass-through on remote backends (local is handled automatically by FileResponse). The suffix
form (`bytes=-N`) is also supported. For multiple ranges only the first is used.
"""
if not range_header or not range_header.startswith("bytes=") or size <= 0:
return None, None
spec = range_header[len("bytes="):].split(",")[0].strip()
if "-" not in spec:
return None, None
lo, hi = spec.split("-", 1)
try:
if lo == "": # suffix range: the last N bytes
n = int(hi)
if n <= 0:
return None, None
return max(0, size - n), size - 1
start = int(lo)
end = int(hi) if hi else size - 1
except ValueError:
return None, None
if start > end or start >= size:
return None, None
return start, min(end, size - 1)
# ─── Schemas ───
@@ -113,6 +167,10 @@ class DocumentResponse(BaseModel):
# Read tracking (library etc.), for the current user. 0/None in responses from other endpoints.
read_count: int = 0
last_read_at: datetime | None = None
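# S1-ADD (migration 287): original filename + dedup. The app decodes these as optional and falls back if absent.
original_filename: str | None = None # For the download label. If absent, falls back to the file_path basename (app side).
duplicate_of: int | None = None # canonical doc id (None if this row is itself canonical).
duplicate_count: int = 0 # Number of copies judged identical, excluding itself (on the canonical row).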
class Config:
from_attributes = True
@@ -140,6 +198,16 @@ class DocumentDetailResponse(DocumentResponse):
md_extraction_engine_version: str | None = None
md_generated_at: datetime | None = None
@field_validator("md_status", mode="before")
@classmethod
def _db_success_to_completed(cls, v: str | None) -> str | None:
"""The DB CHECK enum uses 'success'; the contract/fixtures and the app's MD-first render trigger use 'completed'.
One-way read-time (DB→API) mapping only; the write path (ORM) does not go through this model, so it is not applied there.
pending/processing/partial/failed/skipped are identical on both sides, so only 'success' is mapped.
(Invariant: md_status ∈ {success,partial} ⟹ md_content is non-blank = worker postcondition, C-5.)
"""
return "completed" if v == "success" else v
class AcceptSuggestionRequest(BaseModel):
"""§1 accept-suggestion request body: detects stale payloads / modified docs."""
@@ -192,6 +260,11 @@ async def get_document_tree(
FROM documents
WHERE ai_domain IS NOT NULL AND ai_domain != '' AND ai_domain != 'News'
AND deleted_at IS NULL
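-- Matches the default exclusions of the document list: news/law channels and memos do not appear in the list,
-- so the tree count must exclude them too, otherwise "the tree says N but clicking shows 0".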
AND source_channel != 'news'
AND source_channel != 'law_monitor'
AND file_type != 'note'
GROUP BY ai_domain
ORDER BY ai_domain
""")
@@ -524,6 +597,53 @@ async def list_documents(
)
# ─── Dedup: B-2 ───
# ★ The fixed-path route (/duplicates) must be registered *above* the dynamic /{doc_id} route to avoid a matching conflict.
class DuplicateGroup(BaseModel):
canonical_id: int
members: list[int]
reason: str
detail: str | None = None
class DuplicatesResponse(BaseModel):
groups: list[DuplicateGroup]
total_groups: int
total_duplicate_docs: int
@router.get("/duplicates", response_model=DuplicatesResponse)
async def list_duplicates(
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""List of content_hash (= exact file_hash) duplicate groups.
Excludes the OFF-whitelist (law_monitor) and deleted rows. Reuses idx_documents_hash (no new index/table needed).
Persistence of near_duplicate (similarity-based) groups is deferred → S1 exposes exact groups only (same contract shape;
only the detail text refers to 'file_hash'). Response shape = ds-app contract `documents_duplicates.json`.
"""
rows = (
await session.execute(DUPLICATE_GROUPS_SQL, {"off_channels": list(DEDUP_OFF_CHANNELS)})
).all()
groups = [
DuplicateGroup(
canonical_id=r.canonical_id,
members=list(r.members),
reason="content_hash",
detail="동일 file_hash (원본 바이트 SHA-256 일치)",
)
for r in rows
]
return DuplicatesResponse(
groups=groups,
total_groups=len(groups),
# Copy count = sum over groups of (member count - 1), canonical excluded; same as the fixture's total_duplicate_docs definition.
total_duplicate_docs=sum(len(g.members) - 1 for g in groups),
)
@router.get("/{doc_id}", response_model=DocumentDetailResponse)
async def get_document(
doc_id: int,
@@ -682,6 +802,7 @@ async def get_document_file(
session: Annotated[AsyncSession, Depends(get_session)],
token: str | None = Query(None, description="Bearer token (iframe용)"),
download: bool = Query(False, description="true면 attachment (브라우저 다운로드)"),
range_header: str | None = Header(None, alias="Range"),
user: User | None = Depends(lambda: None),
):
"""Serve the original document file (Bearer header or ?token= query parameter)"""
@@ -704,9 +825,10 @@ async def get_document_file(
if not doc.file_path:
raise HTTPException(status_code=404, detail="파일이 없는 문서입니다 (메모)")
file_path = Path(settings.nas_mount_path) / doc.file_path
if not file_path.exists():
raise HTTPException(status_code=404, detail="파일을 찾을 수 없습니다")
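# D-2: physical path resolution is unified through the storage backend. local=FileResponse (automatic Range) /
# remote=ABC.stream(range). The /file URL and body shape are unchanged (non-breaking). The only active backend
# is currently LocalBackend, so behavior does not change.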
backend = get_storage_backend()
# Media type mapping
# Includes audio/video MIME types for direct HTML5 <audio>/<video> playback. Starlette
@@ -727,7 +849,7 @@ async def get_document_file(
# Video: direct-play compatible (§3 minimal version)
".mp4": "video/mp4", ".webm": "video/webm",
}
suffix = file_path.suffix.lower()
suffix = Path(doc.file_path).suffix.lower()
media_type = media_types.get(suffix, "application/octet-stream")
# Content-Disposition: attachment when download=true (compatible with Korean filename*)
@@ -739,10 +861,40 @@ async def get_document_file(
else:
disposition = "inline"
return FileResponse(
path=str(file_path),
# Local backend: FileResponse as before (automatic Range handling).
if backend.is_local:
local = backend.local_path(doc.file_path)
if local is None or not Path(local).exists():
raise HTTPException(status_code=404, detail="파일을 찾을 수 없습니다")
return FileResponse(
path=str(local),
media_type=media_type,
headers={"Content-Disposition": disposition},
)
# Remote backend: Range pass-through via the D-1 ABC. On an unprovisioned backend stat() raises
# StorageNotConfigured → 503 (no silent fallback). Currently unreachable because only LocalBackend is active.
try:
st = await backend.stat(doc.file_path)
except StorageNotConfigured as exc:
raise HTTPException(status_code=503, detail=str(exc))
if not st.exists:
raise HTTPException(status_code=404, detail="파일을 찾을 수 없습니다")
start, end = _parse_byte_range(range_header, st.size)
headers = {"Content-Disposition": disposition, "Accept-Ranges": "bytes"}
if start is None:
headers["Content-Length"] = str(st.size)
status_code = 200
else:
headers["Content-Range"] = f"bytes {start}-{end}/{st.size}"
headers["Content-Length"] = str(end - start + 1)
status_code = 206
return StreamingResponse(
backend.stream(doc.file_path, start=start, end=end),
status_code=status_code,
media_type=media_type,
headers={"Content-Disposition": disposition},
headers=headers,
)
@@ -803,6 +955,7 @@ async def get_document_image_raw(
async def upload_document(
request: Request,
file: UploadFile,
background_tasks: BackgroundTasks,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
doc_purpose: str | None = Form(None, description="business | knowledge"),
@@ -954,6 +1107,9 @@ async def upload_document(
file_size=written,
file_type="immutable",
title=target.stem,
# B-1: original upload filename (for the download label). file_path gets renamed with _N on collision,
# so the original name is kept separately. safe_name = Path(file.filename).name (basename with path traversal removed).
original_filename=safe_name,
source_channel="manual",
doc_purpose=doc_purpose,
user_tags=[library_tag] if library_tag else [],
@@ -964,6 +1120,22 @@ async def upload_document(
)
session.add(doc)
await session.flush()
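# B-1: fill in exact file_hash duplicates (OFF-whitelist=law_monitor excluded). Not a rejection (409):
# allow + duplicate_of link + canonical duplicate_count++ (policy of preserving intentional duplicates of laws).
# Concurrency is low on a homelab, so the TOCTOU of concurrent same-hash uploads is handled by idempotency/B-4 backfill (no lock needed).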
canonical = await find_canonical_for_hash(session, fhash, exclude_id=doc.id)
if canonical is not None:
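# If the original canonical was soft-deleted (deleted_at) and a former member got promoted, clear that
# survivor's stale duplicate_of to prevent the 'both member and counter' contradiction (keeps the B-4 invariant).
# Documents are soft-delete only, so FK ON DELETE SET NULL never fires and the residue remains (found in review).
# (Residual pointers and overcounts of other sibling members pointing at the deleted canonical are cleaned up by the
# nightly dedup_reconcile job (B-4, 03:30 KST, idempotent absolute recomputation).)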
if canonical.duplicate_of is not None:
canonical.duplicate_of = None
doc.duplicate_of = canonical.id
canonical.duplicate_count = (canonical.duplicate_count or 0) + 1
# document + processing_queue are bundled in a single transaction for atomic cleanup
await enqueue_stage(session, doc.id, "extract")
await session.commit()
@@ -973,6 +1145,9 @@ async def upload_document(
target.unlink(missing_ok=True)
raise
# B-3: the near_duplicate scan runs asynchronously post-upload; it does not block the 201 response (non-gating record).
background_tasks.add_task(_near_dup_scan_bg, doc.id)
return DocumentResponse.model_validate(doc)
+4
@@ -48,6 +48,7 @@ async def lifespan(app: FastAPI):
from services.search.query_analyzer import prewarm_analyzer
from workers.briefing_worker import run as morning_briefing_run
from workers.daily_digest import run as daily_digest_run
from workers.dedup_reconcile import run as dedup_reconcile_run
from workers.digest_worker import run as global_digest_run
from workers.file_watcher import watch_inbox
from workers.law_monitor import run as law_monitor_run
@@ -120,6 +121,9 @@ async def lifespan(app: FastAPI):
# 이드 W3-2: derived snapshot of weak points in topics under study (nightly 04:30 KST, 0 LLM calls). Source for the study_diagnosis surface.
scheduler.add_job(study_weakness_run, CronTrigger(hour=4, minute=30, timezone=KST), id="study_weakness")
scheduler.add_job(news_collector_run, "interval", hours=6, id="news_collector")
# plan ds-s1-backend-1 B-4: nightly absolute recomputation of the dedup columns (duplicate_of/duplicate_count).
# Cleans up residual drift from soft-deletes (idempotent, no-op when there is no drift). cron 03:30 (no conflict with other jobs).
scheduler.add_job(dedup_reconcile_run, CronTrigger(hour=3, minute=30, timezone=KST), id="dedup_reconcile")
scheduler.start()
# Phase 2.1 (async architecture): QueryAnalyzer prewarm.
+14 -1
@@ -3,7 +3,7 @@
from datetime import datetime
from pgvector.sqlalchemy import Vector
from sqlalchemy import BigInteger, Boolean, DateTime, Enum, Integer, String, Text
from sqlalchemy import BigInteger, Boolean, DateTime, Enum, ForeignKey, Integer, String, Text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Mapped, mapped_column
@@ -28,6 +28,19 @@ class Document(Base):
)
import_source: Mapped[str | None] = mapped_column(Text)
# Layer 1: original name + dedup (S1-ADD, migration 287)
# original_filename = original upload filename (for the download label). file_path gets renamed with _N on collision.
# cf. distinct in meaning from original_format (for ODF conversion) / original_path, original_hash (007 legacy, dead).
# duplicate_of = canonical doc id (NULL if this row is itself canonical). FK ON DELETE SET NULL.
# duplicate_count = 'number of copies judged identical, excluding itself' stored on the canonical row (group_size-1). Updated by upload/backfill.
original_filename: Mapped[str | None] = mapped_column(Text)
duplicate_of: Mapped[int | None] = mapped_column(
BigInteger, ForeignKey("documents.id", ondelete="SET NULL")
)
duplicate_count: Mapped[int] = mapped_column(
Integer, nullable=False, default=0, server_default="0"
)
# Layer 2: text extraction
extracted_text: Mapped[str | None] = mapped_column(Text)
extracted_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
+3
@@ -21,3 +21,6 @@ pymupdf>=1.24.0
trafilatura>=1.12.0
readability-lxml>=0.8.1
markdownify>=0.13.1
# office OOXML (docx/xlsx/pptx) → md (plan ds-s1-backend-1 C-1). hwp goes through the LibreOffice+markdownify path.
# The exact pin is settled in the E-1 markitdown OOXML PoC (devsbx/version-pinned context).
markitdown[docx,xlsx,pptx]>=0.1.0
+239
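@@ -0,0 +1,239 @@
"""Shared dedup logic: plan ds-s1-backend-1, group B.
Shared by four consumers:
- B-1 fill on upload (api/documents.upload_document) → find_canonical_for_hash
- B-2 GET /documents/duplicates → DEDUP_OFF_CHANNELS / DUPLICATE_GROUPS_SQL (executed by the router)
- B-4 backfill (scripts/backfill_dedup.py) → DEDUP_OFF_CHANNELS / canonical = min(id)
- B-3 near_duplicate → find_near_duplicates
OFF-whitelist (DEDUP_OFF_CHANNELS):
law_monitor = amended versions of laws are intentionally kept as separate rows (to track amendment dates). Collapsing
them when file_hash matches would erase the amendment history, so the channel does not take part in dedup. (P0-2 measurement: of 18 dup groups/36 rows,
17 law_monitor groups = intentional amendment preservation, 1 manual group = genuine content dedup.)
file_hash already encodes a per-channel key (note=body SHA / devonagent=URL / news=article_id), so there is
no per-channel key branching and only a single OFF-list is kept as data (P0-2 decision).
near_duplicate (B-3):
title trigram candidates → cosine rerank with doc-level embeddings on the candidates only. Avoids scanning all 28.9k embeddings.
Stored embeddings are read-only (search-experiment Soft Lock: no regeneration). Thresholds and results are all non-gating recorded values
(trigram-first recall gap = near-dups with an identical body but different titles are missed → to be recovered by ivfflat in phase 2).
Persistence is deferred (on-the-fly): S1 goes as far as the helper + logging at the call site. duplicate_of is persisted for exact (file_hash) matches only.
"""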
from __future__ import annotations
import logging
from sqlalchemy import bindparam, or_, select, text
from sqlalchemy.ext.asyncio import AsyncSession
logger = logging.getLogger(__name__)
# Channels excluded from file_hash dedup (single OFF-whitelist). Shared by B-1/B-2/B-4.
DEDUP_OFF_CHANNELS: tuple[str, ...] = ("law_monitor",)
# near_duplicate parameters: all recorded values, non-gating (phase 2 ivfflat recovers the recall gap).
NEAR_DUP_TRGM_THRESHOLD = 0.30 # pg_trgm title candidate cutoff (loose, for candidate generation)
NEAR_DUP_COSINE_THRESHOLD = 0.95 # cosine cutoff for near-dup judgment on candidate embeddings (≈0.95~0.97)
NEAR_DUP_MAX_CANDIDATES = 50 # cap on trigram candidates, avoids a full embedding scan
async def find_canonical_for_hash(
session: AsyncSession, file_hash: str, *, exclude_id: int | None = None
):
"""Return the canonical document (oldest = min id) for the given file_hash, or None.
OFF-whitelist channels (law_monitor) are excluded from canonical candidates → uploads are never linked
to amended versions of laws. exclude_id = excludes the newly INSERTed row itself (B-1).
"""
from models.document import Document # deferred import (avoids a circular import)
stmt = (
select(Document)
.where(
Document.file_hash == file_hash,
Document.deleted_at.is_(None),
or_(
Document.source_channel.is_(None),
Document.source_channel.notin_(DEDUP_OFF_CHANNELS),
),
)
.order_by(Document.id.asc())
)
if exclude_id is not None:
stmt = stmt.where(Document.id != exclude_id)
return (await session.execute(stmt)).scalars().first()
# file_hash group SQL for B-2 /documents/duplicates. The router executes it directly (the Pydantic response lives in the router).
# reason='content_hash' = exact file_hash group (reuses idx_documents_hash, no new index/table needed).
# canonical_id = min(id), members = array of ids in ascending order, n = group size.
DUPLICATE_GROUPS_SQL = text(
"""
SELECT file_hash,
min(id) AS canonical_id,
array_agg(id ORDER BY id) AS members,
count(*) AS n
FROM documents
WHERE deleted_at IS NULL
AND file_hash IS NOT NULL
AND (source_channel IS NULL OR source_channel NOT IN :off_channels)
GROUP BY file_hash
HAVING count(*) > 1
ORDER BY min(id)
"""
).bindparams(bindparam("off_channels", expanding=True))
async def reconcile_dedup(
session: AsyncSession, *, apply: bool = True, chunk_size: int = 500, sample_size: int = 40
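) -> dict:
"""Recompute duplicate_of/duplicate_count for exact file_hash groups to bring them into consistency (B-4 core).
Idempotent: only rows that differ from the target values are UPDATEd. Shared by the nightly job (workers.dedup_reconcile)
and the backfill script. Documents are soft-delete only (FK ON DELETE SET NULL never fires) → the denormalized dedup columns
drift on deletion (stale member pointers, canonical overcount), so absolute recomputation guarantees consistency.
Returns {groups, docs, changes, applied, sample}. sample = preview of the changes to be (or already) applied (up to sample_size).
canonical = oldest in the group (min id): duplicate_of=NULL, duplicate_count=group_size-1. Members: duplicate_of=canonical, count=0.
"""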
groups = (
await session.execute(
DUPLICATE_GROUPS_SQL, {"off_channels": list(DEDUP_OFF_CHANNELS)}
)
).all()
desired: dict[int, tuple[int | None, int]] = {}
for g in groups:
members = list(g.members)
canonical = g.canonical_id
desired[canonical] = (None, len(members) - 1)
for m in members:
if m != canonical:
desired[m] = (canonical, 0)
if not desired:
return {"groups": 0, "docs": 0, "changes": 0, "applied": 0, "sample": []}
ids = list(desired.keys())
current: dict[int, tuple[int | None, int]] = {}
for i in range(0, len(ids), 1000):
batch = ids[i : i + 1000]
rows = (
await session.execute(
text(
"SELECT id, duplicate_of, duplicate_count "
"FROM documents WHERE id = ANY(:ids)"
).bindparams(ids=batch)
)
).all()
for r in rows:
current[r.id] = (r.duplicate_of, int(r.duplicate_count or 0))
changes = [
(i, dof, dcnt)
for i, (dof, dcnt) in desired.items()
if current.get(i) != (dof, dcnt)
]
sample = [
{"id": i, "duplicate_of": dof, "duplicate_count": dcnt}
for (i, dof, dcnt) in changes[:sample_size]
]
applied = 0
if apply and changes:
for i in range(0, len(changes), chunk_size):
for did, dof, dcnt in changes[i : i + chunk_size]:
await session.execute(
text(
"UPDATE documents SET duplicate_of = :dof, duplicate_count = :dcnt "
"WHERE id = :id"
).bindparams(dof=dof, dcnt=dcnt, id=did)
)
await session.commit()
applied += len(changes[i : i + chunk_size])
return {
"groups": len(groups),
"docs": len(ids),
"changes": len(changes),
"applied": applied,
"sample": sample,
}
async def find_near_duplicates(
session: AsyncSession,
doc_id: int,
*,
cosine_threshold: float = NEAR_DUP_COSINE_THRESHOLD,
trgm_threshold: float = NEAR_DUP_TRGM_THRESHOLD,
max_candidates: int = NEAR_DUP_MAX_CANDIDATES,
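) -> list[dict]:
"""Find near-duplicate candidates for the anchor doc in two stages, trigram → embedding (read-only).
Returns [{doc_id, title, title_sim?, cosine}] (cosine descending). When the embedding does not exist yet
(common right after upload), only trigram candidates are returned with cosine=None (non-gating record). No row is
modified or deleted, and only stored embeddings are read (Soft Lock respected).
"""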
anchor = (
await session.execute(
text(
"SELECT id, title, (embedding IS NOT NULL) AS has_emb "
"FROM documents WHERE id = :id AND deleted_at IS NULL"
).bindparams(id=doc_id)
)
).first()
if anchor is None or not anchor.title:
return []
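# (1) title trigram candidates. The similarity() cutoff reduces candidates to max_candidates, avoiding a full
# embedding scan. (The index-accelerated `%` operator path is a phase 2 optimization for when candidate
# generation becomes the bottleneck; sequentially evaluating 28.9k short titles is cheap enough in async post-upload.)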
cand_rows = (
await session.execute(
text(
"""
SELECT id, title, similarity(title, :t) AS title_sim
FROM documents
WHERE id <> :id
AND deleted_at IS NULL
AND title IS NOT NULL
AND similarity(title, :t) >= :trgm
ORDER BY similarity(title, :t) DESC
LIMIT :lim
"""
).bindparams(id=doc_id, t=anchor.title, trgm=trgm_threshold, lim=max_candidates)
)
).all()
if not cand_rows:
return []
if not anchor.has_emb:
# Embedding not generated yet: record candidates only (cosine rerank happens after the embed stage completes). non-gating.
return [
{"doc_id": r.id, "title": r.title, "title_sim": float(r.title_sim), "cosine": None}
for r in cand_rows
]
# (2) cosine rerank with doc-level embeddings on the candidates only. Stored values are read-only.
cand_ids = [r.id for r in cand_rows]
rer = (
await session.execute(
text(
"""
SELECT c.id, c.title,
(1 - (c.embedding <=> (SELECT embedding FROM documents WHERE id = :id))) AS cosine
FROM documents c
WHERE c.id = ANY(:ids) AND c.embedding IS NOT NULL
"""
).bindparams(id=doc_id, ids=cand_ids)
)
).all()
out = [
{"doc_id": r.id, "title": r.title, "cosine": float(r.cosine)}
for r in rer
if r.cosine is not None and float(r.cosine) >= cosine_threshold
]
out.sort(key=lambda x: x["cosine"], reverse=True)
return out
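The two-stage shape of find_near_duplicates (cheap title filter first, embedding rerank only on the survivors) can be sketched without a database. This is an illustration only: `trigram_similarity` is a rough stand-in for pg_trgm's `similarity()` (it pads the whole string rather than each word), and the dict layout, function names and default thresholds are assumptions made for the example, not values from the PR.

```python
import math


def trigrams(s: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded string (rough pg_trgm analogue)."""
    padded = f"  {s.lower()} "
    return {padded[i : i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def near_duplicates(anchor, docs, *, trgm_threshold=0.5, cosine_threshold=0.9, max_candidates=20):
    """Hypothetical in-memory version of the trigram -> cosine pipeline."""
    # Stage 1: cheap title filter, best matches first, capped at max_candidates.
    scored = [
        (trigram_similarity(anchor["title"], d["title"]), d)
        for d in docs
        if d["id"] != anchor["id"]
    ]
    candidates = sorted(
        ((s, d) for s, d in scored if s >= trgm_threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )[:max_candidates]
    if anchor["embedding"] is None:
        # No anchor embedding yet: report title candidates only, cosine unknown.
        return [{"doc_id": d["id"], "title_sim": s, "cosine": None} for s, d in candidates]
    # Stage 2: cosine rerank, restricted to candidates that have an embedding.
    out = []
    for _, d in candidates:
        if d["embedding"] is None:
            continue
        c = cosine(anchor["embedding"], d["embedding"])
        if c >= cosine_threshold:
            out.append({"doc_id": d["id"], "cosine": c})
    out.sort(key=lambda r: r["cosine"], reverse=True)
    return out
```

The point of the ordering is cost: stage 1 touches every title but does only string work, so the expensive vector comparison runs on at most `max_candidates` rows.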
+39
@@ -0,0 +1,39 @@
"""Storage-layer abstraction package (plan ds-s1-backend-1 group D, scaffold-first).

The active backend is selected by get_storage_backend():
- Decided by env DS_STORAGE_BACKEND (default 'local'), so it works without editing the
  storage section of config.yaml (config is untouchable during the search-experiment
  Soft Lock). Real activation of an external backend happens in D-3.
- 'local' → LocalBackend(settings.nas_mount_path): the current NAS NFS mount; /file behavior unchanged.
- 'nas_api'/'nas' → NasApiBackend(env DS_NAS_API_BASE_URL): 503 when unprovisioned (no silent fallback).
"""
from __future__ import annotations

import os
from functools import lru_cache

from core.config import settings

from .base import StatResult, StorageBackend, StorageNotConfigured
from .local import LocalBackend
from .nas_api import NasApiBackend

__all__ = [
    "StorageBackend",
    "StorageNotConfigured",
    "StatResult",
    "LocalBackend",
    "NasApiBackend",
    "get_storage_backend",
]


@lru_cache(maxsize=1)
def get_storage_backend() -> StorageBackend:
    """Return the single active storage backend (cached per process)."""
    backend = os.getenv("DS_STORAGE_BACKEND", "local").lower()
    if backend == "local":
        return LocalBackend(settings.nas_mount_path)
    if backend in ("nas_api", "nas"):
        return NasApiBackend(os.getenv("DS_NAS_API_BASE_URL"))
    raise StorageNotConfigured(f"unknown DS_STORAGE_BACKEND={backend!r}")
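One consequence of the per-process cache is worth spelling out: the environment variable is read once, on the first call. The sketch below reproduces that behavior with stand-ins; `LocalStub`, `RemoteStub`, `pick_backend` and `DEMO_STORAGE_BACKEND` are hypothetical names for this example, not part of the package.

```python
import os
from functools import lru_cache


class LocalStub:
    """Hypothetical stand-in for a local backend."""


class RemoteStub:
    """Hypothetical stand-in for a remote backend."""


@lru_cache(maxsize=1)
def pick_backend():
    """Read the env var once and cache the resulting backend object."""
    name = os.getenv("DEMO_STORAGE_BACKEND", "local").lower()
    if name == "local":
        return LocalStub()
    if name in ("nas_api", "nas"):
        return RemoteStub()
    raise ValueError(f"unknown backend {name!r}")


def demo() -> tuple[bool, bool]:
    os.environ["DEMO_STORAGE_BACKEND"] = "local"
    pick_backend.cache_clear()
    first = pick_backend()
    # Changing the env var after the first call has no effect on the cached value.
    os.environ["DEMO_STORAGE_BACKEND"] = "nas"
    still_cached = pick_backend() is first
    # Only an explicit cache_clear() makes the selector re-read the environment.
    pick_backend.cache_clear()
    switched = isinstance(pick_backend(), RemoteStub)
    return still_cached, switched
```

In tests that flip the backend, calling `cache_clear()` between cases is the usual way to avoid leaking the first selection. Note also that `lru_cache` does not cache exceptions, so an unknown value raises again on every call.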
+50
@@ -0,0 +1,50 @@
"""Abstract storage-backend interface (plan ds-s1-backend-1 D-1).

The ABC includes the Range (offset/length) stream contract from day one, so that the remote
streaming Range pass-through in D-2 is an interface obligation rather than an afterthought.
Backends with is_local=True expose a local filesystem path, so callers keep using Starlette
FileResponse (which handles Range automatically). Remote backends implement Range via
stream()/stat().
"""
from __future__ import annotations

import os
from abc import ABC, abstractmethod
from collections.abc import AsyncIterator
from dataclasses import dataclass


class StorageNotConfigured(RuntimeError):
    """Call to a backend that is not activated (unprovisioned); surfaced as 503. Silent fallback is forbidden."""


@dataclass
class StatResult:
    exists: bool
    size: int


class StorageBackend(ABC):
    """Abstract interface for access to original files."""

    # Whether a local filesystem path is exposed (i.e. FileResponse can be used directly).
    is_local: bool = False

    @abstractmethod
    def local_path(self, rel_path: str) -> os.PathLike[str] | None:
        """Physical path when is_local=True (for FileResponse). Remote backends return None."""

    @abstractmethod
    async def stat(self, rel_path: str) -> StatResult:
        """Size and existence. An unconfigured backend raises StorageNotConfigured."""

    @abstractmethod
    def stream(
        self, rel_path: str, *, start: int | None = None, end: int | None = None
    ) -> AsyncIterator[bytes]:
        """Yield the inclusive byte range [start, end] as async chunks (Range pass-through).

        If start/end are None, the whole file. An unconfigured backend raises StorageNotConfigured.
        """
        raise NotImplementedError
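The stream() contract takes an inclusive [start, end] pair, which is also how an HTTP `Range: bytes=a-b` header expresses a range. A caller of a remote backend would need some translation step between the two; the helper below is one possible sketch. `parse_range` is a hypothetical name, it handles a single range only, and it is not code from this PR.

```python
from __future__ import annotations


def parse_range(header: str | None, size: int) -> tuple[int, int] | None:
    """Turn a single-range `Range` header into an inclusive (start, end) pair.

    Returns None when there is no header (meaning: stream the whole file).
    Raises ValueError for malformed, multi-range, or unsatisfiable requests.
    """
    if not header:
        return None
    unit, _, spec = header.partition("=")
    if unit.strip().lower() != "bytes" or "," in spec:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, sep, last = spec.strip().partition("-")
    if not sep:
        raise ValueError(f"malformed Range header: {header!r}")
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the file.
        n = int(last)
        if n <= 0:
            raise ValueError("suffix length must be positive")
        start, end = max(size - n, 0), size - 1
    else:
        start = int(first)
        # Open-ended "bytes=N-" runs to the last byte; an oversized end is clamped.
        end = min(int(last), size - 1) if last else size - 1
    if start > end or start >= size:
        raise ValueError("range not satisfiable")
    return start, end
```

The returned pair can be passed straight through as `start=`/`end=`, and `end - start + 1` gives the Content-Length of the partial response. The `size` argument would typically come from `stat()`.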
+50
@@ -0,0 +1,50 @@
"""LocalBackend: the current NAS NFS (volume4) mount. /file behavior unchanged (plan D-1)."""
from __future__ import annotations

import os
from collections.abc import AsyncIterator
from pathlib import Path

from .base import StatResult, StorageBackend

_STREAM_CHUNK = 256 * 1024


class LocalBackend(StorageBackend):
    """Resolves relative paths under the root (= settings.nas_mount_path) on the local filesystem."""

    is_local = True

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def local_path(self, rel_path: str) -> os.PathLike[str]:
        return self._root / rel_path

    async def stat(self, rel_path: str) -> StatResult:
        p = self._root / rel_path
        if not p.exists():
            return StatResult(exists=False, size=0)
        return StatResult(exists=True, size=p.stat().st_size)

    async def stream(
        self, rel_path: str, *, start: int | None = None, end: int | None = None
    ) -> AsyncIterator[bytes]:
        """Stream a local file in chunks (Range supported).

        On the local /file path FileResponse handles Range automatically, so this method
        exists for interface symmetry and parity with remote backends.
        """
        p = self._root / rel_path
        with p.open("rb") as f:
            if start:
                f.seek(start)
            # Inclusive range: bytes left to send, or None for "until EOF".
            remaining = None if end is None else (end - (start or 0) + 1)
            while True:
                to_read = _STREAM_CHUNK if remaining is None else min(_STREAM_CHUNK, remaining)
                if to_read <= 0:
                    break
                data = f.read(to_read)
                if not data:
                    break
                yield data
                if remaining is not None:
                    remaining -= len(data)
+33
@@ -0,0 +1,33 @@
"""NasApiBackend: stub for external storage (Mac mini 4TB / NAS Docker API) (plan D-1).

★ Unprovisioned = 503. No silent fallback (no automatic detour to another backend).
Activated in D-3 after real provisioning. Updating infra_inventory.md (Update Rule) comes first.
"""
from __future__ import annotations

import os
from collections.abc import AsyncIterator

from .base import StatResult, StorageBackend, StorageNotConfigured

_MSG = "NasApiBackend 미구성 — 외부 스토리지 프로비전 후 활성(D-3). silent fallback 없음."


class NasApiBackend(StorageBackend):
    is_local = False

    def __init__(self, base_url: str | None = None) -> None:
        self._base_url = base_url

    def local_path(self, rel_path: str) -> os.PathLike[str] | None:
        return None

    async def stat(self, rel_path: str) -> StatResult:
        raise StorageNotConfigured(_MSG)

    async def stream(
        self, rel_path: str, *, start: int | None = None, end: int | None = None
    ) -> AsyncIterator[bytes]:
        raise StorageNotConfigured(_MSG)
        yield b""  # unreachable; keeps the async-generator shape (matches the caller's `async for` contract)
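The unreachable `yield` has a practical side effect: because the function is an async generator, calling it does not raise. The exception only appears when the caller starts iterating. A small standalone demonstration, with `NotConfigured`, `stub_stream` and `probe` as hypothetical names for this example:

```python
import asyncio
from collections.abc import AsyncIterator


class NotConfigured(RuntimeError):
    """Hypothetical stand-in for the backend's 'not provisioned' error."""


async def stub_stream() -> AsyncIterator[bytes]:
    raise NotConfigured("backend not provisioned")
    yield b""  # unreachable; its presence makes this an async generator


async def probe() -> list[str]:
    events: list[str] = []
    gen = stub_stream()  # no exception yet: the generator body has not started
    events.append("created")
    try:
        async for _ in gen:
            events.append("chunk")
    except NotConfigured:
        # The raise surfaces on the first iteration step, not at call time.
        events.append("raised-on-iteration")
    return events
```

For an HTTP handler this timing matters: if the generator were handed to a streaming response first, the error could surface after the status line has been chosen. Calling `stat()` before streaming (which raises immediately when awaited) is one way a caller could map the condition to 503 up front.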
+32
@@ -0,0 +1,32 @@
"""Nightly job that recomputes the dedup columns (plan ds-s1-backend-1 B-4, 'nightly batch').

duplicate_of / duplicate_count are a denormalized cache. Documents are soft-delete only
(deleted_at), so FK ON DELETE SET NULL never fires, and soft-deleting a canonical or a member
leaves residual drift (stale pointers on members, overcount on the canonical). The B-1
upload-time fill only handles new rows, so this nightly absolute recompute is what guarantees
overall consistency. Idempotent: with no drift it is a no-op (log only).
The app (S3) reads the response contract (DocumentResponse.duplicate_count/duplicate_of), so
consistency matters.
"""
import logging

from core.database import async_session
from services.dedup import reconcile_dedup

logger = logging.getLogger("dedup_reconcile")


async def run() -> None:
    try:
        async with async_session() as session:
            r = await reconcile_dedup(session, apply=True)
            if r["changes"]:
                logger.info(
                    "[dedup_reconcile] groups=%s docs=%s changes=%s applied=%s",
                    r["groups"], r["docs"], r["changes"], r["applied"],
                )
            else:
                logger.info(
                    "[dedup_reconcile] no drift (groups=%s docs=%s)", r["groups"], r["docs"]
                )
    except Exception:
        logger.exception("[dedup_reconcile] failed")
+72 -6
@@ -17,6 +17,7 @@ md_content ref 형식: `![alt](docimg:img_001)` — image_key 가 sequence 기
plan: ~/.claude/plans/piped-humming-crystal.md
"""
import asyncio
import base64
import hashlib
import json
@@ -68,9 +69,13 @@ _FORMAT_TO_MIME = {
"gif": "image/gif",
}
# Phase 1B = PDF only. DOCX and other formats come in a later phase.
# Phase 1B = PDF only (marker-service). office/hwp branch off to the office_md hybrid in C-2.
SUPPORTED_EXTENSIONS = {".pdf"}
# C-2: office/hwp → md (OOXML=markitdown / hwp=LibreOffice). The set of suffixes the converter supports.
# Legacy binaries (.doc/.xls/.ppt) are unsupported by markitdown, so they are absent here (= skipped at the PDF-only gate).
from workers.office_md import SUPPORTED as OFFICE_MD_SUPPORTED # noqa: E402
# Uses the Korean labels from config.yaml document_types directly (pre-flight result).
# Round 0 user intent = table-heavy ordering/calculation/specification domain.
SKIP_DOC_TYPES = {
@@ -177,9 +182,18 @@ async def process(document_id: int, session: AsyncSession) -> None:
return
container_path = _to_marker_path(doc.file_path)
# ---- (3) PDF only ----
suffix = Path(container_path).suffix.lower()
# ---- (3) office/hwp → md (C-2): supported non-PDF formats go through the office_md hybrid conversion ----
if suffix in OFFICE_MD_SUPPORTED:
await session.execute(
update(Document).where(Document.id == document_id).values(md_status="processing")
)
await session.commit()
await _process_office(doc, document_id, container_path, session)
return
# ---- (3.5) PDF only (any other extension = skip) ----
if suffix not in SUPPORTED_EXTENSIONS:
logger.info(f"markdown_skip_unsupported_extension id={document_id} ext={suffix}")
await _set_skipped(
@@ -368,6 +382,56 @@ async def _process_markdown_passthrough(
)
async def _process_office(
    doc: Document, document_id: int, container_path: str, session: AsyncSession
) -> None:
    """office/hwp → md (C-2). The office arm of the C-5 state-machine postcondition.

    office_md.convert_office_to_md has a binary contract: success = returns non-blank md /
    failure, empty output, timeout, or missing dependency = raises OfficeMdError. Therefore:
    - success → md_status='success' (+ non-blank md). Keeps the invariant
      md_status ∈ {success,partial} ⟹ md non-blank.
    - failure/exception → _fail (md_status='failed', ¬success·¬skipped). A silent
      'success + empty md' never happens.
    The partial arm is PDF-split only; office is binary, so it is absent here. 'completed' is
    reserved for A-3 serialization (unused by workers).
    quality is content-type-aware: office = scored (_compute_quality). The synchronous
    conversion runs via to_thread so the event loop is not blocked.
    """
    from workers.office_md import OfficeMdError, convert_office_to_md

    is_hwp = Path(container_path).suffix.lower() in (".hwp", ".hwpx")
    engine = "libreoffice_hwp" if is_hwp else "markitdown"
    try:
        # Synchronous subprocess (LibreOffice) / markitdown: moved to a thread so the event loop is not blocked.
        md_content = await asyncio.to_thread(convert_office_to_md, container_path)
    except OfficeMdError as exc:
        logger.warning(f"[marker] office md 변환 실패 id={document_id} engine={engine}: {exc}")
        await _fail(session, document_id, f"office_md: {str(exc)[:990]}", engine=engine)
        return
    except Exception as exc:  # unexpected exceptions also become failed (never success + empty md)
        logger.exception(f"[marker] office md unexpected error id={document_id}: {exc}")
        await _fail(session, document_id, f"office_md_unexpected: {str(exc)[:980]}", engine=engine)
        return
    # Success: by contract md_content is non-blank (empty output raises). quality is scored.
    quality = _compute_quality(md_content, doc.extracted_text or "", {"page_count": None})
    await session.execute(
        update(Document).where(Document.id == document_id).values(
            md_content=md_content,
            md_status="success",
            md_extraction_engine=engine,
            md_extraction_engine_version=None,
            md_extraction_quality=quality,
            md_content_hash=hashlib.sha256(md_content.encode("utf-8")).hexdigest(),
            md_source_hash=doc.file_hash,
            md_generated_at=_now(),
            md_extraction_error=None,
            md_frontmatter=doc.md_frontmatter or {},
            md_format_version="1.0",
            content_origin="extracted",
        )
    )
    await session.commit()
    logger.info(f"[marker] office success id={document_id} engine={engine} len={len(md_content)}")
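The control flow above combines two ideas: a blocking converter pushed onto a worker thread with `asyncio.to_thread`, and a binary contract in which every outcome maps to exactly one of success-with-content or failed. A reduced, database-free sketch follows; `ConvertError`, `slow_convert` and `convert_with_status` are hypothetical stand-ins, and the returned dict only imitates the columns being set.

```python
import asyncio
import time


class ConvertError(Exception):
    """Hypothetical stand-in for the converter's failure signal."""


def slow_convert(text: str) -> str:
    """Blocking conversion: returns non-blank output or raises, never an empty string."""
    time.sleep(0.05)  # simulates a subprocess call that would block the event loop
    out = text.strip()
    if not out:
        raise ConvertError("empty output")
    return out


async def convert_with_status(text: str) -> dict:
    """Map the binary contract onto a status record."""
    try:
        # Runs in a worker thread so other coroutines keep making progress.
        md = await asyncio.to_thread(slow_convert, text)
    except ConvertError as exc:
        return {"md_status": "failed", "md_content": None, "error": str(exc)[:990]}
    except Exception as exc:  # anything unexpected is still a failure, not a silent success
        return {"md_status": "failed", "md_content": None, "error": f"unexpected: {exc}"[:990]}
    return {"md_status": "success", "md_content": md, "error": None}
```

Since the success branch is only reachable when the converter returned, and the converter never returns blank output, the invariant "success implies non-blank content" holds by construction instead of by a later check.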
async def _process_split(
doc: Document,
document_id: int,
@@ -779,15 +843,17 @@ async def _set_skipped(session: AsyncSession, document_id: int, reason: str) ->
await session.commit()
async def _fail(session: AsyncSession, document_id: int, error: str) -> None:
"""doc-level failed (retrying is pointless)."""
async def _fail(
session: AsyncSession, document_id: int, error: str, *, engine: str = "marker"
) -> None:
"""doc-level failed (retrying is pointless). engine = the conversion engine that failed (office=markitdown/libreoffice_hwp)."""
await session.execute(
update(Document).where(Document.id == document_id).values(
md_status="failed",
md_content=None,
md_content_hash=None,
md_extraction_error=error,
md_extraction_engine="marker",
md_extraction_engine=engine,
md_generated_at=_now(),
content_origin="extracted",
)
+136
@@ -0,0 +1,136 @@
"""office/hwp → Markdown hybrid converter (plan ds-s1-backend-1, C-1 PoC).

PoC status: not yet wired into marker_worker (that is C-2). This module provides only the
conversion *contract* and the pure functions called by the PoC harness (scripts/poc_office_md.py).

Strategy (hybrid):
- OOXML (.docx/.xlsx/.pptx) → markitdown. New dependency (pip install markitdown). Lazy import.
- .hwp/.hwpx → LibreOffice (headless) → HTML → markdownify. markdownify is an existing dependency.
  (LibreOffice ships an hwp import filter. .hwpx uses a different filter from .hwp and is
  version dependent → E-1: run the PoC in a safe context with the prod LibreOffice version
  pinned. Fidelity is the real risk; the harness measures it.)

Failure contract (backend half of the C-5 postcondition):
  Conversion failure, empty output, timeout, or a missing dependency → raise OfficeMdError.
  **success + empty md is never returned**. The caller (C-2 marker_worker) catches this and
  routes to md_status='failed' (¬success·¬skipped). Invariant: md_status ∈ {success,partial}
  ⟹ md_content non-blank.
"""
from __future__ import annotations

import os
import shutil
import subprocess
import tempfile
from pathlib import Path

OOXML_FORMATS = {".docx", ".xlsx", ".pptx"}
HWP_FORMATS = {".hwp", ".hwpx"}
SUPPORTED = OOXML_FORMATS | HWP_FORMATS

# Empty-output threshold: fewer characters than this after stripping whitespace counts as
# 'failed (empty conversion)'.
_MIN_BODY_CHARS = 16

# extract_worker.py already extracts office text successfully with the `libreoffice` binary
# (the name verified in the container), so the default matches it. Environments that only
# have soffice override it via LIBREOFFICE_BIN.
_SOFFICE_BIN = os.environ.get("LIBREOFFICE_BIN", "libreoffice")


class OfficeMdError(Exception):
    """Signals an office/hwp → md conversion failure. Callers route it to md_status='failed'."""


def convert_office_to_md(path: str | Path, *, timeout: int = 90) -> str:
    """Convert an office/hwp file to a Markdown string. Raises OfficeMdError on failure or empty output."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix not in SUPPORTED:
        raise OfficeMdError(f"unsupported suffix for office_md: {suffix!r}")
    if not p.exists():
        raise OfficeMdError(f"file not found: {p}")
    if suffix in OOXML_FORMATS:
        md = _via_markitdown(p)
    else:  # .hwp / .hwpx
        md = _via_libreoffice_html(p, timeout=timeout)
    md = (md or "").strip()
    if len(md) < _MIN_BODY_CHARS:
        raise OfficeMdError(f"empty/too-short conversion ({len(md)} chars) for {p.name}")
    return md


def _via_markitdown(path: Path) -> str:
    try:
        from markitdown import MarkItDown  # lazy: new dependency
    except ImportError as e:  # noqa: BLE001
        raise OfficeMdError(
            "markitdown 미설치 (OOXML 변환에 필요) — `pip install markitdown`. "
            "C-1 PoC 는 prod worker 이미지/버전핀 컨텍스트에서 실행(E-1)."
        ) from e
    try:
        result = MarkItDown().convert(str(path))
    except Exception as e:  # noqa: BLE001 (any conversion exception is routed to failed)
        raise OfficeMdError(f"markitdown 변환 실패: {path.name}: {e}") from e
    return getattr(result, "text_content", "") or ""


def _via_libreoffice_html(path: Path, *, timeout: int) -> str:
    """Convert to HTML with headless LibreOffice, then markdownify. For hwp/hwpx."""
    try:
        from markdownify import markdownify  # existing dependency
    except ImportError as e:  # noqa: BLE001
        raise OfficeMdError("markdownify 미설치(기존 의존성이어야 함)") from e
    with tempfile.TemporaryDirectory(prefix="office_md_") as tmp:
        tmpdir = Path(tmp)
        # Avoid user-profile lock conflicts when soffice runs concurrently: one isolated profile per call.
        profile = tmpdir / "lo_profile"
        cmd = [
            _SOFFICE_BIN,
            "--headless",
            "--nologo",
            "--nofirststartwizard",
            f"-env:UserInstallation=file://{profile}",
            "--convert-to",
            "html",
            "--outdir",
            str(tmpdir),
            str(path),
        ]
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout, check=False
            )
        except FileNotFoundError as e:
            raise OfficeMdError(
                f"LibreOffice 바이너리 부재({_SOFFICE_BIN}) — LIBREOFFICE_BIN 설정 또는 설치 필요"
            ) from e
        except subprocess.TimeoutExpired as e:
            raise OfficeMdError(f"LibreOffice 변환 타임아웃({timeout}s): {path.name}") from e
        html_path = tmpdir / f"{path.stem}.html"
        if proc.returncode != 0 or not html_path.exists():
            raise OfficeMdError(
                f"LibreOffice html 변환 실패: {path.name} (rc={proc.returncode}): "
                f"{(proc.stderr or proc.stdout or '').strip()[:300]}"
            )
        html = html_path.read_text(encoding="utf-8", errors="replace")
        # markdownify renders tables as GFM so they are preserved; heading_style ATX.
        return markdownify(html, heading_style="ATX", strip=["span", "font"])


def table_fidelity(md: str) -> dict:
    """Crude indicator of E-1 table fidelity: counts of GFM table rows/separator rows (a regression signal, not a precise evaluation)."""
    lines = md.splitlines()
    pipe_rows = sum(1 for ln in lines if ln.strip().startswith("|") and ln.strip().endswith("|"))
    sep_rows = sum(
        1 for ln in lines
        if ln.strip().startswith("|") and set(ln.strip()) <= set("|-: ")
    )
    return {
        "chars": len(md),
        "lines": len(lines),
        "table_pipe_rows": pipe_rows,
        "table_separator_rows": sep_rows,  # approximates the number of tables
        "has_heading": any(ln.lstrip().startswith("#") for ln in lines),
    }
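To see what the table_fidelity counters report, it helps to run them on a small GFM document. The block below repeats the counting logic under a different name (`table_signal`, hypothetical) so that it runs standalone; the sample Markdown is made up for the example.

```python
def table_signal(md: str) -> dict:
    """Standalone copy of the counting logic: pipe rows, separator rows, heading flag."""
    lines = md.splitlines()
    pipe_rows = sum(1 for ln in lines if ln.strip().startswith("|") and ln.strip().endswith("|"))
    sep_rows = sum(
        1 for ln in lines
        if ln.strip().startswith("|") and set(ln.strip()) <= set("|-: ")
    )
    return {
        "lines": len(lines),
        "table_pipe_rows": pipe_rows,
        "table_separator_rows": sep_rows,
        "has_heading": any(ln.lstrip().startswith("#") for ln in lines),
    }


SAMPLE = "\n".join(
    [
        "# Orders",
        "",
        "| item | qty |",
        "| --- | --- |",
        "| bolt | 10 |",
        "| nut | 20 |",
    ]
)

result = table_signal(SAMPLE)
```

Here `result` reports 4 pipe rows (header, separator and two data rows), 1 separator row and a heading. Since each GFM table has exactly one separator row, the separator count approximates the number of tables; a conversion that flattens a table into plain text would drive both counters to zero, which is the regression the metric is meant to catch.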
+20 -4
@@ -1,16 +1,32 @@
import SwiftUI
import AppFeature
/// Thin @main entry: window + DI only. Injects AppModel (FixtureDSClient + AIRouter(MockAIProvider))
/// so the whole pipeline renders with zero real backend / zero real LLM. Feature logic lives in
/// AppFeature, keeping the seam to a future Xcode/iPhone target trivial.
/// Thin @main entry: window + DI only. Default = the live (GPU DS) server (AppModel.live:
/// LiveDSClient + AIFabric, base publicTLS = https://document.hyungi.net/api).
/// env overrides: DSAPP_FIXTURE=1 (Fixture+Mock) / DSAPP_DS_URL replaces the base
/// (e.g. http://100.110.63.63:8000/api). Feature logic lives in AppFeature, keeping the seam to a
/// future iPhone/Watch target trivial.
@main
struct DSApp: App {
@State private var model: AppModel
@MainActor
init() {
_model = State(initialValue: AppModel.preview)
let env = ProcessInfo.processInfo.environment
let initial: AppModel
if env["DSAPP_FIXTURE"] == "1" {
initial = .preview
} else if let raw = env["DSAPP_DS_URL"] {
// A dev override must never silently fall back to prod (publicTLS): fail loudly on a malformed URL.
let trimmed = raw.hasSuffix("/") ? String(raw.dropLast()) : raw
guard let url = URL(string: trimmed), url.scheme != nil, url.host() != nil else {
fatalError("DSAPP_DS_URL 파싱 실패: \(raw)")
}
initial = .live(base: .custom(url))
} else {
initial = .live()
}
_model = State(initialValue: initial)
}
var body: some Scene {
+5
@@ -47,6 +47,11 @@ let package = Package(
dependencies: ["DSKit", "AIFabric"],
swiftSettings: [.swiftLanguageMode(.v6)]
),
.testTarget(
name: "AppFeatureTests",
dependencies: ["AppFeature", "DSKit"],
swiftSettings: [.swiftLanguageMode(.v6)]
),
.testTarget(
name: "AITests",
dependencies: ["AIFabric"],
@@ -11,15 +11,14 @@ struct DashboardView: View {
if let s = model.stats {
LazyVGrid(columns: [GridItem(.adaptive(minimum: 150), spacing: 12)], spacing: 12) {
StatCard(title: "전체", value: s.total, color: Sage.brand)
StatCard(title: "문서", value: s.documents, color: Sage.brand)
StatCard(title: "검토 대기", value: s.reviewPending, color: Sage.amber)
StatCard(title: "파이프라인 실패", value: s.pipelineFailed, color: Sage.danger)
StatCard(title: "문서", value: s.counts["document"] ?? 0, color: Sage.brand)
StatCard(title: "승인 대기", value: s.libraryPendingSuggestions, color: Sage.amber)
}
VStack(alignment: .leading, spacing: 10) {
Text("도메인 분포").font(.headline).foregroundStyle(Sage.ink)
ForEach(s.byDomain.sorted { $0.value > $1.value }, id: \.key) { key, value in
DomainBar(name: key, count: value, max: s.byDomain.values.max() ?? 1)
Text("카테고리 분포").font(.headline).foregroundStyle(Sage.ink)
ForEach(s.counts.sorted { $0.value > $1.value }, id: \.key) { key, value in
DomainBar(name: Self.categoryLabel(key), count: value, max: s.counts.values.max() ?? 1)
.contentShape(Rectangle())
.onTapGesture { model.section = .documents }
}
@@ -35,4 +34,18 @@ struct DashboardView: View {
}
.background(Sage.surface)
}
/// Display label for a category enum key (unknown keys fall back to the raw value).
static func categoryLabel(_ key: String) -> String {
switch key {
case "document": return "문서"
case "library": return "자료실"
case "news": return "뉴스"
case "law": return "법령"
case "memo": return "메모"
case "audio": return "오디오"
case "video": return "비디오"
default: return key
}
}
}
@@ -0,0 +1,97 @@
import SwiftUI
import DSKit
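/// Login screen for the live (GPU DS) server (FU-E).
/// Shown when the refresh attempt fails or the session expires; the session itself rides on the
/// HttpOnly refresh cookie. TOTP is optional (only when configured).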
public struct LoginView: View {
@Environment(AppModel.self) private var model
@State private var username = ""
@State private var password = ""
@State private var totp = ""
@State private var submitting = false
@FocusState private var focus: Field?
private enum Field { case username, password, totp }
public init() {}
public var body: some View {
VStack(spacing: 0) {
Spacer()
VStack(alignment: .leading, spacing: 16) {
VStack(alignment: .leading, spacing: 4) {
Text("Document Server")
.font(.title2.weight(.semibold))
.foregroundStyle(Sage.ink)
Text(serverHost)
.font(.caption)
.foregroundStyle(Sage.muted)
}
VStack(spacing: 10) {
TextField("아이디", text: $username)
.textFieldStyle(.roundedBorder)
.focused($focus, equals: .username)
.onSubmit { focus = .password }
SecureField("비밀번호", text: $password)
.textFieldStyle(.roundedBorder)
.focused($focus, equals: .password)
.onSubmit { submit() }
TextField("2FA 코드 (설정한 경우)", text: $totp)
.textFieldStyle(.roundedBorder)
.focused($focus, equals: .totp)
.onSubmit { submit() }
}
if let error = model.loginError {
Text(error)
.font(.callout)
.foregroundStyle(Sage.danger)
.fixedSize(horizontal: false, vertical: true)
}
Button(action: submit) {
Group {
if submitting {
ProgressView().controlSize(.small)
} else {
Text("로그인")
}
}
.frame(maxWidth: .infinity)
}
.buttonStyle(.borderedProminent)
.tint(Sage.brand)
.disabled(submitting || username.isEmpty || password.isEmpty)
}
.padding(28)
.frame(width: 360)
.background(Sage.card, in: RoundedRectangle(cornerRadius: 12))
.overlay(RoundedRectangle(cornerRadius: 12).stroke(Sage.line))
Spacer()
}
.frame(maxWidth: .infinity, maxHeight: .infinity)
.background(Sage.surface)
.onAppear { focus = .username }
}
/// Host of the current base (e.g. document.hyungi.net / 100.110.63.63).
private var serverHost: String {
model.base.url.host() ?? model.base.url.absoluteString
}
private func submit() {
guard !submitting, !username.isEmpty, !password.isEmpty else { return }
submitting = true
Task {
await model.login(username: username, password: password, totp: totp)
submitting = false
}
}
}
#if DEBUG
#Preview("로그인") {
@Previewable @State var model = AppModel.preview
LoginView()
.environment(model)
.frame(width: 700, height: 500)
}
#endif
@@ -3,6 +3,7 @@ import DSKit
/// DEVONthink-style 3-column shell. RootView only ROUTES; each page owns its own interior treatment
/// (no shell-level auto-inherit). macOS-only target.
/// Auth routing: checking (refresh attempt) → loggedOut (LoginView) → ready (3-pane shell).
public struct RootView: View {
@Environment(AppModel.self) private var model
@State private var columnVisibility: NavigationSplitViewVisibility = .all
@@ -10,6 +11,22 @@ public struct RootView: View {
public init() {}
public var body: some View {
Group {
switch model.authPhase {
case .checking:
ProgressView("서버 연결 확인 중")
.frame(maxWidth: .infinity, maxHeight: .infinity)
.background(Sage.surface)
case .loggedOut:
LoginView()
case .ready:
shell
}
}
.task { await model.bootstrap() }
}
private var shell: some View {
NavigationSplitView(columnVisibility: $columnVisibility) {
Sidebar()
.navigationSplitViewColumnWidth(min: 220, ideal: 250)
@@ -21,7 +38,24 @@ public struct RootView: View {
}
.navigationSplitViewStyle(.balanced)
.tint(Sage.brand)
.task { await model.loadInitial() }
.safeAreaInset(edge: .bottom) {
// Surface load errors in a banner instead of hiding them (no-silent-fallback).
if let err = model.errorText {
HStack(spacing: 10) {
Text(err)
.font(.callout)
.foregroundStyle(.white)
.lineLimit(2)
Spacer()
Button("닫기") { model.errorText = nil }
.buttonStyle(.plain)
.foregroundStyle(.white.opacity(0.85))
}
.padding(.horizontal, 14)
.padding(.vertical, 8)
.background(Sage.danger)
}
}
}
}
@@ -120,7 +154,6 @@ struct EmptyState: View {
@Previewable @State var model = AppModel.preview
RootView()
.environment(model)
.task { await model.loadInitial() }
.frame(minWidth: 1000, minHeight: 660)
}
#endif
@@ -23,6 +23,10 @@ public final class AppModel {
}
}
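/// Auth phase: refresh attempt at launch (checking) → login required (loggedOut)
/// → authenticated (ready). With the Fixture client, refresh returns the fixture token, so it goes straight to ready.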
public enum AuthPhase: Equatable { case checking, loggedOut, ready }
public var section: Section = .dashboard
public var selectedDocumentID: Int?
public var selectedMemoID: Int?
@@ -41,14 +45,23 @@ public final class AppModel {
public var digest: DigestResponse?
public var errorText: String?
public private(set) var authPhase: AuthPhase = .checking
/// Login failure reason (shown on the login screen).
public var loginError: String?
/// bootstrap single-shot guard (prevents re-entry).
private var didBootstrap = false
let client: any DSClient
let ai: AIService
/// Placeholder token from the auth fixture builds a real-SHAPED download URL with no expectation it resolves offline.
/// DS base URL (injected by live()/preview).
let base: DSBaseURL
/// Current access token (used to build the ?token= download URL). Set by bootstrap/login.
public private(set) var accessToken: String = ""
public init(client: any DSClient, ai: AIService) {
public init(client: any DSClient, ai: AIService, base: DSBaseURL = .publicTLS) {
self.client = client
self.ai = ai
self.base = base
}
@MainActor
@@ -56,8 +69,66 @@ public final class AppModel {
AppModel(client: FixtureDSClient(), ai: AIService(router: AppAIComposition.mockRouter()))
}
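/// Live (GPU DS) composition: LiveDSClient + AIFabric (realRouter). The ask closure shares the
/// client's TokenProvider (so it follows tokens rotated by a 401 refresh). Default persistence = InMemory:
/// the access token lives 15 minutes, and the HttpOnly refresh cookie (7 days,
/// kept in HTTPCookieStorage) carries the session. No Keychain in this default.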
@MainActor
public static func live(
base: DSBaseURL = .publicTLS,
persistence: TokenPersistence = InMemoryTokenStore()
) -> AppModel {
let client = LiveDSClient(base: base, persistence: persistence)
let router = AppAIComposition.realRouter(base: base) { await client.currentAccessToken() }
return AppModel(client: client, ai: AIService(router: router), base: base)
}
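/// Runs once at launch (single-shot; safe against repeated .task firing):
/// attempts refresh. 401 (no/expired cookie) = loggedOut (no error shown) /
/// any other failure (e.g. transport) = loggedOut + loginError (no-silent-fallback) /
/// task cancellation = the guard is reset so the next appear retries.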
public func bootstrap() async {
guard !didBootstrap else { return }
didBootstrap = true
// authPhase is not reset to .checking here, so a re-entry after ready leaves the UI untouched.
do {
let token = try await client.refresh().accessToken
accessToken = token
authPhase = .ready
await loadInitial()
} catch let e as DSError where e.isAuthExpired {
authPhase = .loggedOut
} catch {
if Task.isCancelled {
didBootstrap = false
return
}
authPhase = .loggedOut
loginError = (error as? LocalizedError)?.errorDescription ?? "\(error)"
}
}
/// Login (POST /auth/login → JWT). A blank or whitespace-only totp is sent as nil.
public func login(username: String, password: String, totp: String?) async {
loginError = nil
do {
let code = totp.map {
$0.replacingOccurrences(of: " ", with: "").trimmingCharacters(in: .whitespacesAndNewlines)
}
let response = try await client.login(
username: username,
password: password,
totpCode: (code?.isEmpty ?? true) ? nil : code
)
accessToken = response.accessToken
authPhase = .ready
await loadInitial()
} catch {
loginError = (error as? LocalizedError)?.errorDescription ?? "\(error)"
}
}
public func loadInitial() async {
await guarded { self.accessToken = (try? await self.client.login(username: "hyungi", password: "x", totpCode: nil).accessToken) ?? "" }
await guarded { self.tree = try await self.client.documentTree() }
await guarded { self.stats = try await self.client.categoryCounts() }
await guarded { self.documentList = try await self.client.documents(DocumentListQuery()).items }
@@ -88,11 +159,26 @@ public final class AppModel {
public func downloadURL(for doc: DocumentResponse) -> URL? {
guard doc.hasDownloadableOriginal, !accessToken.isEmpty else { return nil }
return DSDownload.fileURL(base: .publicTLS, documentID: doc.id, accessToken: accessToken)
return DSDownload.fileURL(base: base, documentID: doc.id, accessToken: accessToken)
}
private func guarded(_ work: () async throws -> Void) async {
do { try await work() }
catch { errorText = (error as? LocalizedError)?.errorDescription ?? "\(error)" }
do {
try await work()
} catch let e as DSError where e.isAuthExpired {
// Reached only after LiveDSClient's own refresh+retry has failed (refresh cookie missing/expired).
authPhase = .loggedOut
loginError = "세션이 만료되었습니다. 다시 로그인하세요."
} catch {
errorText = (error as? LocalizedError)?.errorDescription ?? "\(error)"
}
await syncAccessToken()
}
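/// After a 401 rotation (LiveDSClient refresh), re-sync the token used for ?token= URLs at the end of
/// every guarded call. Source of truth = the client's TokenProvider.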
private func syncAccessToken() async {
guard let live = client as? LiveDSClient, let t = await live.currentAccessToken() else { return }
if t != accessToken { accessToken = t }
}
}
@@ -39,6 +39,9 @@ public final class LiveDSClient: DSClient, @unchecked Sendable {
public func setAccessToken(_ token: String) async { await tokens.set(token) }
/// Lets the realRouter ask closure share this TokenProvider (follows 401 refresh rotation).
public func currentAccessToken() async -> String? { await tokens.current() }
// MARK: - Request building / sending
private func makeRequest(_ endpoint: DSEndpoint, token: String?) throws -> URLRequest {
@@ -12,19 +12,21 @@ public struct DomainTreeNode: Codable, Sendable, Identifiable {
public var kids: [DomainTreeNode] { children ?? [] }
}
/// GET /documents/stats/category-counts returns a raw dict with no Pydantic response model.
/// The previously assumed shape (total/by_domain/...) did not decode against the live server;
/// corrected 2026-06-07 (fixture documents_stats.json = CAPTURED_LIVE).
public struct CategoryCounts: Codable, Sendable {
public let total: Int
public let documents: Int
public let byDomain: [String: Int]
public let reviewPending: Int
public let pipelineFailed: Int
/// Count per category (enum): document/library/news/law/memo/audio.
public let counts: [String: Int]
public let libraryPendingSuggestions: Int
enum CodingKeys: String, CodingKey {
case total, documents
case byDomain = "by_domain"
case reviewPending = "review_pending"
case pipelineFailed = "pipeline_failed"
case counts
case libraryPendingSuggestions = "library_pending_suggestions"
}
/// Total across all categories (sum of counts).
public var total: Int { counts.values.reduce(0, +) }
}
public struct DuplicateGroup: Codable, Sendable, Identifiable {
@@ -1,14 +1,11 @@
{
"total": 1163,
"documents": 783,
"by_domain": {
"Industrial_Safety": 426,
"Engineering": 351,
"General": 189,
"Programming": 60,
"법령": 23,
"Philosophy": 12
"counts": {
"library": 391,
"law": 229,
"document": 381,
"news": 6182,
"memo": 4,
"audio": 2
},
"review_pending": 725,
"pipeline_failed": 19
"library_pending_suggestions": 0
}
@@ -0,0 +1,182 @@
import XCTest
@testable import AppFeature
import DSKit
/// Auth state-machine tests with 0 network calls (Fixture/stub only).
/// bootstrap: refresh success=ready / failure=loggedOut. login: success=ready+token / 401=error shown.
final class AppModelAuthTests: XCTestCase {
@MainActor
private func makeModel(client: any DSClient) -> AppModel {
AppModel(client: client, ai: AIService(router: AppAIComposition.mockRouter()))
}
// refresh success (the Fixture client returns the fixture token) → ready + initial load
@MainActor
func testBootstrapRefreshSuccessGoesReady() async {
let model = AppModel.preview
await model.bootstrap()
XCTAssertEqual(model.authPhase, .ready)
XCTAssertFalse(model.accessToken.isEmpty)
XCTAssertFalse(model.documentList.isEmpty, "ready 진입 시 초기 로드까지 수행해야 함")
}
// refresh failure (no/expired cookie) → loggedOut, nothing loaded
@MainActor
func testBootstrapRefreshFailureGoesLoggedOut() async {
let model = makeModel(client: AuthStubClient(refreshFails: true))
await model.bootstrap()
XCTAssertEqual(model.authPhase, .loggedOut)
XCTAssertTrue(model.accessToken.isEmpty)
XCTAssertTrue(model.documentList.isEmpty)
}
// login from loggedOut → ready + initial load
@MainActor
func testLoginSuccessTransitionsToReady() async {
let model = makeModel(client: AuthStubClient(refreshFails: true))
await model.bootstrap()
XCTAssertEqual(model.authPhase, .loggedOut)
await model.login(username: "hyungi", password: "pw", totp: nil)
XCTAssertEqual(model.authPhase, .ready)
XCTAssertFalse(model.accessToken.isEmpty)
XCTAssertNil(model.loginError)
XCTAssertFalse(model.documentList.isEmpty)
}
// login 401 → loginError + stays loggedOut + no token
@MainActor
func testLoginFailureSurfacesErrorAndStaysLoggedOut() async {
let model = makeModel(client: AuthStubClient(refreshFails: true, loginFails: true))
await model.bootstrap()
await model.login(username: "hyungi", password: "wrong", totp: nil)
XCTAssertEqual(model.authPhase, .loggedOut)
XCTAssertNotNil(model.loginError)
XCTAssertTrue(model.accessToken.isEmpty)
}
// blank/whitespace totp → totpCode nil (accounts without totp)
@MainActor
func testLoginSendsNilForBlankTotp() async {
let stub = AuthStubClient(refreshFails: true)
let model = makeModel(client: stub)
await model.login(username: "u", password: "p", totp: " ")
XCTAssertNotNil(stub.recordedLogin, "login 이 호출돼야 함")
XCTAssertNil(stub.recordedLogin?.totp, "공백 totp 는 nil 로 정규화")
await model.login(username: "u", password: "p", totp: "123456")
XCTAssertEqual(stub.recordedLogin?.totp, "123456")
}
// totp normalization (spaces/newlines): "123 456\n" → "123456"
@MainActor
func testLoginNormalizesTotpNewlineAndSpaces() async {
let stub = AuthStubClient(refreshFails: true)
let model = makeModel(client: stub)
await model.login(username: "u", password: "p", totp: "123 456\n")
XCTAssertEqual(stub.recordedLogin?.totp, "123456")
await model.login(username: "u", password: "p", totp: " \n ")
XCTAssertNil(stub.recordedLogin?.totp, "개행+공백뿐이면 nil")
}
// bootstrap is single-shot (.task may fire again): refresh runs once, phase stays ready
@MainActor
func testBootstrapIsSingleShot() async {
let stub = AuthStubClient()
let model = makeModel(client: stub)
await model.bootstrap()
XCTAssertEqual(model.authPhase, .ready)
await model.bootstrap() // re-entry on a later appear
XCTAssertEqual(model.authPhase, .ready, "재진입이 checking 으로 리셋하면 안 됨")
XCTAssertEqual(stub.refreshCount, 1, "refresh 는 1회만")
}
// bootstrap transport failure (server unreachable) → loggedOut + reason shown (no silent fallback)
@MainActor
func testBootstrapTransportFailureExposesReason() async {
let model = makeModel(client: AuthStubClient(refreshTransportFails: true))
await model.bootstrap()
XCTAssertEqual(model.authPhase, .loggedOut)
XCTAssertNotNil(model.loginError, "transport 실패 사유가 로그인 화면에 노출돼야 함")
}
// session expiry during use (after refresh+retry has failed) → ready is demoted to loggedOut
@MainActor
func testAuthExpiredDuringUseDemotesToLoggedOut() async {
let stub = AuthStubClient()
let model = makeModel(client: stub)
await model.bootstrap()
XCTAssertEqual(model.authPhase, .ready)
stub.dataAuthExpired = true // data calls now throw 401 (refresh has already failed)
await model.openDocument(1)
XCTAssertEqual(model.authPhase, .loggedOut)
XCTAssertNotNil(model.loginError)
}
// live factory: composes a LiveDSClient with the given base.
@MainActor
func testLiveFactoryComposition() {
let model = AppModel.live(base: .tailscale)
XCTAssertTrue(model.client is LiveDSClient)
XCTAssertEqual(model.base.url.absoluteString, DSBaseURL.tailscale.url.absoluteString)
}
}
/// Wraps FixtureDSClient and adds controllable auth behavior (zero network).
/// Declared @unchecked Sendable because tests use it across tasks.
final class AuthStubClient: DSClient, @unchecked Sendable {
private let inner = FixtureDSClient()
private let refreshFails: Bool
private let refreshTransportFails: Bool
private let loginFails: Bool
private(set) var recordedLogin: (username: String, totp: String?)?
private(set) var refreshCount = 0
/// When true, gated data calls throw 401 (mimics LiveDSClient after auth has expired).
var dataAuthExpired = false
init(refreshFails: Bool = false, refreshTransportFails: Bool = false, loginFails: Bool = false) {
self.refreshFails = refreshFails
self.refreshTransportFails = refreshTransportFails
self.loginFails = loginFails
}
private func gateData() throws {
if dataAuthExpired { throw DSError.unauthorized(message: nil) }
}
// Auth
func login(username: String, password: String, totpCode: String?) async throws -> AccessTokenResponse {
recordedLogin = (username, totpCode)
if loginFails { throw DSError.unauthorized(message: "아이디 또는 비밀번호가 올바르지 않습니다") }
return try await inner.login(username: username, password: password, totpCode: totpCode)
}
func refresh() async throws -> AccessTokenResponse {
refreshCount += 1
if refreshTransportFails { throw DSError.transport(underlying: "Could not connect to the server") }
if refreshFails { throw DSError.unauthorized(message: "refresh failed") }
return try await inner.refresh()
}
func me() async throws -> UserResponse { try await inner.me() }
func logout() async throws { try await inner.logout() }
// Fixture passthrough (documents and document are gated by dataAuthExpired)
func documents(_ query: DocumentListQuery) async throws -> DocumentListResponse { try gateData(); return try await inner.documents(query) }
func document(id: Int) async throws -> DocumentDetailResponse { try gateData(); return try await inner.document(id: id) }
func documentContent(id: Int) async throws -> DocumentContentResponse { try await inner.documentContent(id: id) }
func documentTree() async throws -> [DomainTreeNode] { try await inner.documentTree() }
func categoryCounts() async throws -> CategoryCounts { try await inner.categoryCounts() }
func duplicates() async throws -> DuplicatesResponse { try await inner.duplicates() }
func patchDocument(id: Int, _ update: DocumentUpdate) async throws -> DocumentResponse { try await inner.patchDocument(id: id, update) }
func putContent(id: Int, content: String) async throws { try await inner.putContent(id: id, content: content) }
func deleteDocument(id: Int) async throws { try await inner.deleteDocument(id: id) }
func search(q: String, mode: SearchMode?, page: Int?, debug: Bool?) async throws -> SearchResponse { try await inner.search(q: q, mode: mode, page: page, debug: debug) }
func ask(q: String, limit: Int?, backend: String?, debug: Bool?) async throws -> AskResponse { try await inner.ask(q: q, limit: limit, backend: backend, debug: debug) }
func memos(_ query: MemoListQuery) async throws -> MemoListResponse { try await inner.memos(query) }
func memo(id: Int) async throws -> MemoResponse { try await inner.memo(id: id) }
func createMemo(_ create: MemoCreate) async throws -> MemoResponse { try await inner.createMemo(create) }
func patchMemo(id: Int, _ update: MemoUpdate) async throws -> MemoResponse { try await inner.patchMemo(id: id, update) }
func pinMemo(id: Int, pinned: Bool) async throws -> MemoResponse { try await inner.pinMemo(id: id, pinned: pinned) }
func archiveMemo(id: Int, archived: Bool) async throws -> MemoResponse { try await inner.archiveMemo(id: id, archived: archived) }
func toggleMemoTask(id: Int, taskIndex: Int, checked: Bool) async throws -> MemoResponse { try await inner.toggleMemoTask(id: id, taskIndex: taskIndex, checked: checked) }
func deleteMemo(id: Int) async throws { try await inner.deleteMemo(id: id) }
func digest(date: String?, country: String?) async throws -> DigestResponse { try await inner.digest(date: date, country: country) }
}
@@ -65,8 +65,10 @@ final class FixtureDecodeTests: XCTestCase {
func testStats() async throws {
let s = try await client.categoryCounts()
XCTAssertEqual(s.documents, 783)
XCTAssertEqual(s.byDomain["법령"], 23) // non-ASCII dict key
XCTAssertEqual(s.counts["news"], 6182)
XCTAssertEqual(s.counts["library"], 391)
XCTAssertEqual(s.libraryPendingSuggestions, 0)
XCTAssertEqual(s.total, 391 + 229 + 381 + 6182 + 4 + 2) // = sum of counts
}
func testDuplicates() async throws {
@@ -53,7 +53,7 @@ UserResponse { id: Int, username: String, is_active: Bool, totp_enabled: Bool, l
| GET | `/documents/{id}/file` | `?token=<access>&download=true` | **Original binary** (PDF/image/audio/original) | — |
| GET | `/documents/{id}/content` | — | Lightweight text (`content` 15k cap) | `document_content.json` |
| GET | `/documents/tree` | — | Domain tree (sidebar) | `documents_tree.json` |
| GET | `/documents/stats/category-counts` | — | Category counts | `documents_stats.json` |
| GET | `/documents/stats/category-counts` | — | `{counts: {category: n}, library_pending_suggestions}` **returns a raw dict (no Pydantic model); corrected from the 2026-06-07 live re-capture** (the initial extraction synthesized the shape incorrectly) | `documents_stats.json` |
| POST | `/documents/` (multipart) | File upload | `DocumentResponse` (201) | `document_detail.json` |
| PATCH | `/documents/{id}` | `DocumentUpdate` | `DocumentResponse` | — |
| PUT | `/documents/{id}/content` | `{content}` (md edit save) | `{}` | — |
@@ -1,14 +1,11 @@
{
"total": 1163,
"documents": 783,
"by_domain": {
"Industrial_Safety": 426,
"Engineering": 351,
"General": 189,
"Programming": 60,
"법령": 23,
"Philosophy": 12
"counts": {
"library": 391,
"law": 229,
"document": 381,
"news": 6182,
"memo": 4,
"audio": 2
},
"review_pending": 725,
"pipeline_failed": 19
"library_pending_suggestions": 0
}
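The re-captured fixture above has the shape `{counts, library_pending_suggestions}`, and the fixture test expects the total to equal the sum of the per-category counts. A minimal TypeScript sketch of that relationship follows; the `CategoryCountsPayload` type and `totalCount` helper are hypothetical names used only for illustration, not part of the codebase.

```typescript
// Hypothetical type mirroring the re-captured fixture shape (illustrative only).
interface CategoryCountsPayload {
  counts: Record<string, number>;
  library_pending_suggestions: number;
}

// Sum of all per-category counts; the fixture test expects `total` to equal this.
function totalCount(payload: CategoryCountsPayload): number {
  return Object.values(payload.counts).reduce((sum, n) => sum + n, 0);
}

// Values copied from the fixture diff above.
const sample: CategoryCountsPayload = {
  counts: { library: 391, law: 229, document: 381, news: 6182, memo: 4, audio: 2 },
  library_pending_suggestions: 0,
};

console.log(totalCount(sample)); // 7189
```

Deriving the total from `counts` rather than storing it separately keeps the client from drifting when the server adds a category.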
@@ -0,0 +1,87 @@
# S1 data and backend track rollout runbook (plan ds-s1-backend-1)
> The code is complete on the `feat/s1-dedup-fields` branch. This document is the **prod (GPU) rollout gate** procedure.
> ⚠ Rollout requires an explicit go from the user. This runbook is not executed automatically.
## 0. Preconditions (gate)
- [ ] **Check the search-experiment Soft Lock**: `~/.claude/.search-experiment-active` must be absent.
As of 2026-06-05 it is absent, so the lock is inactive. Migration 317 is applied automatically at startup, and `docker compose up`
triggers a restart, so while an experiment is active, apply only after agreeing on an exception window.
- [ ] **Untouchable surface (search-experiment validity)**: the embedding model, the vector index (ivfflat/partial),
the retrieval config, and the ai/model sections of config.yaml are **not touched**. This track changes only
the dedup columns, office_md, and the storage scaffold (env).
## 1. Migration number
- 317 (3 dedup columns) is the **only** claim. P0-4 = (C), no change, so no new migration is added.
- Coordinate so that the S2/S3 tracks do not issue the same 317 (avoids startup chaos).
## 2. Restart set (batched in one pass)
| Service | Change | Restart reason |
|---|---|---|
| `fastapi` | A(317 dedup) + B(dedup API) + D(storage scaffold) | startup migration auto-apply + code |
| `marker_worker` (scheduler inside fastapi) | C(office_md branch) + **markitdown, a new pip dep** | rebuild required |
> markitdown is a new dependency, so `docker compose build` is mandatory (force-recreate alone does not refresh the image,
> see feedback_docker_compose_build_vs_force_recreate). It is needed only for office (OOXML) conversion.
## 3. Rollout order (inventory → config → deploy → verify)
```bash
ssh gpu && cd ~/Documents/code/hyungi_Document_Server
# (1) pre-A-1 safety net: DB dump (outside the repo)
bash scripts/s1_pre_change_backup.sh pre-a1
# (2) fetch the code + build (picks up the markitdown dep) + apply
git fetch && git checkout feat/s1-dedup-fields # or main, after merging into main
docker compose build fastapi # installs markitdown (must be added to requirements)
docker compose up -d fastapi # migration 317 is auto-applied at startup
# (3) confirm that migration 317 was applied
docker compose exec -T postgres psql -U pkm -d pkm -c \
"SELECT version,name FROM schema_migrations WHERE version=317;"
docker compose exec -T postgres psql -U pkm -d pkm -c \
"\d documents" | grep -E 'original_filename|duplicate_of|duplicate_count'
```
> **requirements**: `markitdown` must be added for office OOXML conversion (`requirements.txt`/pyproject).
> markdownify and LibreOffice are already present. A PR adding the dep is required before the build (without it, OOXML conversion ends in OfficeMdError→failed,
> while hwp/PDF/passthrough work normally).
## 4. Backfill (after the code is applied and verified, in a non-overlapping nightly window)
> Dedup column consistency is recomputed idempotently every day by the **nightly job `dedup_reconcile` (03:30 KST, main.py)**
> (residual soft-delete drift is cleaned up automatically). The manual `backfill_dedup.py` run below is a one-time
> initial fill and immediate check right after rollout. Afterwards the nightly job maintains it.
```bash
# (4a) dedup backfill (initial, once): dry-run first to confirm the exact UPDATE set
bash scripts/s1_pre_change_backup.sh pre-b4
docker compose exec fastapi python /app/scripts/backfill_dedup.py --dry-run
docker compose exec fastapi python /app/scripts/backfill_dedup.py --apply
# (4b) office/hwp pending markdown backfill: nightly window that does not overlap C-2 live ingestion
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --dry-run
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --apply --limit 20 # sample first
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --apply # full run
```
## 5. Verify (smoke)
```bash
# /duplicates shape
curl -s -H "Authorization: Bearer $TOK" https://document.hyungi.net/api/documents/duplicates | jq '{total_groups,total_duplicate_docs, g0:.groups[0]}'
# office conversion result (sample doc)
docker compose exec -T postgres psql -U pkm -d pkm -c \
"SELECT md_status,md_extraction_engine,length(md_content) FROM documents WHERE id=<office_doc_id>;"
# md_status success→completed serialization (app contract)
curl -s -H "Authorization: Bearer $TOK" https://document.hyungi.net/api/documents/<id> | jq '.md_status'
```
## 6. Rollback
- Fast rollback of the columns only: `scripts/rollback_317.sql` (manual; also deletes the schema_migrations row for 317).
- Full restore: the `.sql.gz` produced by `scripts/s1_pre_change_backup.sh` → restore with psql.
@@ -3,10 +3,17 @@
import { addToast } from '$lib/stores/toast';
import { marked } from 'marked';
import DOMPurify from 'dompurify';
import { ExternalLink, Save, RefreshCw } from 'lucide-svelte';
import { ExternalLink, Save } from 'lucide-svelte';
import Tabs from '$lib/components/ui/Tabs.svelte';
import MarkdownDoc from '$lib/components/MarkdownDoc.svelte';
import MarkdownStatusBadge from '$lib/components/MarkdownStatusBadge.svelte';
import SectionOutline from '$lib/components/SectionOutline.svelte';
import { getViewerType } from '$lib/utils/viewerType';
import { isMdSuccess } from '$lib/utils/mdStatus';
import { buildAnchorMap } from '$lib/utils/outlineAnchors';
import { cleanHeading } from '$lib/utils/headingPath';
// marked + sanitize
// Plain marked, used only for the edit preview (MarkdownDoc handles body rendering).
marked.use({ mangle: false, headerIds: false });
function renderMd(text) {
return DOMPurify.sanitize(marked(text), {
@@ -22,33 +29,19 @@
let loading = $state(true);
let viewerType = $state('none');
// Markdown editing
// Markdown editing (md/txt: extracted_text is the single field for both display and editing)
let editMode = $state(false);
let editContent = $state('');
let editTab = $state('edit');
let saving = $state(false);
let rawMarkdown = $state('');
function getViewerType(format) {
if (['md', 'txt'].includes(format)) return 'markdown';
if (format === 'pdf') return 'pdf';
if (['hwp', 'hwpx'].includes(format)) return 'preview-pdf';
if (['odoc', 'osheet', 'docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp'].includes(format)) return 'preview-pdf';
if (['jpg', 'jpeg', 'png', 'gif', 'bmp', 'tiff'].includes(format)) return 'image';
if (['csv', 'json', 'xml', 'html'].includes(format)) return 'text';
if (['dwg', 'dxf'].includes(format)) return 'cad';
return 'unsupported';
}
const ODF_FORMATS = ['ods', 'odt', 'odp', 'odoc', 'osheet'];
function getEditInfo(doc) {
// An edit URL stored in the DB takes priority
if (doc.edit_url) return { url: doc.edit_url, label: '편집' };
// ODF formats → Synology Drive
if (ODF_FORMATS.includes(doc.file_format)) return { url: 'https://link.hyungi.net', label: 'Synology Drive에서 열기' };
// CAD
if (['dwg', 'dxf'].includes(doc.file_format)) return { url: 'https://web.autocad.com', label: 'AutoCAD Web' };
function getEditInfo(d) {
if (d.edit_url) return { url: d.edit_url, label: '편집' };
if (ODF_FORMATS.includes(d.file_format)) return { url: 'https://link.hyungi.net', label: 'Synology Drive에서 열기' };
if (['dwg', 'dxf'].includes(d.file_format)) return { url: 'https://web.autocad.com', label: 'AutoCAD Web' };
return null;
}
@@ -61,18 +54,19 @@
async function loadFullDoc(id) {
loading = true;
rawMarkdown = '';
sections = [];
try {
fullDoc = await api(`/documents/${id}`);
viewerType = fullDoc.source_channel === 'news' ? 'article' : getViewerType(fullDoc.file_format);
viewerType = getViewerType(fullDoc.file_format, fullDoc.source_channel);
loadSections(id);
// Markdown: if extracted_text is missing, fetch the original file directly
// For body markdown (md/txt) with an empty extracted_text, load the original file directly.
if (viewerType === 'markdown' && !fullDoc.extracted_text) {
try {
const resp = await fetch(`/api/documents/${id}/file?token=${getAccessToken()}`);
if (resp.ok) rawMarkdown = await resp.text();
} catch (e) { rawMarkdown = ''; }
} else {
rawMarkdown = '';
}
} catch (err) {
fullDoc = null;
@@ -82,6 +76,86 @@
}
}
// PDF markdown-first: when marker has produced a canonical md_content, show it by default and
// offer a "PDF 원본" toggle. lastDocId is keyed on the prop (fullDoc.id): the 3-pane has no route
// remount, so a page.params guard would be a no-op.
let pdfViewMode = $state('markdown');
let lastDocId = $state(null);
let canShowMarkdown = $derived(
!!(isMdSuccess(fullDoc?.md_status) && fullDoc?.md_content?.trim())
);
$effect(() => {
if (!fullDoc) return;
if (fullDoc.id !== lastDocId) {
lastDocId = fullDoc.id;
pdfViewMode = canShowMarkdown ? 'markdown' : 'pdf';
}
if (!canShowMarkdown && pdfViewMode === 'markdown') pdfViewMode = 'pdf';
});
// ── Section outline rail + jump + scroll-spy (outlineAnchors, path A) ──
let sections = $state([]);
async function loadSections(id) {
try {
const r = await api(`/documents/${id}/sections`);
if (id === doc?.id) sections = r?.sections ?? [];
} catch {
if (id === doc?.id) sections = [];
}
}
// Items with no displayable title, such as empty-title window chunks (31% noise), are excluded from the rail (cleanup).
let outlineSections = $derived(
sections.filter(
(s) => !!(cleanHeading(s.section_title) || cleanHeading((s.heading_path || '').split('>').pop() || '')),
),
);
// The text MarkdownDoc actually renders (must match the basis of the anchor offsets).
let mdRenderText = $derived.by(() => {
if (!fullDoc) return '';
if (viewerType === 'pdf') return pdfViewMode === 'markdown' && canShowMarkdown ? (fullDoc.md_content || '') : '';
if (viewerType === 'markdown') return fullDoc.extracted_text || rawMarkdown || '';
if (viewerType === 'hwp-markdown' || viewerType === 'article') return fullDoc.md_content || fullDoc.extracted_text || '';
return '';
});
let anchorMap = $derived(sections.length && mdRenderText ? buildAnchorMap(mdRenderText, sections).anchors : {});
let showRail = $derived(outlineSections.length > 0 && !!mdRenderText);
let scrollEl = $state();
let activeKey = $state(null);
function jumpTo(chunkId) {
const el = scrollEl?.querySelector(`#sec-${chunkId}`);
if (el) el.scrollIntoView({ block: 'start', behavior: 'smooth' });
}
// scroll-spy: the last .md-anchor inside scrollEl that has passed the container top (+120) is the current section.
$effect(() => {
void anchorMap;
const el = scrollEl;
if (!el) return;
let raf = 0;
const onScroll = () => {
if (raf) return;
raf = requestAnimationFrame(() => {
raf = 0;
const threshold = el.getBoundingClientRect().top + 120;
let cur = null;
el.querySelectorAll('.md-anchor').forEach((a) => {
if (a.getBoundingClientRect().top <= threshold) cur = a;
});
if (cur) {
const m = cur.id.match(/^sec-(\d+)$/);
if (m) activeKey = Number(m[1]);
}
});
};
el.addEventListener('scroll', onScroll, { passive: true });
const t = setTimeout(onScroll, 0);
return () => {
el.removeEventListener('scroll', onScroll);
clearTimeout(t);
if (raf) cancelAnimationFrame(raf);
};
});
function startEdit() {
editContent = fullDoc?.extracted_text || rawMarkdown || '';
editMode = true;
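The scroll-spy rule in this hunk (the last `.md-anchor` whose top is at or above the container top plus 120 px is the current section) can be isolated as a pure function, which makes it testable without a DOM. The sketch below is an illustration under that assumption; `AnchorPos` and `pickActiveAnchor` are hypothetical names, and the component itself reads positions from `getBoundingClientRect()`.

```typescript
// Hypothetical pure helper (not in the component): anchors are given in
// document order together with their viewport top positions.
interface AnchorPos {
  id: number;
  top: number;
}

// Returns the id of the last anchor at or above the threshold line,
// or null when no anchor has passed it yet.
function pickActiveAnchor(anchors: AnchorPos[], threshold: number): number | null {
  let current: number | null = null;
  for (const a of anchors) {
    if (a.top <= threshold) current = a.id;
  }
  return current;
}

// Anchor 2 is the last one above the 120 px line, so it is the active section.
const active = pickActiveAnchor(
  [
    { id: 1, top: -300 },
    { id: 2, top: 50 },
    { id: 3, top: 400 },
  ],
  120,
);
console.log(active); // 2
```

The component keeps the previous `activeKey` when nothing has passed the threshold; a caller of this helper would get the same behavior by ignoring a `null` result.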
@@ -113,6 +187,7 @@
}
let editInfo = $derived(fullDoc ? getEditInfo(fullDoc) : null);
const PROSE = 'prose prose-invert prose-base max-w-none';
</script>
<svelte:window on:keydown={handleKeydown} />
@@ -125,149 +200,161 @@
<div class="flex items-center gap-2">
{#if viewerType === 'markdown'}
{#if editMode}
<button
onclick={saveContent}
disabled={saving}
class="flex items-center gap-1 px-2 py-1 text-xs bg-accent text-white rounded hover:bg-accent-hover disabled:opacity-50"
>
<button onclick={saveContent} disabled={saving}
class="flex items-center gap-1 px-2 py-1 text-xs bg-accent text-white rounded hover:bg-accent-hover disabled:opacity-50">
<Save size={12} /> {saving ? '저장 중...' : '저장'}
</button>
<button
onclick={() => editMode = false}
class="px-2 py-1 text-xs text-dim hover:text-text"
>취소</button>
<button onclick={() => editMode = false} class="px-2 py-1 text-xs text-dim hover:text-text">취소</button>
{:else}
<button
onclick={startEdit}
class="px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded"
>편집</button>
<button onclick={startEdit} class="px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded">편집</button>
{/if}
{/if}
{#if editInfo}
<a
href={editInfo.url}
target="_blank"
rel="noopener"
class="flex items-center gap-1 px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded"
>
<a href={editInfo.url} target="_blank" rel="noopener"
class="flex items-center gap-1 px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded">
<ExternalLink size={12} /> {editInfo.label}
</a>
{/if}
<a
href="/documents/{fullDoc.id}"
class="px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded"
>전체 보기</a>
<a href="/documents/{fullDoc.id}" class="px-2 py-1 text-xs text-dim hover:text-accent border border-default rounded">전체 보기</a>
</div>
</div>
{/if}
<!-- Viewer body -->
<div class="flex-1 overflow-auto min-h-0">
<!-- Viewer body (+ section outline rail) -->
<div class="flex-1 flex min-h-0">
{#if showRail}
<aside class="hidden lg:block w-[176px] shrink-0 overflow-y-auto border-r border-default p-2 bg-sidebar">
<SectionOutline sections={outlineSections} onJump={jumpTo} {activeKey} />
</aside>
{/if}
<div class="flex-1 overflow-auto min-h-0" bind:this={scrollEl}>
{#if loading}
<div class="flex items-center justify-center h-full">
<p class="text-sm text-dim">로딩 중...</p>
</div>
<div class="flex items-center justify-center h-full"><p class="text-sm text-dim">로딩 중...</p></div>
{:else if fullDoc}
{#if viewerType === 'markdown'}
{#if editMode}
<!-- Markdown editing (Tabs primitive, E.4) -->
<div class="flex flex-col h-full">
<Tabs
tabs={[
{ id: 'edit', label: '편집' },
{ id: 'preview', label: '미리보기' },
]}
bind:value={editTab}
class="flex flex-col h-full"
>
<Tabs tabs={[{ id: 'edit', label: '편집' }, { id: 'preview', label: '미리보기' }]} bind:value={editTab} class="flex flex-col h-full">
{#snippet children(activeId)}
{#if activeId === 'edit'}
<textarea
bind:value={editContent}
<textarea bind:value={editContent}
class="flex-1 w-full p-4 bg-bg text-text text-sm font-mono resize-none outline-none min-h-[300px]"
spellcheck="false"
aria-label="마크다운 편집"
></textarea>
spellcheck="false" aria-label="마크다운 편집"></textarea>
{:else}
<div class="flex-1 overflow-auto p-4 markdown-body">
{@html renderMd(editContent)}
</div>
<div class="flex-1 overflow-auto p-4 markdown-body">{@html renderMd(editContent)}</div>
{/if}
{/snippet}
</Tabs>
</div>
{:else}
<div class="p-4 markdown-body">
{@html renderMd(fullDoc.extracted_text || rawMarkdown || '*텍스트 추출 대기 중*')}
<!-- md/txt = extracted_text single field (display = edit); MarkdownDoc renders anchors/KaTeX/images -->
<div class="p-4">
<MarkdownDoc
documentId={fullDoc.id}
mdContent={null}
mdStatus={fullDoc.md_status}
mdExtractionError={fullDoc.md_extraction_error}
mdExtractionQuality={fullDoc.md_extraction_quality}
anchorMap={anchorMap}
extractedText={fullDoc.extracted_text || rawMarkdown}
class={PROSE}
/>
</div>
{/if}
{:else if viewerType === 'pdf'}
<iframe
src="/api/documents/{fullDoc.id}/file?token={getAccessToken()}"
class="w-full h-full border-0"
title={fullDoc.title}
></iframe>
{:else if viewerType === 'preview-pdf'}
<iframe
src="/api/documents/{fullDoc.id}/preview?token={getAccessToken()}"
class="w-full h-full border-0"
title={fullDoc.title}
onerror={() => {}}
></iframe>
{:else if viewerType === 'image'}
<div class="flex items-center justify-center h-full p-4">
<img
src="/api/documents/{fullDoc.id}/file?token={getAccessToken()}"
alt={fullDoc.title}
class="max-w-full max-h-full object-contain rounded"
<div class="p-4">
<div class="mb-2 flex items-center gap-2">
<MarkdownStatusBadge mdStatus={fullDoc.md_status} mdExtractionError={fullDoc.md_extraction_error} mdExtractionQuality={fullDoc.md_extraction_quality} />
{#if canShowMarkdown}
<button onclick={() => (pdfViewMode = 'markdown')}
class="px-2 py-1 text-xs rounded border {pdfViewMode === 'markdown' ? 'bg-accent text-white border-accent' : 'text-dim border-default hover:text-accent'}">Markdown</button>
<button onclick={() => (pdfViewMode = 'pdf')}
class="px-2 py-1 text-xs rounded border {pdfViewMode === 'pdf' ? 'bg-accent text-white border-accent' : 'text-dim border-default hover:text-accent'}">PDF 원본</button>
{/if}
</div>
{#if pdfViewMode === 'markdown' && canShowMarkdown}
<MarkdownDoc
documentId={fullDoc.id}
mdContent={fullDoc.md_content}
mdFrontmatter={fullDoc.md_frontmatter}
mdStatus={fullDoc.md_status}
mdExtractionError={fullDoc.md_extraction_error}
mdExtractionQuality={fullDoc.md_extraction_quality}
anchorMap={anchorMap}
extractedText={fullDoc.extracted_text}
class={PROSE}
/>
{:else}
<iframe src="/api/documents/{fullDoc.id}/file?token={getAccessToken()}" class="w-full h-[80vh] border-0 rounded" title={fullDoc.title}></iframe>
{/if}
</div>
{:else if viewerType === 'hwp-markdown'}
<div class="p-4">
<MarkdownDoc
documentId={fullDoc.id}
mdContent={fullDoc.md_content}
mdFrontmatter={fullDoc.md_frontmatter}
mdStatus={fullDoc.md_status}
mdExtractionError={fullDoc.md_extraction_error}
mdExtractionQuality={fullDoc.md_extraction_quality}
extractedText={fullDoc.extracted_text}
class={PROSE}
/>
</div>
{:else if viewerType === 'preview-pdf'}
<iframe src="/api/documents/{fullDoc.id}/preview?token={getAccessToken()}" class="w-full h-full border-0" title={fullDoc.title} onerror={() => {}}></iframe>
{:else if viewerType === 'image'}
<div class="flex items-center justify-center h-full p-4">
<img src="/api/documents/{fullDoc.id}/file?token={getAccessToken()}" alt={fullDoc.title} class="max-w-full max-h-full object-contain rounded" />
</div>
{:else if viewerType === 'text'}
<div class="p-4">
<pre class="text-sm text-text whitespace-pre-wrap font-mono">{fullDoc.extracted_text || '텍스트 없음'}</pre>
<div class="p-4"><pre class="text-sm text-text whitespace-pre-wrap font-mono">{fullDoc.extracted_text || '텍스트 없음'}</pre></div>
{:else if viewerType === 'synology'}
<div class="flex flex-col items-center justify-center h-full gap-3">
<p class="text-sm text-dim">Synology Office 문서 — 외부 편집기에서 열어야 합니다.</p>
<a href={fullDoc.edit_url || 'https://link.hyungi.net'} target="_blank" rel="noopener"
class="flex items-center gap-1 px-3 py-1.5 text-sm bg-accent text-white rounded-lg hover:bg-accent-hover">
<ExternalLink size={14} /> 새 창에서 열기
</a>
</div>
{:else if viewerType === 'cad'}
<div class="flex flex-col items-center justify-center h-full gap-3">
<p class="text-sm text-dim">CAD 미리보기 (향후 지원 예정)</p>
<a
href="https://web.autocad.com"
target="_blank"
class="px-3 py-1.5 text-sm bg-accent text-white rounded hover:bg-accent-hover"
>AutoCAD Web에서 열기</a>
<a href="https://web.autocad.com" target="_blank" rel="noopener" class="px-3 py-1.5 text-sm bg-accent text-white rounded hover:bg-accent-hover">AutoCAD Web에서 열기</a>
</div>
{:else if viewerType === 'article'}
<!-- News-only viewer -->
<div class="p-5 max-w-3xl mx-auto">
<h1 class="text-lg font-bold mb-2">{fullDoc.title}</h1>
<h1 class="text-lg font-bold mb-2 text-text">{fullDoc.title}</h1>
<div class="flex items-center gap-2 mb-4 text-xs text-dim">
{#if fullDoc.ai_tags?.length}
{#each fullDoc.ai_tags.filter(t => t.startsWith('News/')) as tag}
<span class="px-1.5 py-0.5 rounded bg-blue-900/30 text-blue-400">{tag.replace('News/', '')}</span>
<span class="px-1.5 py-0.5 rounded bg-accent/15 text-accent-hover">{tag.replace('News/', '')}</span>
{/each}
{/if}
<span>{new Date(fullDoc.created_at).toLocaleDateString('ko-KR', { year: 'numeric', month: 'short', day: 'numeric', hour: '2-digit', minute: '2-digit' })}</span>
</div>
<div class="markdown-body mb-6">
{@html renderMd(fullDoc.extracted_text || '')}
</div>
<div class="flex items-center gap-3 pt-4 border-t border-default">
{#if fullDoc.edit_url}
<a
href={fullDoc.edit_url}
target="_blank"
rel="noopener noreferrer"
class="flex items-center gap-1 px-3 py-1.5 text-sm bg-accent text-white rounded-lg hover:bg-accent-hover"
>
<MarkdownDoc
documentId={fullDoc.id}
mdContent={fullDoc.md_content}
mdStatus={fullDoc.md_status}
mdExtractionError={fullDoc.md_extraction_error}
mdExtractionQuality={fullDoc.md_extraction_quality}
extractedText={fullDoc.extracted_text}
class="{PROSE} mb-6"
/>
{#if fullDoc.edit_url}
<div class="flex items-center gap-3 pt-4 border-t border-default">
<a href={fullDoc.edit_url} target="_blank" rel="noopener noreferrer"
class="flex items-center gap-1 px-3 py-1.5 text-sm bg-accent text-white rounded-lg hover:bg-accent-hover">
<ExternalLink size={14} /> 원문 보기
</a>
{/if}
</div>
</div>
{/if}
</div>
{:else}
<div class="flex items-center justify-center h-full">
<p class="text-sm text-dim">미리보기를 지원하지 않는 형식입니다 ({fullDoc.file_format})</p>
</div>
<div class="flex items-center justify-center h-full"><p class="text-sm text-dim">미리보기를 지원하지 않는 형식입니다 ({fullDoc.file_format})</p></div>
{/if}
{/if}
</div>
</div>
</div>
@@ -28,6 +28,9 @@
mdStatus?: string | null;
mdExtractionError?: string | null;
mdExtractionQuality?: Record<string, unknown> | null;
/** Anchors for outline jumps: {chunk_id: md_content char offset}. Before rendering,
* <span id="sec-{chunk_id}"> is injected at each position (jump target). Output of buildAnchorMap (outlineAnchors). */
anchorMap?: Record<number, number> | null;
placeholder?: string;
/** Extra wrapper classes, for when the caller needs to apply tailwind prose-* / spacing etc. */
class?: string;
@@ -41,10 +44,27 @@
mdStatus = null,
mdExtractionError = null,
mdExtractionQuality = null,
anchorMap = null,
placeholder = '*텍스트 추출 대기 중*',
class: klass = '',
}: Props = $props();
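// Outline anchor injection: insert an empty <span id="sec-N"> at each offset of body (descending order) as a jump target.
// The offsets must have been computed by buildAnchorMap against the same string as body (caller's responsibility).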
function spliceAnchors(text: string, map: Record<number, number> | null): string {
if (!map) return text;
const ents = Object.entries(map)
.map(([id, off]) => [id, Number(off)] as [string, number])
.filter(([, o]) => Number.isFinite(o) && o >= 0 && o <= text.length)
.sort((a, b) => b[1] - a[1]);
if (!ents.length) return text;
let out = text;
for (const [id, off] of ents) {
out = out.slice(0, off) + `<span id="sec-${id}" class="md-anchor"></span>\n` + out.slice(off);
}
return out;
}
let usingMarkdown = $derived(!!(mdContent && mdContent.trim()));
let body = $derived(
usingMarkdown
@@ -53,7 +73,7 @@
? extractedText
: placeholder,
);
let renderedHtml = $derived(renderDocMarkdown(body));
let renderedHtml = $derived(renderDocMarkdown(spliceAnchors(body, anchorMap)));
let frontmatterEntries = $derived.by(() => {
if (!usingMarkdown || !mdFrontmatter) return [] as [string, unknown][];
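`spliceAnchors` in this hunk inserts the markers in descending offset order. The reason is that each insertion shifts every later character, so working from the end of the string keeps the remaining (smaller) offsets valid. The standalone sketch below reproduces that idea; `spliceMarkers` is a hypothetical name, and the sample text and offsets are made up for illustration.

```typescript
// Standalone sketch of descending-offset splicing (illustrative, not the component code).
function spliceMarkers(text: string, offsets: Record<string, number>): string {
  const entries = Object.entries(offsets)
    // Drop offsets that fall outside the text.
    .filter(([, off]) => Number.isFinite(off) && off >= 0 && off <= text.length)
    // Largest offset first, so earlier offsets are not shifted by later inserts.
    .sort((a, b) => b[1] - a[1]);
  let out = text;
  for (const [id, off] of entries) {
    out = out.slice(0, off) + `<span id="sec-${id}" class="md-anchor"></span>\n` + out.slice(off);
  }
  return out;
}

// "# A" starts at offset 0 and "# B" starts at offset 6 in this sample.
const source = '# A\nx\n# B\ny';
const spliced = spliceMarkers(source, { '1': 0, '2': 6 });
console.log(spliced);
```

Splicing in ascending order would require adding the length of every previously inserted marker to each subsequent offset, which is easy to get wrong.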
@@ -77,6 +77,7 @@
case 'processing':
return { tone: 'accent', label: 'Markdown 변환 중', tooltip: null };
case 'success':
case 'completed': // the API field_validator remaps DB 'success'→'completed' (S1 backend); treated as a synonym
return {
tone: 'success',
label: 'Markdown',
@@ -15,8 +15,12 @@
interface Props {
sections: DocumentSection[];
/** Body-jump callback on item click (the parent calls scrollIntoView on #sec-{chunkId}). Without it, accordion only. */
onJump?: (chunkId: number) => void;
/** Current scroll-spy section (chunk_id), used for highlighting. */
activeKey?: number | null;
}
let { sections }: Props = $props();
let { sections, onJump, activeKey = null }: Props = $props();
let layout = $derived(groupOrFlat(sections));
let total = $derived(sections.length);
@@ -37,15 +41,17 @@
{#snippet itemRow(item: OutlineItem)}
{@const s = item.section}
{@const open = selectedId === s.chunk_id}
{@const active = activeKey != null && activeKey === s.chunk_id}
{@const typeLabel = sectionTypeLabel(s.section_type)}
<li>
<button
type="button"
onclick={() => toggle(item)}
onclick={() => { toggle(item); onJump?.(s.chunk_id); }}
aria-expanded={open}
aria-current={active ? 'true' : undefined}
class={[
'w-full text-left px-2 py-1.5 rounded-md text-xs flex items-start gap-1.5 transition-colors',
open ? 'bg-surface-active text-text' : 'text-dim hover:bg-surface hover:text-text',
'w-full text-left px-2 py-1.5 rounded-md text-xs flex items-start gap-1.5 transition-colors border-l-2',
open ? 'bg-surface-active text-text border-accent' : active ? 'bg-surface text-accent-hover border-accent' : 'text-dim hover:bg-surface hover:text-text border-transparent',
].join(' ')}
>
<span class="flex-1 min-w-0 leading-snug break-words">{title(s)}</span>
@@ -0,0 +1,25 @@
// Single source for the md_status vocabulary.
//
// The DB CHECK enum uses 'success', but on API serialization the field_validator
// `_db_success_to_completed` (app/api/documents.py) remaps 'success' → 'completed'
// (S1 backend). The other states (pending/processing/partial/skipped/failed) are identical on both sides.
//
// The frontend therefore has to treat both words as "success" to stay safe both before (API='success')
// and after (API='completed') the S1 backend deploy. (DB↔API enum divergence guard: md_status comparisons
// must always go through this helper; do not scatter raw `=== 'success'` / `=== 'completed'` checks.)
/** DB 'success' or API 'completed' = conversion succeeded (markdown is ready). */
export function isMdSuccess(status: string | null | undefined): boolean {
return status === 'success' || status === 'completed';
}
/** States for which the md status chip is rendered. pending/null are hidden (avoids bulk legacy noise). */
export function isMdStatusVisible(status: string | null | undefined): boolean {
return (
status === 'processing' ||
isMdSuccess(status) ||
status === 'partial' ||
status === 'skipped' ||
status === 'failed'
);
}
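As a usage illustration for the helper above, the sketch below routes a "can markdown be shown" check through `isMdSuccess`, so it behaves the same whether the API returns `'success'` or `'completed'`. `isMdSuccess` is copied from the file above to keep the block self-contained; `DocLike` and `canShowMarkdown` are hypothetical names modeled on the derived value in the viewer component.

```typescript
// Copied from mdStatus.ts so this sketch runs on its own.
function isMdSuccess(status: string | null | undefined): boolean {
  return status === 'success' || status === 'completed';
}

// Hypothetical minimal document shape (illustrative only).
interface DocLike {
  md_status?: string | null;
  md_content?: string | null;
}

// Markdown is shown only when conversion succeeded and the content is non-blank.
function canShowMarkdown(doc: DocLike | null): boolean {
  return !!(isMdSuccess(doc?.md_status) && doc?.md_content?.trim());
}

console.log(canShowMarkdown({ md_status: 'completed', md_content: '# x' })); // true
console.log(canShowMarkdown({ md_status: 'failed', md_content: '# x' })); // false
```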
@@ -0,0 +1,128 @@
// Pure-function regression tests. Run locally with zero dependencies: node --test src/lib/utils/outlineAnchors.test.ts
// (Node ≥23, or 22.6+ with --experimental-strip-types, which strips TS types natively.)
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { buildAnchorMap } from './outlineAnchors.ts';
import { type DocumentSection } from './headingPath.ts';
let _id = 0;
function sec(p: Partial<DocumentSection>): DocumentSection {
return {
chunk_id: ++_id,
section_title: null,
heading_path: null,
level: null,
node_type: null,
is_leaf: true,
section_type: null,
summary: null,
confidence: null,
...p,
};
}
const md = (lines: string[]) => lines.join('\n');
const lineOff = (lines: string[], idx: number) => {
let o = 0;
for (let i = 0; i < idx; i++) o += lines[i].length + 1;
return o;
};
test('ATX heading exact match + offset', () => {
const lines = ['# 개요', '본문 a', '## 설계 기준', '본문 b'];
const s = [
sec({ chunk_id: 101, section_title: '개요' }),
sec({ chunk_id: 102, section_title: '설계 기준' }),
];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[101], lineOff(lines, 0));
assert.equal(r.anchors[102], lineOff(lines, 2));
assert.equal(r.matched, 2);
});
test('★ guards against a false early match: a cross-reference appears before the heading', () => {
const lines = ['# 개요', '본 절은 Part UW 를 참조한다.', '내용', '# Part UW', '강판'];
const s = [
sec({ chunk_id: 1, section_title: '개요' }),
sec({ chunk_id: 2, section_title: 'Part UW' }),
];
const r = buildAnchorMap(md(lines), s);
// must resolve to the real heading (line 3), not the cross-reference (line 1)
assert.equal(r.anchors[2], lineOff(lines, 3));
assert.notEqual(r.anchors[2], lineOff(lines, 1));
});
test('duplicate titles: the monotonic cursor matches the Nth occurrence', () => {
const lines = ['## General', 'a', '## Scope', 'b', '## General', 'c'];
const s = [
sec({ chunk_id: 1, section_title: 'General' }),
sec({ chunk_id: 2, section_title: 'Scope' }),
sec({ chunk_id: 3, section_title: 'General' }),
];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 0)); // first General
assert.equal(r.anchors[2], lineOff(lines, 2)); // Scope
assert.equal(r.anchors[3], lineOff(lines, 4)); // second General (not a mis-jump)
});
test('prefix guard: 제1조 does not mis-match 제1조의2', () => {
const lines = ['# 제1조의2', 'x', '# 제1조', 'y'];
const s = [sec({ chunk_id: 1, section_title: '제1조' })];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 2)); // not 제1조의2 (line 0)
});
test('non-ATX plain-text 제N조 (whole-line match)', () => {
const lines = ['제1조(목적) 이 법은 OO 을 정한다.', '본문', '제2조(정의) 용어는...'];
const s = [
sec({ chunk_id: 1, section_title: '제1조(목적) 이 법은 OO 을 정한다.', node_type: 'clause' }),
sec({ chunk_id: 2, section_title: '제2조(정의) 용어는...', node_type: 'clause' }),
];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 0));
assert.equal(r.anchors[2], lineOff(lines, 2));
});
test('window fragments are skipped (no anchor)', () => {
const lines = ['## 절', 'aaa', 'bbb'];
const s = [
sec({ chunk_id: 1, section_title: '절' }),
sec({ chunk_id: 2, section_title: '절', node_type: 'window' }), // fragment that inherits its parent title
];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 0));
assert.equal(r.anchors[2], undefined); // window = jump disabled
assert.equal(r.total, 1);
});
test('headings inside code fences are excluded', () => {
const lines = ['```', '# General', '```', '# General', 'x'];
const s = [sec({ chunk_id: 1, section_title: 'General' })];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 3)); // outside the fence
});
test('miss = no anchor (jump disabled, never a mis-jump)', () => {
const lines = ['# 개요', '본문'];
const s = [
sec({ chunk_id: 1, section_title: '개요' }),
sec({ chunk_id: 2, section_title: '존재하지 않는 절' }),
];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 0));
assert.equal(r.anchors[2], undefined);
assert.equal(r.total, 2);
assert.equal(r.matched, 1);
});
test('fallback to the last heading_path segment', () => {
const lines = ['# 도입', 'x'];
const s = [sec({ chunk_id: 1, section_title: null, heading_path: 'A > 도입' })];
const r = buildAnchorMap(md(lines), s);
assert.equal(r.anchors[1], lineOff(lines, 0));
});
test('empty input is safe', () => {
assert.deepEqual(buildAnchorMap('', [sec({ section_title: 'x' })]).anchors, {});
assert.deepEqual(buildAnchorMap('# x', []).anchors, {});
assert.deepEqual(buildAnchorMap(null, null).anchors, {});
});
+101
@@ -0,0 +1,101 @@
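// Computes anchor offsets for deterministic outline (section TOC) → body jumps (path A: FE-only).
//
// hier sections (section_title) come from heading lines in md_content (builder.py build_hier_tree,
// a pure function of md_content), but non-ATX headings (제N조/Chapter) produce no markdown heading
// element or id in the body, and duplicate titles (표-1, Part UW, …) are common, so slug and
// textContent matching break. Instead we locate each section's heading position (char offset)
// directly in md_content and produce the coordinates where <span id="sec-{chunk_id}"> is injected.
//
// ★ Three-layer guard against false early matches (from review):
// 1. Line-start (whole-line) matching: a mid-body cross-reference ("see Part UW for…") is excluded
// because the whole line does not equal the title. Only heading lines match (the whole line
// after stripping a leading #/list marker).
// 2. Full match + truncation handling: no 'first-N-chars' prefix matching (stops '제1조' from
// mis-matching '제1조의2'). builder truncates KO/ENG titles with [:200], so startsWith is used
// only when the title looks truncated (very long).
// 3. Monotonic cursor + code-fence avoidance: matching starts from the line after the previous
// match (never backwards) and skips the inside of ``` / ~~~ fences.
// A miss or a backward match yields no anchor = jump disabled (accordion fallback). No jump beats a wrong jump.
//
// ⚠ Remaining limitation: if a table of contents (TOC) ahead of the body lists the section titles in
// order, each on its own line, the cursor can latch onto the TOC first (cascading shift). The
// 'accuracy' measurement in 4-1 detects this; if it is frequent, switch to path B (builder offset).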
import { cleanHeading, type DocumentSection } from './headingPath.ts';
const TRUNCATE_HINT = 180; // builder.py truncates KO/ENG titles with [:200]; a title close to that length is treated as truncated
function norm(s: string | null | undefined): string {
return cleanHeading(s).toLowerCase();
}
/** Turns one line into heading-candidate text: strips a leading ATX # (1-6) / list marker (-*+) / blockquote (>), then normalizes. */
function normLine(raw: string): string {
const stripped = raw.replace(/^\s{0,3}(?:#{1,6}\s+|[-*+]\s+|>\s+)?/, '');
return cleanHeading(stripped).toLowerCase();
}
export interface AnchorMapResult {
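/** chunk_id → char offset of the heading line start in md_content. (Absent = jump disabled.) */
anchors: Record<number, number>;
/** Number of candidate sections (non-window, with a title): the 4-1 coverage denominator. */
total: number;
/** Number of trusted anchors: the 4-1 coverage numerator. (Accuracy is verified manually, separately.) */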
matched: number;
}
/**
* sections are assumed to arrive in chunk_index order (the ORDER BY of GET /documents/{id}/sections), because the cursor only moves forward.
*/
export function buildAnchorMap(
mdContent: string | null | undefined,
sections: DocumentSection[] | null | undefined,
): AnchorMapResult {
const anchors: Record<number, number> = {};
if (!mdContent || !sections || sections.length === 0) {
return { anchors, total: 0, matched: 0 };
}
// Precompute per line: (offset, normalized text). Fenced lines get an empty norm.
const rawLines = mdContent.split('\n');
const lines: { off: number; norm: string }[] = [];
let off = 0;
let inFence = false;
for (const raw of rawLines) {
const fenceToggle = /^\s{0,3}(```|~~~)/.test(raw);
const fencedHere = inFence || fenceToggle; // fence boundary lines are excluded from matching too
lines.push({ off, norm: fencedHere ? '' : normLine(raw) });
if (fenceToggle) inFence = !inFence;
off += raw.length + 1; // '\n'
}
let cursor = 0; // monotonically advancing line index
let total = 0;
let matched = 0;
for (const s of sections) {
// window/section_split fragments have no heading of their own (they inherit the parent title) → skip.
if (s.node_type === 'window' || s.node_type === 'section_split') continue;
let nt = norm(s.section_title);
if (!nt && s.heading_path) {
const last = s.heading_path.split('>').pop();
nt = norm(last);
}
if (!nt) continue;
total++;
const truncated = nt.length >= TRUNCATE_HINT;
let foundIdx = -1;
for (let i = cursor; i < lines.length; i++) {
const ln = lines[i].norm;
if (!ln) continue; // blank line / inside a fence
if (ln === nt || (truncated && ln.startsWith(nt))) {
foundIdx = i;
break;
}
}
if (foundIdx >= 0) {
anchors[s.chunk_id] = lines[foundIdx].off;
cursor = foundIdx + 1; // monotonic: the next section only searches after this line
matched++;
}
// miss → no anchor (jump disabled, fallback)
}
return { anchors, total, matched };
}
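
The commit message says MarkdownDoc splices `<span id="sec-{chunkId}" class="md-anchor">` into md_content at each offset, in descending order, before rendering. A minimal sketch of that step is below. `spliceAnchors` is a hypothetical name, and placing the span on its own line followed by a blank line (so the heading line still parses as a heading) is an assumption; the real MarkdownDoc logic is not shown in this diff.

```typescript
// Hypothetical helper, not the actual MarkdownDoc code.
// anchors: chunk_id -> char offset, the shape of AnchorMapResult.anchors.
type Anchors = Record<number, number>;

function spliceAnchors(md: string, anchors: Anchors): string {
  // Process offsets in descending order so that each insertion leaves
  // every smaller (earlier) offset still valid.
  const entries = Object.entries(anchors)
    .map(([id, off]) => ({ id: Number(id), off }))
    .filter((e) => e.off >= 0 && e.off <= md.length)
    .sort((a, b) => b.off - a.off);

  let out = md;
  for (const { id, off } of entries) {
    // Assumption: span on its own line, then a blank line, then the heading.
    const tag = `<span id="sec-${id}" class="md-anchor"></span>\n\n`;
    out = out.slice(0, off) + tag + out.slice(off);
  }
  return out;
}
```

Descending order is what lets the offsets produced by `buildAnchorMap` be used without adjustment: inserting at the largest offset first never shifts text that sits before it.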
+46
@@ -0,0 +1,46 @@
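// Single source of truth for viewer type classification, shared by the detail page
// (/documents/[id]) and the 3-pane center reader (DocumentViewer). If each keeps its own
// getViewerType, the csv/hwp/office branches drift (the split recurs), so they converge here.
//
// ⚠ Consuming components need a render branch for every ViewerType this function can return.
// (Unifying classification ≠ unifying rendering: check that neither component's {#if viewerType===...} misses a case.)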
export type ViewerType =
| 'article'
| 'markdown'
| 'hwp-markdown'
| 'pdf'
| 'preview-pdf'
| 'image'
| 'text'
| 'synology'
| 'cad'
| 'unsupported';
const MARKDOWN = new Set(['md', 'txt']);
// Rendering csv/json/xml/html as markdown collapses commas/rows into one paragraph → keep the original shape in <pre>.
const TEXT = new Set(['csv', 'json', 'xml', 'html']);
const HWP = new Set(['hwp', 'hwpx']);
// LibreOffice headless → PDF preview (/preview), shown in-app.
const OFFICE_PREVIEW = new Set(['docx', 'xlsx', 'pptx', 'odt', 'ods', 'odp']);
// Synology Office native formats: unsuitable for in-app conversion, open in the external editor.
const SYNOLOGY = new Set(['odoc', 'osheet']);
const IMAGE = new Set(['jpg', 'jpeg', 'png', 'gif', 'bmp', 'tiff']);
const CAD = new Set(['dwg', 'dxf']);
export function getViewerType(
format: string | null | undefined,
sourceChannel?: string | null,
): ViewerType {
if (sourceChannel === 'news') return 'article';
const f = (format ?? '').toLowerCase();
if (MARKDOWN.has(f)) return 'markdown';
if (f === 'pdf') return 'pdf';
if (HWP.has(f)) return 'hwp-markdown';
if (OFFICE_PREVIEW.has(f)) return 'preview-pdf';
if (SYNOLOGY.has(f)) return 'synology';
if (IMAGE.has(f)) return 'image';
if (TEXT.has(f)) return 'text';
if (CAD.has(f)) return 'cad';
return 'unsupported';
}
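
The warning above asks every consumer to carry a render branch for each `ViewerType`. One way to get the compiler to enforce that in plain TypeScript is an exhaustive `switch` with a `never` check, sketched below. `renderBranch` and its return labels are hypothetical (only the `text` → `<pre>` pairing comes from the comments in this file), and the Svelte `{#if viewerType===...}` chains in the real components would need a different mechanism.

```typescript
// The union is repeated here only so the sketch is self-contained.
type ViewerType =
  | 'article' | 'markdown' | 'hwp-markdown' | 'pdf' | 'preview-pdf'
  | 'image' | 'text' | 'synology' | 'cad' | 'unsupported';

// Hypothetical mapping from viewer type to a render branch label.
function renderBranch(t: ViewerType): string {
  switch (t) {
    case 'article':
    case 'markdown':
    case 'hwp-markdown':
      return 'markdown-renderer';
    case 'pdf':
    case 'preview-pdf':
      return 'pdf-frame';
    case 'image':
      return 'img';
    case 'text':
      return 'pre';
    case 'synology':
      return 'external-link';
    case 'cad':
      return 'cad-placeholder';
    case 'unsupported':
      return 'fallback';
    default: {
      // If a new member is added to ViewerType without a case above,
      // this assignment stops compiling.
      const missing: never = t;
      throw new Error(`unhandled viewer type: ${String(missing)}`);
    }
  }
}
```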
+7 -7
@@ -195,7 +195,7 @@
</script>
<div class="p-4 lg:p-8">
<div class="max-w-5xl mx-auto">
<div class="max-w-[1680px] mx-auto">
<!-- ═══ Greeting header ═══ -->
<div class="flex items-baseline gap-2.5 flex-wrap">
@@ -211,7 +211,7 @@
<Skeleton w="w-full" h="h-4" class="mt-4" />
<Skeleton w="w-2/3" h="h-4" class="mt-2" />
</div>
<div class="grid grid-cols-1 lg:grid-cols-[1fr_320px] gap-5">
<div class="grid grid-cols-1 lg:grid-cols-[1fr_360px] gap-5">
<div class="space-y-5">
<div class="bg-surface border border-default rounded-card p-5"><Skeleton w="w-24" h="h-4" /><Skeleton w="w-full" h="h-10" class="mt-3" /></div>
<div class="bg-surface border border-default rounded-card p-5"><Skeleton w="w-24" h="h-4" /><Skeleton w="w-full" h="h-40" class="mt-3" /></div>
@@ -280,7 +280,7 @@
</div>
<!-- ═══ 2-column body ═══ -->
<div class="grid grid-cols-1 lg:grid-cols-[1fr_320px] gap-5 items-start">
<div class="grid grid-cols-1 lg:grid-cols-[1fr_360px] gap-5 items-start">
<!-- ─── Left ─── -->
<div class="space-y-5">
@@ -329,13 +329,13 @@
<div class="flex flex-col">
{#each summary.recent_documents as doc, i (doc.id)}
<a href="/documents/{doc.id}"
class="grid grid-cols-[auto_14px_1fr] gap-x-3 py-2.5 {i > 0 ? 'border-t border-default' : ''} group">
class="grid grid-cols-[auto_14px_minmax(0,1fr)] gap-x-3 py-2.5 {i > 0 ? 'border-t border-default' : ''} group">
<div class="text-[10px] text-faint text-right pt-1 whitespace-nowrap tabular-nums w-14">{formatTime(doc.created_at)}</div>
<div class="flex flex-col items-center">
<span class="w-2 h-2 rounded-full mt-1.5 shrink-0 {domainBgClass(doc.ai_domain)}"></span>
{#if i < summary.recent_documents.length - 1}<span class="flex-1 w-px bg-default mt-1"></span>{/if}
</div>
<div class="pb-1">
<div class="pb-1 min-w-0">
<div class="text-[10px] font-bold uppercase tracking-wide text-dim mb-0.5">{domainLabel(doc.ai_domain)}</div>
<div class="text-[13px] text-text leading-snug group-hover:text-accent transition-colors truncate">{doc.title || '제목 없음'}</div>
</div>
@@ -382,7 +382,7 @@
{#each domainDist.slice(0, 6) as d (d.name)}
<a href="/documents?domain={encodeURIComponent(d.name)}" class="flex items-center gap-2 text-xs hover:text-accent transition-colors group">
<span class="w-2.5 h-2.5 rounded-sm shrink-0 {domainBgClass(d.name)}"></span>
<span class="flex-1 text-text truncate group-hover:text-accent">{domainLabel(d.name)}</span>
<span class="flex-1 min-w-0 text-text truncate group-hover:text-accent">{domainLabel(d.name)}</span>
<span class="font-semibold text-dim tabular-nums">{d.count.toLocaleString()}</span>
</a>
{/each}
@@ -410,7 +410,7 @@
{#each pinnedMemos as memo (memo.id)}
<a href="/memos" class="flex items-start gap-2.5 px-3 py-2.5 rounded-lg bg-bg hover:bg-surface-hover transition-colors">
<span class="text-[9px] font-bold rounded px-1.5 py-0.5 uppercase tracking-wide shrink-0 mt-0.5 text-accent-hover bg-accent/10">메모</span>
<span class="text-xs text-text leading-snug flex-1">{pinTitle(memo)}</span>
<span class="text-xs text-text leading-snug flex-1 min-w-0 break-words">{pinTitle(memo)}</span>
<Pin size={11} class="text-faint shrink-0 mt-0.5" />
</a>
{/each}
+4 -4
@@ -204,8 +204,8 @@
<div class="h-full overflow-auto">
<!-- Top search bar (sticky) -->
<div class="sticky top-0 z-10 bg-bg/80 backdrop-blur border-b border-default px-4 py-3">
<div class="flex items-center gap-2 max-w-5xl mx-auto">
<div class="relative flex-1">
<div class="flex flex-wrap items-center gap-2 max-w-[1680px] mx-auto">
<div class="relative flex-1 min-w-0">
<Search
size={14}
class="absolute left-3 top-1/2 -translate-y-1/2 text-dim pointer-events-none"
@@ -234,7 +234,7 @@
<select
bind:value={selectedBackend}
title="Backend 선택 — silent fallback 0 정책 (선택한 backend 만 시도, 실패 시 503)."
class="py-2 px-2 bg-surface border border-default rounded-lg text-text text-xs focus:border-accent outline-none"
class="py-2 px-2 bg-surface border border-default rounded-lg text-text text-xs focus:border-accent outline-none min-w-0 max-w-[42vw] truncate"
>
<option value="auto">Auto (router)</option>
<option value="mac-mini-default">Mac mini (default)</option>
@@ -261,7 +261,7 @@
</div>
<!-- Body -->
<div class="max-w-5xl mx-auto p-4">
<div class="max-w-[1680px] mx-auto p-4">
{#if backendUnavailable}
<div class="py-16">
<EmptyState
+1 -1
@@ -53,7 +53,7 @@
}
</script>
<div class="p-6 max-w-[1200px] mx-auto">
<div class="p-6 max-w-[1680px] mx-auto">
<header class="flex items-center gap-2 mb-4">
<Mic size={20} />
<h1 class="text-xl font-semibold">Audio</h1>
+1 -1
@@ -428,7 +428,7 @@
/* ── App shell ── */
.app {
max-width: 1180px;
max-width: 1680px;
margin: 0 auto;
background: var(--surface);
min-height: 100vh;
+2 -1
@@ -11,6 +11,7 @@
import { Info, X, Plus, Trash2, Tag, FolderTree, Sparkles, ChevronLeft, ArrowUpDown } from 'lucide-svelte';
import DocumentViewer from '$lib/components/DocumentViewer.svelte';
import MarkdownStatusBadge from '$lib/components/MarkdownStatusBadge.svelte';
import { isMdStatusVisible } from '$lib/utils/mdStatus';
import UploadDropzone from '$lib/components/UploadDropzone.svelte';
import Drawer from '$lib/components/ui/Drawer.svelte';
import Modal from '$lib/components/ui/Modal.svelte';
@@ -679,7 +680,7 @@
{#if doc.ai_sub_group}<div class="flex justify-between gap-2 text-xs py-1"><span class="text-dim">하위</span><span class="text-text font-medium text-right truncate">{doc.ai_sub_group}</span></div>{/if}
<div class="flex justify-between gap-2 text-xs py-1"><span class="text-dim">수정</span><span class="text-text font-medium text-right">{shortDate(doc.updated_at || doc.created_at)}</span></div>
{#if size}<div class="flex justify-between gap-2 text-xs py-1"><span class="text-dim">원본</span><span class="text-text font-medium text-right">{size}</span></div>{/if}
{#if ['processing', 'success', 'partial', 'skipped', 'failed'].includes(doc.md_status)}<div class="flex items-center justify-between gap-2 text-xs py-1"><span class="text-dim">md 상태</span><MarkdownStatusBadge mdStatus={doc.md_status} mdExtractionError={doc.md_extraction_error} mdExtractionQuality={doc.md_extraction_quality} /></div>{/if}
{#if isMdStatusVisible(doc.md_status)}<div class="flex items-center justify-between gap-2 text-xs py-1"><span class="text-dim">md 상태</span><MarkdownStatusBadge mdStatus={doc.md_status} mdExtractionError={doc.md_extraction_error} mdExtractionQuality={doc.md_extraction_quality} /></div>{/if}
{#if doc.read_count}<div class="flex justify-between gap-2 text-xs py-1"><span class="text-dim">읽음</span><span class="text-text font-medium text-right">{doc.read_count}</span></div>{/if}
</div>
</div>
@@ -6,6 +6,8 @@
import { page } from '$app/stores';
import { goto } from '$app/navigation';
import { api, getAccessToken } from '$lib/api';
import { isMdSuccess } from '$lib/utils/mdStatus';
import { buildAnchorMap } from '$lib/utils/outlineAnchors';
import { addToast } from '$lib/stores/toast';
import { marked } from 'marked';
import DOMPurify from 'dompurify';
@@ -147,7 +149,7 @@
let pdfViewMode = $state('markdown'); // 'markdown' | 'pdf'
let lastDocId = $state(null);
let canShowMarkdown = $derived(
!!(doc?.md_status === 'success' && doc?.md_content?.trim())
!!(isMdSuccess(doc?.md_status) && doc?.md_content?.trim())
);
$effect(() => {
@@ -162,6 +164,45 @@
}
});
// ── Outline jump (outlineAnchors, path A) ──
// anchorMap = heading offset of each section in md_content. MarkdownDoc injects <span id="sec-N">.
let anchorMap = $derived(
hasSections && canShowMarkdown && doc?.md_content
? buildAnchorMap(doc.md_content, sections).anchors
: {}
);
let activeKey = $state(null);
function jumpToSection(chunkId) {
const el = document.getElementById(`sec-${chunkId}`);
if (el) el.scrollIntoView({ behavior: 'smooth', block: 'start' });
}
// scroll-spy: the last .md-anchor that has passed the top of the screen (120px) = current section. [id] uses window scroll.
$effect(() => {
void anchorMap; // rebind when the document/sections change
if (typeof window === 'undefined') return;
let raf = 0;
const onScroll = () => {
if (raf) return;
raf = requestAnimationFrame(() => {
raf = 0;
let cur = null;
document.querySelectorAll('.md-anchor').forEach((a) => {
if (a.getBoundingClientRect().top <= 120) cur = a;
});
if (cur) {
const m = cur.id.match(/^sec-(\d+)$/);
if (m) activeKey = Number(m[1]);
}
});
};
window.addEventListener('scroll', onScroll, { passive: true });
onScroll();
return () => {
window.removeEventListener('scroll', onScroll);
if (raf) cancelAnimationFrame(raf);
};
});
function getViewerType(format) {
if (['md', 'txt', 'csv', 'html'].includes(format)) return 'markdown';
if (format === 'pdf') return 'pdf';
@@ -228,7 +269,7 @@
<!-- Left section outline: xl+ sticky rail (on smaller viewports, a collapsible at the top of the body) -->
<aside class="hidden xl:block xl:sticky xl:top-6 xl:self-start xl:max-h-[calc(100vh-3rem)] xl:overflow-y-auto">
<Card>
<SectionOutline {sections} />
<SectionOutline {sections} onJump={jumpToSection} {activeKey} />
</Card>
</aside>
{/if}
@@ -239,7 +280,7 @@
<!-- Below xl: collapsible section outline -->
<details class="xl:hidden">
<summary class="cursor-pointer text-sm text-dim px-1 py-2 select-none">절 목차 ({sections.length})</summary>
<Card class="mt-2"><SectionOutline {sections} /></Card>
<Card class="mt-2"><SectionOutline {sections} onJump={jumpToSection} {activeKey} /></Card>
</details>
{/if}
<!-- Affordance row -->
@@ -288,6 +329,7 @@
mdStatus={doc.md_status}
mdExtractionError={doc.md_extraction_error}
mdExtractionQuality={doc.md_extraction_quality}
anchorMap={anchorMap}
extractedText={doc.extracted_text || rawMarkdown}
class="prose prose-invert prose-base lg:prose-sm max-w-none"
/>
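
The scroll-spy effect added in this file picks the last `.md-anchor` whose top edge has passed a 120px line and parses the chunk id out of its `sec-N` id. The selection rule can be isolated as a pure function, which makes it testable without a DOM. This is only a sketch: `activeSectionId` and `AnchorPos` are hypothetical names, and `top` stands in for `getBoundingClientRect().top`.

```typescript
// Hypothetical, DOM-free version of the selection rule used by the scroll-spy.
interface AnchorPos {
  id: string;  // element id, expected to look like "sec-123"
  top: number; // stand-in for getBoundingClientRect().top
}

function activeSectionId(anchors: AnchorPos[], threshold = 120): number | null {
  let cur: AnchorPos | null = null;
  // Anchors are assumed to be in document order, so the last one at or
  // above the threshold line is the section currently being read.
  for (const a of anchors) {
    if (a.top <= threshold) cur = a;
  }
  if (!cur) return null;
  const m = cur.id.match(/^sec-(\d+)$/);
  return m ? Number(m[1]) : null;
}
```

In the effect itself a `null` result leaves `activeKey` unchanged rather than clearing it, so a caller of this sketch would do the same to mirror that behavior.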
+3 -3
@@ -223,7 +223,7 @@
<title>events · hyungi PKM</title>
</svelte:head>
<div class="mx-auto max-w-3xl space-y-6 px-4 py-6">
<div class="mx-auto max-w-[1240px] space-y-6 px-4 py-6 sm:px-6 lg:px-8">
<header class="flex items-end justify-between gap-3">
<div class="space-y-1">
<h1 class="text-2xl font-semibold">events</h1>
@@ -278,13 +278,13 @@
<li>
<Card class="flex items-start gap-3 p-3 {KIND_COLOR[item.kind]}">
<div class="flex-1 min-w-0">
<div class="flex items-center gap-2 text-xs text-slate-500">
<div class="flex min-w-0 items-center gap-2 text-xs text-slate-500">
<span>{KIND_LABEL[item.kind]}</span>
<span class="rounded px-1.5 py-0.5 text-[10px] {STATUS_COLOR[item.status]}">
{STATUS_LABEL[item.status]}
</span>
{#if item.project_tag}
<span class="text-slate-400">#{item.project_tag}</span>
<span class="min-w-0 break-all text-slate-400">#{item.project_tag}</span>
{/if}
</div>
<a href="/events/{item.id}" class="mt-1 block break-words text-sm font-medium hover:underline">
+2 -2
@@ -229,7 +229,7 @@
}
</script>
<div class="p-4 lg:p-6 max-w-5xl mx-auto">
<div class="p-4 lg:p-6 max-w-[1240px] mx-auto">
<!-- Header -->
<div class="flex items-center justify-between mb-4">
<div class="flex items-center gap-3">
@@ -355,7 +355,7 @@
<span class="text-faint"><FormatIcon format={doc.file_format} size={14} /></span>
<a
href="/documents/{doc.id}"
class="text-sm font-medium text-text hover:text-accent truncate"
class="text-sm font-medium text-text hover:text-accent truncate min-w-0"
>
{doc.title || '제목 없음'}
</a>
+6 -6
@@ -438,7 +438,7 @@
<div class="p-4 lg:p-6">
<!-- breadcrumb -->
<div class="flex items-center gap-2 text-sm mb-4 text-dim">
<div class="flex flex-wrap items-center gap-2 text-sm mb-4 text-dim">
<a href="/documents" class="hover:text-text">문서</a>
<span class="text-faint">/</span>
<span class="text-text">자료실</span>
@@ -448,7 +448,7 @@
<button
type="button"
onclick={() => navigate(activePath.split('/').slice(0, i + 1).join('/'))}
class="hover:text-text"
class="hover:text-text min-w-0 truncate max-w-[40vw]"
>
{segment}
</button>
@@ -457,14 +457,14 @@
</div>
<!-- Approval inbox (§2): documents with ai_suggestion.proposed_category='library' -->
<div class="max-w-7xl mx-auto mb-4">
<div class="max-w-[1680px] mx-auto mb-4">
<SuggestionReview
proposedCategory="library"
onChange={handleSuggestionChange}
/>
</div>
<div class="max-w-7xl mx-auto grid grid-cols-1 lg:grid-cols-12 gap-6">
<div class="max-w-[1680px] mx-auto grid grid-cols-1 lg:grid-cols-12 gap-6">
<!-- Left: tree (5/12) -->
<aside class="lg:col-span-5 xl:col-span-4">
<div class="bg-surface border border-default rounded-card p-3">
@@ -532,14 +532,14 @@
<button
onclick={() => navigate(n.path)}
class="flex-1 flex items-center justify-between px-2 py-1.5 rounded-md text-sm transition-colors
class="flex-1 min-w-0 flex items-center justify-between px-2 py-1.5 rounded-md text-sm transition-colors
{isActive
? 'bg-accent/15 text-accent'
: isParent
? 'text-text'
: 'text-dim hover:bg-surface-hover hover:text-text'}"
>
<span class="truncate">{n.name}</span>
<span class="truncate min-w-0">{n.name}</span>
<span class="text-xs text-dim shrink-0 ml-2">{n.count}</span>
</button>
+4 -2
@@ -656,12 +656,14 @@
</div>
<style>
.memo-content { overflow-wrap: anywhere; word-break: break-word; }
.memo-content :global(p) { margin: 0.2em 0; }
.memo-content :global(ul), .memo-content :global(ol) { margin: 0.2em 0; padding-left: 1.5em; }
.memo-content :global(li) { margin: 0.1em 0; }
.memo-content :global(code) { background: var(--bg); padding: 0.1em 0.3em; border-radius: 3px; font-size: 0.85em; }
.memo-content :global(code) { background: var(--bg); padding: 0.1em 0.3em; border-radius: 3px; font-size: 0.85em; overflow-wrap: anywhere; word-break: break-word; }
.memo-content :global(pre) { background: var(--bg); padding: 0.75em; border-radius: 6px; overflow-x: auto; margin: 0.5em 0; }
.memo-content :global(a) { color: var(--accent); }
.memo-content :global(table) { display: block; overflow-x: auto; max-width: 100%; }
.memo-content :global(a) { color: var(--accent); overflow-wrap: anywhere; word-break: break-word; }
.memo-content :global(blockquote) { border-left: 3px solid var(--border-default); padding-left: 0.75em; color: var(--text-dim); margin: 0.5em 0; }
.memo-content :global(.memo-checkbox) {
cursor: pointer;
+1 -1
@@ -50,7 +50,7 @@
}
</script>
<div class="p-6 max-w-[1400px] mx-auto">
<div class="p-6 max-w-[1680px] mx-auto">
<header class="flex items-center gap-2 mb-4">
<Film size={20} />
<h1 class="text-xl font-semibold">Video</h1>
+18
@@ -0,0 +1,18 @@
-- 317_documents_dedup_fields.sql
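-- S1-ADD (plan ds-s1-backend-1, A-1): three columns for the original filename and duplicate-check metadata.
-- Contract: ds-app contract/CONTRACT.md [S1-ADD]: original_filename / duplicate_of / duplicate_count.
--
-- asyncpg exec_driver_sql allows a single statement only; an ALTER TABLE with several ADD COLUMN clauses is one statement, so it is allowed.
-- No BEGIN/COMMIT. PG 16: ADD COLUMN ... DEFAULT <constant> takes the fast path (no table rewrite).
-- The duplicate_of self-FK sits on a new all-NULL column, so the validation scan is trivial (NOT VALID is not needed).
-- ON DELETE SET NULL: allows a hard delete of the original (canonical) row (avoids RESTRICT, which blocks the delete, and CASCADE, which risks cascading deletion of the copies).
-- Backfilling duplicate_of/duplicate_count for existing dup groups (law_monitor excluded) is a separate B-4 batch script.
-- A bulk UPDATE over 28,941 rows does not go into the startup migration (single transaction).
--
-- original_filename differs in meaning from original_format (for ODF conversion) and from original_path/original_hash
-- (migration 007 legacy, dead: not referenced by app code, P0-1 grep 0 hits): it is the original filename at upload time (used as the download label).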
ALTER TABLE documents
ADD COLUMN IF NOT EXISTS original_filename TEXT,
ADD COLUMN IF NOT EXISTS duplicate_of BIGINT REFERENCES documents(id) ON DELETE SET NULL,
ADD COLUMN IF NOT EXISTS duplicate_count INTEGER NOT NULL DEFAULT 0;
+90
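@@ -0,0 +1,90 @@
"""Backfill of existing file_hash duplicate groups (plan ds-s1-backend-1 B-4).
Purpose:
Fill duplicate_of / duplicate_count, added by A-1 migration 317, for *existing* duplicate groups.
This is a batch kept separate from the migration (single transaction): per the database.py:29-30 policy, bulk UPDATEs
do not go into the startup migration. Fill-at-upload (B-1) only covers new rows, so past rows are handled by this script.
Decision rule:
- file_hash exact groups (OFF-whitelist=law_monitor excluded, deleted excluded, count>1).
near_duplicate persistence is deferred (computed on the fly), so it is not handled here.
- canonical = oldest row in the group (min id). canonical.duplicate_of=NULL, duplicate_count=group_size-1.
- non-canonical members: duplicate_of=canonical, duplicate_count=0.
Safety:
- Idempotent: rows already at the target value are not UPDATEd (safe to re-run). --dry-run previews the exact set that would be applied.
- --chunk (default 500) rows per txn, committed per chunk, to avoid a single-transaction lock over 28,941 rows.
Run:
docker compose exec fastapi python /app/scripts/backfill_dedup.py --dry-run
docker compose exec fastapi python /app/scripts/backfill_dedup.py --apply
# The safety net before the change is the E-3 pre-B-4 pg_dump (separate step).
"""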
import argparse
import asyncio
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "app"))
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
from services.dedup import reconcile_dedup # core recomputation (shared with the nightly job)
async def run(*, apply: bool, chunk_size: int) -> int:
database_url = os.getenv(
"DATABASE_URL", "postgresql+asyncpg://pkm:pkm@localhost:5432/pkm"
)
engine = create_async_engine(database_url)
session_factory = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
try:
async with session_factory() as session:
result = await reconcile_dedup(session, apply=apply, chunk_size=chunk_size)
print(f"=== dedup 그룹 {result['groups']}개 · 관련 문서 {result['docs']}건 ===")
if result["groups"] == 0:
print("dedup 그룹 없음(OFF-whitelist 제외 후 count>1 없음) — 종료.")
return 0
already = result["docs"] - result["changes"]
print(f"변경 필요 {result['changes']}건 / 이미 목표값 {already}건 (멱등)")
if result["changes"] == 0:
print("모두 목표값 — 적용할 변경 없음.")
return 0
# Preview of the exact UPDATE set that will be / was applied (top 40 rows)
print("\n=== UPDATE set (id → duplicate_of / duplicate_count) ===")
for s in result["sample"]:
role = "canonical" if s["duplicate_of"] is None else f"dup→{s['duplicate_of']}"
print(
f" id={s['id']:>7} duplicate_of={s['duplicate_of']} "
f"duplicate_count={s['duplicate_count']} [{role}]"
)
if result["changes"] > len(result["sample"]):
print(f" ... 외 {result['changes'] - len(result['sample'])}")
if not apply:
print(f"\n[dry-run] {result['changes']}건 변경 예정. --apply 로 실제 적용.")
else:
print(f"\n[apply] 완료 — {result['applied']}건 갱신.")
return 0
finally:
await engine.dispose()
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--apply", action="store_true", help="실제 적용 (기본 dry-run)")
parser.add_argument("--dry-run", action="store_true", help="명시적 dry-run (default 동등)")
parser.add_argument("--chunk", type=int, default=500, help="txn 당 UPDATE 행 수 (기본 500)")
args = parser.parse_args()
if args.apply and args.dry_run:
parser.error("--apply 와 --dry-run 동시 지정 불가")
return asyncio.run(run(apply=args.apply, chunk_size=args.chunk))
if __name__ == "__main__":
sys.exit(main())
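
The decision rule in the docstring (canonical = min id per file_hash group; the canonical row carries `duplicate_count = group_size - 1`; every other member points at it with a count of 0) is small enough to state as a pure function. The sketch below is written in TypeScript only to match the frontend code in this change. It is not `reconcile_dedup`, whose implementation is not shown here; `planDedup`, `Doc`, and `DedupTarget` are hypothetical names, and the input is assumed to be already filtered (law_monitor and deleted rows removed).

```typescript
// Hypothetical illustration of the rule; not the services.dedup implementation.
interface Doc { id: number; fileHash: string }
interface DedupTarget { id: number; duplicateOf: number | null; duplicateCount: number }

function planDedup(docs: Doc[]): DedupTarget[] {
  // Group ids by exact file hash.
  const groups = new Map<string, number[]>();
  for (const d of docs) {
    const ids = groups.get(d.fileHash) ?? [];
    ids.push(d.id);
    groups.set(d.fileHash, ids);
  }

  const out: DedupTarget[] = [];
  groups.forEach((ids) => {
    if (ids.length < 2) return; // only groups with count > 1 are dup groups
    const canonical = Math.min(...ids); // oldest row = smallest id
    for (const id of ids) {
      out.push(
        id === canonical
          ? { id, duplicateOf: null, duplicateCount: ids.length - 1 }
          : { id, duplicateOf: canonical, duplicateCount: 0 },
      );
    }
  });
  return out.sort((a, b) => a.id - b.id);
}
```

Because the output depends only on the current set of rows, computing it again after a partial run yields the same targets, which is the property the script relies on when it skips rows already at their target value.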
+146
@@ -0,0 +1,146 @@
"""Markdown-stage backfill for past office/hwp pending documents (plan ds-s1-backend-1 C-4).
New ingests reach the classify→markdown transition (queue_consumer.py:142) automatically, so this script
only handles *past* office/hwp rows. It re-enqueues the rows that were skipped at the markdown stage before
the C-2 office_md conversion was attached, so that md_content gets generated.
Targets (WHERE):
- file_format IN (values office_md actually supports: docx, xlsx, pptx, hwp, hwpx)
Exclusion is by file_format. The INCLUDE filter structurally excludes articles (file_format='article'), which is
the P0-3 guard (an article without md must not reach completed; correctness-critical). source_channel is not needed.
Legacy binaries (.doc/.xls/.ppt) are not supported by markitdown, so they are left out of the default list (if added, marker skips them).
- md_status = 'pending' (rows already success/failed/skipped are left alone)
- extracted_text IS NOT NULL (the population that has a fallback)
Inherits the C-5 failed postcondition: a conversion failure is left loudly as md_status='failed' (the app
shows 'conversion failed'). Office rows with NULL extracted_text (no fallback) are excluded: they would be a
loudly failing set, to be revisited in phase2 (the C-4 exclusion is honest).
Schedule:
Do not overlap the backfill with C-2 live office ingestion: the markdown worker is serial at BATCH=1, so
run it as a nightly one-off to avoid stalling live office uploads (plan C-2 reflection).
Run:
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --dry-run
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --apply
docker compose exec fastapi python /app/scripts/backfill_nonpdf_markdown.py --apply --limit 50
"""
import argparse
import asyncio
import json
import os
import sys
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "app"))
from sqlalchemy import bindparam, text
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
# file_format values that office_md actually converts (lowercase extension, no dot). Single source.
DEFAULT_FORMATS = ("docx", "xlsx", "pptx", "hwp", "hwpx")
CANDIDATES_SQL = text(
"""
SELECT id, file_format, title, file_path
FROM documents
WHERE deleted_at IS NULL
AND md_status = 'pending'
AND extracted_text IS NOT NULL
AND file_format IN :formats
ORDER BY id
"""
).bindparams(bindparam("formats", expanding=True))
# Only docs without an active markdown queue row pass (UNIQUE partial index). Conflict = silent skip.
ENQUEUE_SQL = text(
"""
INSERT INTO processing_queue (document_id, stage, status, payload)
VALUES (:doc_id, 'markdown', 'pending', CAST(:payload AS jsonb))
ON CONFLICT DO NOTHING
"""
)
def _chunks(seq, size):
for i in range(0, len(seq), size):
yield seq[i : i + size]
async def run(*, apply: bool, formats: tuple[str, ...], limit: int | None, chunk_size: int) -> int:
database_url = os.getenv(
"DATABASE_URL", "postgresql+asyncpg://pkm:pkm@localhost:5432/pkm"
)
engine = create_async_engine(database_url)
session_factory = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
try:
async with session_factory() as session:
rows = (
await session.execute(CANDIDATES_SQL, {"formats": list(formats)})
).all()
if limit:
rows = rows[:limit]
print(f"=== office/hwp pending 후보 = {len(rows)}건 (formats={','.join(formats)}) ===")
if not rows:
print("후보 없음 — 종료.")
return 0
by_fmt: dict[str, int] = {}
for r in rows:
by_fmt[r.file_format] = by_fmt.get(r.file_format, 0) + 1
print("포맷별:", ", ".join(f"{k}={v}" for k, v in sorted(by_fmt.items())))
for r in rows[:20]:
print(f" id={r.id:>7} {r.file_format:<5} {(r.title or '')[:70]}")
if len(rows) > 20:
print(f" ... 외 {len(rows) - 20}")
if not apply:
print(f"\n[dry-run] {len(rows)}건 markdown 큐 enqueue 예정. --apply 로 실제 적용.")
print(" (적용 전 C-2 라이브 office ingestion 과 비중첩 야간창 확인.)")
return 0
payload = json.dumps(
{"force_reprocess": True, "reason": "c4_nonpdf_markdown_backfill"}
)
inserted = 0
processed = 0
for batch in _chunks(rows, chunk_size):
for r in batch:
result = await session.execute(
ENQUEUE_SQL, {"doc_id": r.id, "payload": payload}
)
if result.rowcount > 0:
inserted += 1
await session.commit()
processed += len(batch)
print(f"[apply] {processed}/{len(rows)} 처리 (enqueue 누적 {inserted})")
print(f"\n[apply] 완료 — {inserted}/{len(rows)} 신규 markdown 큐 추가.")
print(" (skip = 이미 활성 markdown 큐 행이 있는 문서)")
return 0
finally:
await engine.dispose()
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--apply", action="store_true", help="실제 enqueue (기본 dry-run)")
parser.add_argument("--dry-run", action="store_true", help="명시적 dry-run (default 동등)")
parser.add_argument(
"--formats", type=str, default=",".join(DEFAULT_FORMATS),
help=f"쉼표 구분 file_format (기본 {','.join(DEFAULT_FORMATS)})",
)
parser.add_argument("--limit", type=int, default=None, help="후보 상한(샘플 검증용)")
parser.add_argument("--chunk", type=int, default=200, help="enqueue txn 청크 크기")
args = parser.parse_args()
if args.apply and args.dry_run:
parser.error("--apply 와 --dry-run 동시 지정 불가")
formats = tuple(f.strip().lower() for f in args.formats.split(",") if f.strip())
return asyncio.run(
run(apply=args.apply, formats=formats, limit=args.limit, chunk_size=args.chunk)
)
if __name__ == "__main__":
sys.exit(main())
+77
@@ -0,0 +1,77 @@
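#!/usr/bin/env python3
"""C-1 PoC harness: measures office/hwp → md conversion quality (table fidelity in particular).
plan ds-s1-backend-1 C-1/E-1:
- hwp/hwpx results depend on the LibreOffice version, so this must run in the **same version as the prod extract_worker (version-pinned safe context)**
for the signal to transfer. It does not submit jobs to the live worker (zero occupancy).
- OOXML goes through markitdown (new dep): `pip install markitdown`.
- Use **representative complex samples**, not trivial ones (law/KGS-centric .hwp/.hwpx, xlsx with merged cells or multiple sheets).
Usage:
python scripts/poc_office_md.py <file_or_dir> [<file_or_dir> ...]
# e.g. a sample directory of current-corpus backfill candidates (doc/docx/xls/xlsx/hwp)
python scripts/poc_office_md.py ~/poc_samples/
Per file: on success, char count / line count / heading metrics plus a body preview.
Failures (OfficeMdError) print FAILED; these are the cases C-5 routes to md_status='failed' (by design).
"""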
from __future__ import annotations
import os
import sys
from pathlib import Path
# Put app/ on sys.path (for module imports).
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "app"))
from workers.office_md import SUPPORTED, OfficeMdError, convert_office_to_md, table_fidelity # noqa: E402
def _iter_targets(args: list[str]):
for a in args:
p = Path(a).expanduser()
if p.is_dir():
for child in sorted(p.rglob("*")):
if child.is_file() and child.suffix.lower() in SUPPORTED:
yield child
elif p.is_file():
yield p
else:
print(f" (skip, 경로 없음: {p})")
def main(argv: list[str]) -> int:
if not argv:
print(__doc__)
return 2
targets = list(_iter_targets(argv))
if not targets:
print("변환 대상(.docx/.xlsx/.pptx/.hwp/.hwpx) 없음.")
return 1
ok = fail = 0
for path in targets:
print(f"\n=== {path.name} ({path.suffix.lower()}) ===")
try:
md = convert_office_to_md(path)
except OfficeMdError as e:
fail += 1
print(f" FAILED → (C-5 가 md_status='failed' 라우팅) : {e}")
continue
ok += 1
fid = table_fidelity(md)
print(f" OK chars={fid['chars']} lines={fid['lines']} "
f"table_rows={fid['table_pipe_rows']} (sep≈표수 {fid['table_separator_rows']}) "
f"heading={fid['has_heading']}")
preview = "\n".join(f" | {ln}" for ln in md.splitlines()[:12])
print(preview)
print(f"\n--- 합계: OK {ok} / FAILED {fail} / 총 {len(targets)} ---")
print("표 fidelity 가 낮으면(table_rows 0 등) 해당 포맷은 변환기/필터 재검토 — "
"OOXML↔markitdown, hwp/hwpx↔LibreOffice 경계를 데이터로 확정(C-1).")
return 0 if fail == 0 else 1
if __name__ == "__main__":
raise SystemExit(main(sys.argv[1:]))
+18
@@ -0,0 +1,18 @@
-- rollback_317.sql: plan ds-s1-backend-1 E-3. Reverts migration 317 (the 3 dedup columns).
--
-- ★ Kept outside migrations/: it is not picked up by the init_db() auto-scan (NNN_*.sql), so it is never applied automatically.
-- Manual execution only:
-- docker compose cp scripts/rollback_317.sql postgres:/tmp/rollback_317.sql
-- docker compose exec -T postgres psql -U pkm -d pkm -f /tmp/rollback_317.sql
-- (or) docker compose exec -T postgres psql -U pkm -d pkm < scripts/rollback_317.sql
--
-- Caution: permanently deletes the original_filename / duplicate_of / duplicate_count data (including B-1 fills and B-4 backfill results).
-- The 317 row in schema_migrations must be removed as well so the migration can be re-applied (on the next startup).
-- For a full restore, use the E-3 pre-change pg_dump (this script is a 'columns-only quick rollback').
ALTER TABLE documents
DROP COLUMN IF EXISTS duplicate_of,
DROP COLUMN IF EXISTS duplicate_count,
DROP COLUMN IF EXISTS original_filename;
DELETE FROM schema_migrations WHERE version = 317;
+30
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# pre-change pg_dump: plan ds-s1-backend-1 E-3.
# Safety net taken *before* applying A-1 (migration 287) / the B-4 backfill. A real DB dump, not a `cp -p` of the repo.
#
# Usage (on the GPU server, from the repo root):
#   bash scripts/s1_pre_change_backup.sh          # pre-A-1
#   bash scripts/s1_pre_change_backup.sh pre-b4   # pre-B-4 (only the label differs)
#
# Backup location = outside the repo (feedback_backup_outside_repo): $HOME/.local/share/ds-s1-backups/
set -euo pipefail
LABEL="${1:-pre-a1}"
DATE="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="${BACKUP_DIR:-$HOME/.local/share/ds-s1-backups}"
mkdir -p "$BACKUP_DIR"
OUT="$BACKUP_DIR/pkm-${LABEL}-${DATE}.sql.gz"
echo "[s1-backup] pg_dump pkm → $OUT"
# Dump the single pkm DB (not pg_dumpall). gzip writes through a redirect (avoids the filename-guessing pitfall).
docker compose exec -T postgres pg_dump -U pkm -d pkm | gzip > "$OUT"
echo "[s1-backup] done: $(du -h "$OUT" | cut -f1)"
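echo -n "[s1-backup] gzip integrity: "
# Fail loudly: with a bare `gzip -t ... && echo`, `set -e` would not abort on a corrupt archive.
if gzip -t "$OUT"; then echo "OK"; else echo "FAILED"; exit 1; fi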
echo
echo "[s1-backup] rollback options:"
echo " (a) revert only the 287 columns (fast): run scripts/rollback_287.sql manually"
echo " (b) full restore: gunzip -c '$OUT' | docker compose exec -T postgres psql -U pkm -d pkm"
echo "[s1-backup] Keeping backups for 7 days is recommended. (DR-grade verification is an ephemeral restore: D5 track, outside the scope of this safety net.)"
+96
@@ -0,0 +1,96 @@
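"""S1-ADD (plan ds-s1-backend-1) B-2 /duplicates shape + D-2 Range parser + dedup constant unit checks.
Pure unit tests (no DB required). Execution environment = a context with the app/ dependencies installed (devsbx/GPU),
using the same bootstrap as the existing test_s1_dedup_shape.py. Checks that hit the DB (find_canonical/near_dup/endpoints)
run in the GPU read-only/integration matrix (E-1).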
"""
from __future__ import annotations
import json
import logging
import os
import sys
from pathlib import Path
import pytest
# Guard against an import-time FileHandler PermissionError when logs/ is owned by the production daemon (tests only).
_orig_file_handler = logging.FileHandler
def _safe_file_handler(filename, *args, **kwargs):  # type: ignore[no-untyped-def]
    try:
        return _orig_file_handler(filename, *args, **kwargs)
    except PermissionError:
        return logging.NullHandler()
logging.FileHandler = _safe_file_handler  # type: ignore[assignment]
os.environ.setdefault("DATABASE_URL", "postgresql+asyncpg://test:test@localhost:5432/test")
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "app"))
from api.documents import ( # noqa: E402
    DuplicateGroup,
    DuplicatesResponse,
    _parse_byte_range,
)
from services.dedup import DEDUP_OFF_CHANNELS # noqa: E402
_FIXDIR = Path(os.path.expanduser("~/Documents/code/ds-app/contract/fixtures"))
# ── 1. /duplicates response shape = contract fixture ───────────────────────────
def test_duplicates_response_shape_matches_total_formula():
    # Endpoint definition: total_duplicate_docs = Σ(member count - 1). Must equal the fixture.
    groups = [
        DuplicateGroup(canonical_id=4912, members=[4912, 4977], reason="content_hash"),
        DuplicateGroup(canonical_id=5120, members=[5120, 5121, 5260], reason="content_hash"),
    ]
    total_dup = sum(len(g.members) - 1 for g in groups)
    resp = DuplicatesResponse(
        groups=groups, total_groups=len(groups), total_duplicate_docs=total_dup
    )
    assert resp.total_groups == 2
    assert resp.total_duplicate_docs == 3  # (2-1)+(3-1)
@pytest.mark.skipif(not _FIXDIR.exists(), reason="ds-app contract fixtures not present")
def test_duplicates_contract_fixture_decodes():
    payload = json.loads((_FIXDIR / "documents_duplicates.json").read_text(encoding="utf-8"))
    m = DuplicatesResponse.model_validate(payload)
    assert m.total_groups == payload["total_groups"]
    assert m.total_duplicate_docs == payload["total_duplicate_docs"]
    # The Σ(member count - 1) definition matches the fixture total (contract self-consistency).
    assert sum(len(g.members) - 1 for g in m.groups) == payload["total_duplicate_docs"]
    assert m.groups[0].canonical_id == payload["groups"][0]["canonical_id"]
# ── 2. D-2 Range parser (pass-through for remote backends; local is handled automatically by FileResponse) ──
@pytest.mark.parametrize(
    "header,size,expected",
    [
        (None, 1000, (None, None)),
        ("", 1000, (None, None)),
        ("bytes=0-99", 1000, (0, 99)),
        ("bytes=100-", 1000, (100, 999)),  # through the end
        ("bytes=-200", 1000, (800, 999)),  # suffix: last 200 bytes
        ("bytes=0-99999", 1000, (0, 999)),  # end clamp
        ("bytes=2000-3000", 1000, (None, None)),  # start >= size → invalid (whole file)
        ("bytes=abc-def", 1000, (None, None)),  # parse failure
        ("bytes=50-10", 1000, (None, None)),  # start > end
        ("bytes=0-99", 0, (None, None)),  # empty file
    ],
)
def test_parse_byte_range(header, size, expected):
    assert _parse_byte_range(header, size) == expected
# ── 3. dedup OFF-whitelist single source ───────────────────────────────────────
def test_dedup_off_channels_is_law_monitor_only():
    # P0-2 decision: the single OFF-list = law_monitor (preserves amended law revisions). Extend only by deliberate decision.
    assert DEDUP_OFF_CHANNELS == ("law_monitor",)
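The parametrized table in `test_parse_byte_range` pins down the parser's contract, but `_parse_byte_range` itself is not shown in this diff. The sketch below is one way to satisfy that table, not the implementation in `api.documents`. The name `parse_byte_range_sketch`, the rejection of multi-range headers, and the choice to return `(None, None)` for every invalid case (meaning "serve the whole file") are assumptions drawn from the test expectations.

```python
from typing import Optional, Tuple

ByteRange = Tuple[Optional[int], Optional[int]]
_NO_RANGE: ByteRange = (None, None)  # caller falls back to the full body


def parse_byte_range_sketch(header: Optional[str], size: int) -> ByteRange:
    """Parse a single 'bytes=start-end' Range header into inclusive offsets."""
    if not header or size <= 0 or not header.startswith("bytes="):
        return _NO_RANGE
    spec = header[len("bytes="):].strip()
    if "," in spec or "-" not in spec:
        return _NO_RANGE  # assumption: multi-range requests are not supported
    first, _, last = spec.partition("-")
    try:
        if first == "":
            # Suffix form 'bytes=-N': the last N bytes of the file.
            suffix_len = int(last)
            if suffix_len <= 0:
                return _NO_RANGE
            start, end = max(size - suffix_len, 0), size - 1
        else:
            start = int(first)
            end = int(last) if last else size - 1  # open-ended 'bytes=N-'
    except ValueError:
        return _NO_RANGE
    if start < 0 or start >= size or start > end:
        return _NO_RANGE
    return start, min(end, size - 1)  # clamp the end to the last byte
```

Note that the table treats an out-of-bounds start as "no range" rather than an error, so the sketch follows the tests rather than answering with an unsatisfiable-range status.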
+168
@@ -0,0 +1,168 @@
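"""S1-ADD (plan ds-s1-backend-1) A-4: call-shape regression + md_status mapping behavior checks.
What is verified (*behavior*, not values):
1. DB md_status='success' → response 'completed', a one-way mapping (the backend half of the guard against the P0-3 silent-fallback pitfall).
   - partial/pending/failed/skipped/None pass through unchanged (only 'success' is mapped).
2. [S1-ADD] Decoding of the 3 fields (original_filename / duplicate_of / duplicate_count) plus defaults (duplicate_count=0).
3. (If present) The ds-app contract fixtures decode with the response models, i.e. the contract shape is not broken.
Note: these tests cover only the backend serialization half.
Whether the app actually takes the md-first render branch on 'completed' (¬extracted_text) is the responsibility of the S3 fixture-render tests
(group A close = backend green AND S3 render green, owner stated in plan A-4).
Execution environment: a context with the app/ dependencies installed (devsbx/GPU). Pure unit tests (no DB required).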
"""
from __future__ import annotations
import json
import logging
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
import pytest
# Guard against an import-time FileHandler PermissionError when logs/ is owned by the production daemon (root) (tests only).
_orig_file_handler = logging.FileHandler
def _safe_file_handler(filename, *args, **kwargs):  # type: ignore[no-untyped-def]
    try:
        return _orig_file_handler(filename, *args, **kwargs)
    except PermissionError:
        return logging.NullHandler()
logging.FileHandler = _safe_file_handler  # type: ignore[assignment]
# Importing api.documents triggers SQLAlchemy engine init, so set a dummy DATABASE_URL (no real connection is made).
os.environ.setdefault("DATABASE_URL", "postgresql+asyncpg://test:test@localhost:5432/test")
# tests/ → project root → app/
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "app"))
from api.documents import (  # noqa: E402
    DocumentDetailResponse,
    DocumentListResponse,
    DocumentResponse,
)
_NOW = datetime(2026, 6, 4, 8, 0, 0, tzinfo=timezone.utc)
def _base_detail(**overrides) -> dict:
    """A complete dict with every field DocumentDetailResponse requires (required ones included). Use overrides to replace some."""
    d = {
        "id": 4912,
        "file_path": "Engineering/ASME/x.pdf",
        "file_format": "pdf",
        "file_size": 1338920,
        "file_type": "document",
        "title": "x",
        "ai_domain": "Engineering",
        "ai_sub_group": "압력용기",
        "ai_tags": ["ASME"],
        "ai_summary": "요약",
        "document_type": "standard",
        "importance": "high",
        "ai_confidence": 0.9,
        "user_note": None,
        "user_tags": None,
        "pinned": True,
        "ask_includable": True,
        "derived_path": None,
        "original_format": "pdf",
        "conversion_status": "completed",
        "is_read": True,
        "review_status": "approved",
        "edit_url": None,
        "preview_status": "ready",
        "source_channel": "upload",
        "data_origin": "external",
        "doc_purpose": "reference",
        "extracted_at": _NOW,
        "ai_processed_at": _NOW,
        "embedded_at": _NOW,
        "created_at": _NOW,
        "updated_at": _NOW,
        # detail-only fields
        "extracted_text": "원문 폴백 텍스트",
        "md_content": "# 제목\n본문",
        "md_frontmatter": {},
        "md_status": "success",
        "md_extraction_engine": "marker",
        "md_generated_at": _NOW,
    }
    d.update(overrides)
    return d
# ── 1. ★ md_status one-way mapping (success → completed) ───────────────────────
def test_db_success_serializes_as_completed():
    m = DocumentDetailResponse.model_validate(_base_detail(md_status="success"))
    assert m.md_status == "completed", "DB 'success' must map to 'completed' in the response (triggers MD-first rendering)"
    # Also check model_dump (serialization): the actual value the app receives.
    assert m.model_dump()["md_status"] == "completed"
@pytest.mark.parametrize("raw", ["pending", "processing", "partial", "failed", "skipped", None])
def test_non_success_statuses_pass_through(raw):
    m = DocumentDetailResponse.model_validate(_base_detail(md_status=raw))
    assert m.md_status == raw, f"'{raw}' is not a mapping target; it must pass through unchanged"
def test_mapping_is_read_only_not_a_write_path():
    # This model is response-serialization only: a first line of defense that the write (ORM) path never writes 'completed' back to the DB.
    # Even if 'completed' arrives as input (e.g. from a fixture), it stays 'completed' (no re-mapping, idempotent).
    m = DocumentDetailResponse.model_validate(_base_detail(md_status="completed"))
    assert m.md_status == "completed"
# ── 2. [S1-ADD] 3-field decode + defaults ──────────────────────────────────────
def test_s1add_fields_default_on_list_response():
    # DocumentResponse (list row) also has the 3 fields; defaults apply when they are omitted.
    base = {k: v for k, v in _base_detail().items()
            if k not in {"extracted_text", "md_content", "md_frontmatter", "md_status",
                         "md_extraction_engine", "md_generated_at"}}
    m = DocumentResponse.model_validate(base)
    assert m.duplicate_count == 0
    assert m.duplicate_of is None
    assert m.original_filename is None
def test_s1add_fields_roundtrip_values():
    m = DocumentDetailResponse.model_validate(
        _base_detail(original_filename="보고서.docx", duplicate_of=4912, duplicate_count=2)
    )
    assert m.original_filename == "보고서.docx"
    assert m.duplicate_of == 4912
    assert m.duplicate_count == 2
# ── 3. ds-app contract fixtures decode (if present) ────────────────────────────
_FIXDIR = Path(os.path.expanduser("~/Documents/code/ds-app/contract/fixtures"))
@pytest.mark.skipif(not _FIXDIR.exists(), reason="ds-app contract fixtures not present (separate repo); skipping decode regression")
@pytest.mark.parametrize("fname", ["document_detail.json", "document_detail_pending_md.json"])
def test_contract_detail_fixture_decodes(fname):
    payload = json.loads((_FIXDIR / fname).read_text(encoding="utf-8"))
    m = DocumentDetailResponse.model_validate(payload)
    # The fixture's md_status is already in API vocabulary ('completed'/'pending'), so the mapping is idempotent.
    assert m.md_status == payload["md_status"]
    # If the [S1-ADD] fields are present in the fixture, the decoded values must match.
    if "duplicate_count" in payload:
        assert m.duplicate_count == payload["duplicate_count"]
@pytest.mark.skipif(not _FIXDIR.exists(), reason="ds-app contract fixtures not present")
def test_contract_list_fixture_decodes():
    payload = json.loads((_FIXDIR / "documents_list.json").read_text(encoding="utf-8"))
    m = DocumentListResponse.model_validate(payload)
    assert m.total == payload["total"]
    assert len(m.items) == len(payload["items"])
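The tests above describe the serialization contract without showing how the response models implement it. As a plain-Python illustration of that contract (not the code in `api.documents`), the sketch below maps only `'success'` to `'completed'`, passes every other status through, and fills the three S1-ADD defaults. The names `to_api_md_status` and `serialize_row_sketch` are hypothetical, and using a dict instead of the response models is an assumption made to keep the example dependency-free.

```python
from typing import Any, Dict, Optional


def to_api_md_status(db_value: Optional[str]) -> Optional[str]:
    """One-way DB → API vocabulary mapping; only 'success' changes."""
    return "completed" if db_value == "success" else db_value


def serialize_row_sketch(row: Dict[str, Any]) -> Dict[str, Any]:
    """Build the response dict from a DB row without mutating the row."""
    out = dict(row)  # copy, so 'completed' is never written back to the source
    out["md_status"] = to_api_md_status(row.get("md_status"))
    # S1-ADD fields: defaults apply only when the row does not provide them.
    out.setdefault("original_filename", None)
    out.setdefault("duplicate_of", None)
    out.setdefault("duplicate_count", 0)
    return out
```

Because `'completed'` maps to itself, feeding an already-serialized payload back through the helper is idempotent, which is the property `test_mapping_is_read_only_not_a_write_path` checks on the real model.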