Compare commits

...

26 Commits

Author SHA1 Message Date
hyungi da4a2e81c3 feat(study): 이론공부 홈 — 오늘의 개념·진도·회독 SR (Stage S)
개념문서(가스기사 289) 소비 표면 개선 1단계. /study 허브를 데일리 랜딩으로.
- 마이그 381 study_concept_progress (개념 SR, sr_schedule 공용, documents FK 없음=락 회피)
- concept_curriculum 서비스 + /api/study (curriculum·today-concepts·concepts/{id}/read)
- read 상태 정본 = document_reads (is_read 컬럼 아님), mark_read=회독+SR 입고
- 문제풀이 표면 무접촉·additive

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-01 11:11:30 +09:00
hyungi 966a4315c8 feat(shell): 시안B 슬림 아이콘 레일 — 사이드바 접힘=54px 글로벌 네비(숨김 대신) 2026-06-30 06:29:33 +00:00
hyungi 3c42b7b97a feat(book): 공부도구 배선 — 노트/형광펜/암기카드(clause_study) + 책 리더 패널 2026-06-30 06:26:55 +00:00
hyungi 91ce54c1cd chore(paper): OpenAlex 매치율 측정 스크립트(결론=인용보강 부적합) 2026-06-30 06:20:59 +00:00
hyungi 9ec0a184a0 feat(book): /book 몰입 — 글로벌 분류 사이드바 숨김(더블사이드바 해소) 2026-06-30 06:16:28 +00:00
hyungi a22b2c7647 feat(docs): 관련 문서(유사도 KNN) 엔드포인트+패널 + 법령/지침 splitter 2026-06-30 06:10:11 +00:00
hyungi c44692fddc feat(clause-kb): 코드북 리더 r2 — 세이지 코드북 미감(인덱스/세리프/책내검색/양방향 백링크/페이저)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 05:02:35 +00:00
hyungi 7487739aec fix(clause-kb): 절-문서 이미지를 부모 표준 document_images 로 폴백 해소(docimg 404 수정)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 23:38:37 +00:00
hyungi a8d3af2b62 fix(clause-kb): backlinks 엔드포인트 parent_id ORM 미매핑 → raw SQL 조회 (500 수정)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 23:34:18 +00:00
hyungi 51a7c96b56 feat(clause-kb): over-CAP 절 본문 페이지네이션(~11K tok/page) 2026-06-29 23:20:16 +00:00
hyungi eb83d41ba5 feat(clause-kb): 책 API(절 목차/백링크) + /book/[id] 유기적 책 리더 + persist 스크립트 2026-06-29 23:13:34 +00:00
hyungi 62794b3857 feat(search): ASME 절-KB schema 379 + doc_kind retrieval 필터
- migration 379: documents +parent_id/doc_kind/clause_code/clause_part/clause_order + clause_links/document_tags
- _license_sql 에 doc_kind=standard 필터(절-문서 read/nav 전용, 검색 제외; 전 문서 standard=동작보존)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:56:59 +00:00
hyungi 8cdfe6006d feat(search): cloud-egress 게이트를 단건 문서 fetch 로 확장
GET /api/documents/{id} 가 egress=cloud 토큰일 때 search 와 동일한
cloud-eligibility 게이트(egress allowlist 갭2 + license 제한 B-4)를 통과한
문서만 반환. id 직접 fetch 로 비공개/인프라/개인/restricted 문서를 우회
열람하는 경로 차단 — 부적격은 404(존재 비노출). local 토큰=무회귀.

술어는 retrieval_service.cloud_eligible_doc_sql 로 단일화(_axis_sql
cloud_egress + _license_sql 합성) → search retrieval 과 byte-동일 게이트
공유, 경로별 드리프트 방지. MCP fetch_document 툴의 서버사이드 강제.

e2e: cloud 토큰 적격 Eng 200 / 인프라알림·리디북스memo·개인노트 404,
local 토큰 전부 200(무회귀).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 21:52:41 +00:00
hyungi 3fb613916a feat(search): cloud-egress allowlist gate for cloud consumers (gap2)
클라우드 소비자(Claude/MCP)에 cloud-eligibility allowlist 강제 — DS 접근규격 갭2.
- auth: create_access_token egress claim(기본 local·비파괴) + get_egress_class 의존성
- AxisFilter.cloud_egress + _axis_sql allowlist 술어(토큰 claim 유래·쿼리파라미터 아님=우회불가)
- 규칙: external OR (work ∩ bucket∈{Eng,Safety,Law} ∩ ∉{voice,chat,memo} ∩ ≠memo ∩ user_note없음)
검증(cloud vs local): 인프라알림([Hyungi_NAS] tk-*api)·work/Programming(리디북스) 차단,
work/Engineering(hoop stress·ASME) 통과, external 통과. local=전부(무회귀).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 05:19:23 +00:00
hyungi 0c7211e24b feat(search): domain_bucket scope filter on AxisFilter (include/exclude)
검색 retrieval 에 domain_bucket(377) 포함/제외 필터 추가.
- AxisFilter.domain_buckets(= ANY) / exclude_buckets(<> ALL) + active()
- _axis_sql 2절 — 전 leg documents alias(d / chunk df JOIN) 경유, 미지정시 byte-불변(무회귀)
- search.py: domain_bucket / exclude_bucket Query 파라미터(CSV)
검증: exclude_bucket=News → News 0건(금리 10→0·인공지능 15→0·반도체 11→0),
domain_bucket=Safety → Knowledge/Industrial_Safety 드리프트까지 정규화 포함.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 04:35:12 +00:00
hyungi 94b172e314 ops(ci): boot_smoke 스키마 어서션 max_migration 361→378 (현재 마이그 헤드)
지난 감사(361) 이후 마이그가 378(이번 publish_outbox attempts/failed 포함)까지 전진 →
boot_smoke 스키마 게이트의 하드코딩 기대값 갱신. purge/cand/uq 기대는 동일.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:30:53 +09:00
hyungi 9357d1592d fix(publish): 마이그 번호 377→378 (멀티세션 prod 377_domain_bucket 충돌 회피)
검수 fix 작업 중 다른 세션이 prod 에 377_domain_bucket(ee3b347)을 선점·배포 →
publish_outbox attempts/failed 마이그를 378 로 리넘버(브랜치를 ee3b347 로 rebase).
모델 주석도 mig378 로 정정. 내 fix 8건은 새 prod 커밋과 파일 무충돌(번호만 조정).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:23:16 +09:00
hyungi 832ea72784 fix(publish): backfill 스크립트 after_id 페이징 루프 (overflow 누락 방지)
backfill_publish_* 가 단일 호출(after_id=0, limit=PAGE)이라 PAGE 초과분이 누락(경고만)됐다.
docstring 은 이미 페이지 반복을 명시했으나 스크립트가 미구현. 함수 반환을 (count, last_id)로
바꾸고 3 스크립트를 last_id 기반 while 루프로 전량 처리. PAGE=5000 bounded tx.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi d26b1150d8 fix(workers): presegment/csb 이벤트루프 blocking I/O to_thread 오프로드
- presegment_worker: fitz open/get_toc(동기 blocking, live 스테이지)를 to_thread 로 — 거대/손상
  PDF 파싱이 같은 루프의 1분 consumer + FastAPI 요청을 수백 ms~초 정지시키던 것 해소.
- csb_collector: 50MB PDF write_bytes + read_bytes(해시)를 to_thread 로 (R5 동형).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi dcfed09530 fix(workers): marker 200-malformed json transient 분류 + classify summary 가시성
- marker_worker: resp.json() JSONDecodeError(200 응답 truncated/malformed body)가 catch-all
  except 로 _fail(non-retryable) 되던 것을 별도 except 로 raise → queue retry. transient
  연결흔들림이 영구 failed 로 박히는 것 차단.
- classify_worker: ai_summary fallback 비-deferrable 실패를 warning→error 로 격상. ai_summary
  NULL 완료는 digest/briefing 에서 조용히 제외되므로 운영 추적성 보강(best-effort 강등은 유지).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi 7d882352b8 fix(mineru): 변환/워밍 self-timeout + OOM·행 시 엔진 재워밍 escalate
aio_do_parse 에 자체 타임아웃이 없어 vLLM 행 시 _engine_lock 을 영구 점유 → markdown
변환 전체 마비(컨테이너 재시작 전까지). 클라이언트(marker_worker)는 300s 로 포기하나
서버측 inflight 는 자동 취소 안 됨.

- _run_mineru 를 asyncio.wait_for(convert 600s / warmup 1200s)로 감싸 lock 점유 상한.
- 타임아웃·OOM/CUDA 류 실패 시 _warmup_done 리셋 → 다음 요청 재워밍. 재워밍도 실패하면
  _warmup_error → /ready 503 → healthcheck 재시작으로 escalate(영구 degradation 차단).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi 7a8aced2a9 fix(workers): file_watcher 파일별 세션 격리 (사이클 전체 롤백 방지)
스캔 전체(Web+PKM)가 단일 세션·단일 commit 이라 한 파일 예외(rglob↔stat 사이 삭제로
FileNotFoundError, flush 오류 등)가 watch_inbox 전체를 raise·롤백 → 그 사이클 등록분을
모두 잃거나, 결정적 poison 파일이 매 사이클 같은 지점에서 중단시켜 그 뒤 파일 영구 미등록.

파일별 독립 세션+commit + try/continue 격리 (news_collector/csb_collector 동형).
file_hash 는 세션 밖에서 계산(커넥션 미점유), 무변경 파일은 쓰기/commit 없음.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi d50be9f2e7 fix(publish): ingest_study 동시경합을 already_ingested 로 흡수 (500 회피)
같은 client_session_uuid 동시 POST 2건이 최초 멱등체크를 둘 다 통과 → 둘째가
(client_session_uuid, study_topic_id) uq(mig376) 위반으로 IntegrityError → 미처리 500.
데이터는 안전(원자 1-tx 롤백, SR 이중 advance 없음)이나 비우아한 500이 문제.

변이 구간(quiz_session insert ~ commit)을 try/except IntegrityError 로 감싸 승자 결과
재조회 후 already_ingested 반환. uuid 경합이 아닌 진짜 무결성 오류는 재조회 비어 re-raise.
멱등 응답 빌더 _already_ingested 헬퍼 추출(최초 체크 + 경합 흡수 공용).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi b9f9d88d99 fix(publish): publish_outbox poison row head-of-line block 차단
배치 단일 트랜잭션이라 한 행의 예외가 배치 전체(앞 행 processed_at 포함)를 롤백 →
poison 행이 매 사이클 최저 id 로 재선택되어 후속 발행이 영구 정지. outbox 모델에
재시도/terminal 컬럼이 전무(processing_queue·study_jobs 의 per-item 격리 패턴 미적용).

- mig377: publish_outbox 에 attempts/failed_at 추가
- 워커: 행별 savepoint(begin_nested) 격리 — 예외 시 attempts++, MAX(5) 초과 시
  failed_at 스탬프(terminal) 후 select 제외. 실패 행은 rev 미소모(드문 gap 은 단일
  라이터·커밋순 부여라 viewer since-rev 증분 동기에 무해).

study_publish_enabled=false 기본이라 현재 inert, 발행 활성화(P0-1b) 전 선결.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi d030a2b7b0 fix(deploy): fresh-DB/DR 부팅 — postgres initdb.d 마운트 제거
빈 볼륨 첫 기동 시 postgres 가 migrations/*.sql 을 psql autocommit 으로 실행해
스키마는 만들되 schema_migrations 스탬프를 안 남김 → fastapi init_db 가 documents
존재로 'fresh' 오판해 baseline 로드를 건너뛰고 001 부터 재replay → CREATE TABLE
users(IF NOT EXISTS 없음) 충돌 → DR/신규환경 부팅 크래시.

fresh-boot 을 init_db 의 baseline + migration runner 단일 경로로 일원화.
기존 prod 볼륨은 비어있지 않아 init scripts 미발동 = 무영향. 관련 docs 정정.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 13:22:36 +09:00
hyungi ee3b347fa7 feat(search): add domain_bucket rollup column (migration 377)
ai_domain(반자유 AI 분류, 드리프트)을 검색 스코프용 7버킷으로 결정적 롤업하는
STORED generated column. News 86% 기본제외 + 도메인 스코프 검색의 토대.
축: ai_domain(routing) 롤업 — category(UI) 아님.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 04:16:30 +00:00
43 changed files with 2184 additions and 265 deletions
+337 -2
View File
@@ -28,7 +28,7 @@ from sqlalchemy.ext.asyncio import AsyncSession
from starlette.requests import ClientDisconnect
from ai.client import AIClient, _load_prompt, parse_json_response
from core.auth import get_current_user
from core.auth import get_current_user, get_egress_class
from core.config import settings
from core.database import async_session, get_session
from core.utils import file_hash
@@ -742,11 +742,31 @@ async def get_document(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
egress_class: Annotated[str, Depends(get_egress_class)],
):
"""문서 단건 조회. 본문(extracted_text)·canonical markdown 동봉."""
"""문서 단건 조회. 본문(extracted_text)·canonical markdown 동봉.
cloud egress(갭2): egress=cloud 토큰(예: Claude/MCP)은 search 와 동일한 cloud-eligibility
게이트를 통과한 문서만 열람 가능 — id 직접 fetch 로 비공개/인프라/개인/restricted 문서를
우회 열람하는 경로를 차단한다. 부적격은 404(존재 자체 비노출). local 토큰=게이트 미발동(무회귀).
"""
from sqlalchemy import text as sql_text
from services.search.retrieval_service import cloud_eligible_doc_sql
doc = await session.get(Document, doc_id)
if not doc or doc.deleted_at is not None:
raise HTTPException(status_code=404, detail="문서를 찾을 수 없습니다")
if egress_class == "cloud":
eligible = (
await session.execute(
sql_text(
"SELECT 1 FROM documents WHERE id = :doc_id AND deleted_at IS NULL"
+ cloud_eligible_doc_sql("")
).bindparams(doc_id=doc_id)
)
).first()
if eligible is None:
raise HTTPException(status_code=404, detail="문서를 찾을 수 없습니다")
return DocumentDetailResponse.model_validate(doc)
@@ -1028,6 +1048,19 @@ async def get_document_image_raw(
DocumentImage.image_key == image_key,
)
)
if img is None:
# clause-KB: 절-문서는 부모 표준 이미지를 공유(md_content=부모 슬라이스) → parent_id 폴백.
from sqlalchemy import text as sql_text
_par = (await session.execute(
sql_text("SELECT parent_id FROM documents WHERE id = :id").bindparams(id=doc_id)
)).scalar()
if _par is not None:
img = await session.scalar(
select(DocumentImage).where(
DocumentImage.document_id == _par,
DocumentImage.image_key == image_key,
)
)
if img is None:
raise HTTPException(status_code=404, detail="이미지를 찾을 수 없습니다")
@@ -1801,3 +1834,305 @@ async def analyze_document(
error_code=error_code,
source=source,
)
# ─── ASME 절-지식베이스: 유기적 책 네비 (clause-KB, doc_kind='clause' 자식 문서 기반) ───
class ClauseTocItem(BaseModel):
id: int
clause_code: str | None = None
clause_part: str | None = None
clause_order: int | None = None
title: str | None = None
class ClauseBookResponse(BaseModel):
parent_id: int
parent_title: str | None = None
clauses: list[ClauseTocItem]
@router.get("/{doc_id}/clauses", response_model=ClauseBookResponse)
async def get_document_clauses(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""부모 표준 doc 의 절-문서 목차(유기적 책 TOC). doc_kind='clause' 자식을 clause_order 순 반환.
절-문서는 in_corpus=false + doc_kind='clause'(검색 제외)라 일반 목록/검색엔 안 뜨지만,
이 책-내 네비는 부모 표준에서 자식 절로 진입하는 전용 경로다(ASME 2025판=한 권의 책).
"""
from sqlalchemy import text as sql_text
parent = await session.get(Document, doc_id)
if not parent or parent.deleted_at is not None:
raise HTTPException(status_code=404, detail="문서를 찾을 수 없습니다")
rows = (
await session.execute(
sql_text(
"""
SELECT id, clause_code, clause_part, clause_order, title
FROM documents
WHERE parent_id = :pid AND doc_kind = 'clause' AND deleted_at IS NULL
ORDER BY clause_order
"""
).bindparams(pid=doc_id)
)
).mappings().all()
return ClauseBookResponse(
parent_id=doc_id,
parent_title=parent.title,
clauses=[ClauseTocItem(**dict(r)) for r in rows],
)
class BacklinkRef(BaseModel):
code: str
doc_id: int | None = None # 해소된 절-문서(같은 부모) — dangling 이면 None
title: str | None = None
anchor: str | None = None
ctx: str | None = None
class BacklinksResponse(BaseModel):
doc_id: int
clause_code: str | None = None
parent_id: int | None = None
prev: ClauseTocItem | None = None
next: ClauseTocItem | None = None
forward: list[BacklinkRef] # 이 절이 참조하는 절들
back: list[BacklinkRef] # 이 절을 참조하는 절들
@router.get("/{doc_id}/backlinks", response_model=BacklinksResponse)
async def get_document_backlinks(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""절-문서의 양방향 백링크 + 같은 부모 내 이전/다음 절(유기적 책 흐름)."""
from sqlalchemy import text as sql_text
doc = await session.get(Document, doc_id)
if not doc or doc.deleted_at is not None:
raise HTTPException(status_code=404, detail="문서를 찾을 수 없습니다")
_meta = (await session.execute(sql_text(
"SELECT parent_id, clause_code, clause_order FROM documents WHERE id = :id"
).bindparams(id=doc_id))).mappings().first()
_parent_id = _meta["parent_id"] if _meta else None
_clause_code = _meta["clause_code"] if _meta else None
_clause_order = _meta["clause_order"] if _meta else None
forward = (
await session.execute(
sql_text(
"""
SELECT cl.dst_code AS code, cl.dst_doc_id AS doc_id, cl.anchor, cl.ctx, d.title
FROM clause_links cl
LEFT JOIN documents d ON d.id = cl.dst_doc_id
WHERE cl.src_doc_id = :id
ORDER BY cl.char_off NULLS LAST
LIMIT 300
"""
).bindparams(id=doc_id)
)
).mappings().all()
back = (
await session.execute(
sql_text(
"""
SELECT s.clause_code AS code, cl.src_doc_id AS doc_id, s.title, cl.ctx
FROM clause_links cl
JOIN documents s ON s.id = cl.src_doc_id
WHERE cl.dst_doc_id = :id
ORDER BY s.clause_order NULLS LAST
LIMIT 300
"""
).bindparams(id=doc_id)
)
).mappings().all()
prev = nxt = None
if _parent_id is not None and _clause_order is not None:
prow = (
await session.execute(
sql_text(
"""
SELECT id, clause_code, clause_part, clause_order, title FROM documents
WHERE parent_id = :pid AND doc_kind='clause' AND deleted_at IS NULL
AND clause_order < :ord
ORDER BY clause_order DESC LIMIT 1
"""
).bindparams(pid=_parent_id, ord=_clause_order)
)
).mappings().first()
nrow = (
await session.execute(
sql_text(
"""
SELECT id, clause_code, clause_part, clause_order, title FROM documents
WHERE parent_id = :pid AND doc_kind='clause' AND deleted_at IS NULL
AND clause_order > :ord
ORDER BY clause_order ASC LIMIT 1
"""
).bindparams(pid=_parent_id, ord=_clause_order)
)
).mappings().first()
prev = ClauseTocItem(**dict(prow)) if prow else None
nxt = ClauseTocItem(**dict(nrow)) if nrow else None
return BacklinksResponse(
doc_id=doc_id,
clause_code=_clause_code,
parent_id=_parent_id,
prev=prev,
next=nxt,
forward=[BacklinkRef(**dict(r)) for r in forward],
back=[BacklinkRef(**dict(r)) for r in back],
)
# ─── 관련 문서 (유사도, on-demand pgvector KNN — 저부하·무저장) ───
class RelatedItem(BaseModel):
id: int
title: str | None = None
ai_domain: str | None = None
material_type: str | None = None
year: int | None = None
sim: float | None = None
class RelatedResponse(BaseModel):
doc_id: int
related: list[RelatedItem]
@router.get("/{doc_id}/related", response_model=RelatedResponse)
async def get_related_documents(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
limit: int = 8,
same_type: bool = True,
):
"""문서-레벨 임베딩 코사인 최근접 = '관련 문서'. on-demand(저장/배치 없음).
인용그래프가 부적합한 코퍼스(업계 기술기사=인용망 부재)의 대안 연결 레이어.
same_type=true면 같은 material_type 내, false면 전 코퍼스. doc_kind='clause'(절-문서)는 제외.
"""
from sqlalchemy import text as sql_text
lim = max(1, min(limit, 30))
type_clause = "AND d.material_type = src.material_type" if same_type else ""
rows = (
await session.execute(
sql_text(
f"""
WITH src AS (
SELECT embedding, material_type FROM documents WHERE id = :id
)
SELECT d.id, d.title, d.ai_domain, d.material_type, d.facet_year AS year,
round((1 - (d.embedding <=> (SELECT embedding FROM src)))::numeric, 3) AS sim
FROM documents d, src
WHERE d.doc_kind = 'standard' AND d.deleted_at IS NULL
AND d.id <> :id AND d.embedding IS NOT NULL
AND (SELECT embedding FROM src) IS NOT NULL
{type_clause}
ORDER BY d.embedding <=> (SELECT embedding FROM src)
LIMIT :lim
"""
).bindparams(id=doc_id, lim=lim)
)
).mappings().all()
return RelatedResponse(
doc_id=doc_id,
related=[RelatedItem(**{k: r[k] for k in ("id", "title", "ai_domain", "material_type", "year")}, sim=float(r["sim"]) if r["sim"] is not None else None) for r in rows],
)
# ─── 절 공부도구 (노트/형광펜/암기카드) — clause_study ───
class StudyItem(BaseModel):
id: int
kind: str
payload: dict = {}
created_at: datetime | None = None
class StudyListResponse(BaseModel):
doc_id: int
items: list[StudyItem]
class StudyCreate(BaseModel):
kind: str # note | highlight | card
payload: dict = {}
def _parse_payload(p):
import json
if isinstance(p, str):
try:
return json.loads(p)
except Exception:
return {}
return p or {}
@router.get("/{doc_id}/study", response_model=StudyListResponse)
async def list_study(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""절-문서의 공부도구 항목(노트/형광펜/암기카드) 목록."""
from sqlalchemy import text as sql_text
rows = (
await session.execute(
sql_text("SELECT id, kind, payload, created_at FROM clause_study "
"WHERE doc_id = :id ORDER BY created_at DESC").bindparams(id=doc_id)
)
).mappings().all()
return StudyListResponse(
doc_id=doc_id,
items=[StudyItem(id=r["id"], kind=r["kind"], payload=_parse_payload(r["payload"]),
created_at=r["created_at"]) for r in rows],
)
@router.post("/{doc_id}/study", response_model=StudyItem, status_code=201)
async def add_study(
doc_id: int,
body: StudyCreate,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""노트/형광펜/암기카드 1건 추가."""
import json
from sqlalchemy import text as sql_text
if body.kind not in ("note", "highlight", "card"):
raise HTTPException(status_code=400, detail="kind 는 note/highlight/card")
row = (
await session.execute(
sql_text("INSERT INTO clause_study(doc_id, kind, payload) "
"VALUES (:d, :k, cast(:p AS jsonb)) RETURNING id, kind, payload, created_at")
.bindparams(d=doc_id, k=body.kind, p=json.dumps(body.payload, ensure_ascii=False))
)
).mappings().first()
await session.commit()
return StudyItem(id=row["id"], kind=row["kind"], payload=_parse_payload(row["payload"]),
created_at=row["created_at"])
@router.delete("/{doc_id}/study/{study_id}", status_code=204)
async def delete_study(
doc_id: int,
study_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
):
from sqlalchemy import text as sql_text
await session.execute(
sql_text("DELETE FROM clause_study WHERE id = :s AND doc_id = :d")
.bindparams(s=study_id, d=doc_id)
)
await session.commit()
+100 -75
View File
@@ -23,6 +23,7 @@ from datetime import datetime, timezone
from fastapi import APIRouter, Depends, Header, HTTPException
from pydantic import BaseModel
from sqlalchemy import select
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.asyncio import AsyncSession
from core.config import settings
@@ -66,6 +67,22 @@ class IngestBody(BaseModel):
attempts: list[IngestAttempt]
def _already_ingested(rows) -> dict:
"""이미 적재된 세션들의 캐시 요약(멱등 응답). 최초 멱등체크 + 동시경합 흡수 양쪽에서 사용."""
return {
"status": "already_ingested",
"sessions": [
{
"topic_id": s.study_topic_id,
"correct": s.correct_count,
"wrong": s.wrong_count,
"unsure": s.unsure_count,
}
for s in rows
],
}
def _parse_answered_at(s: str | None, now: datetime) -> datetime:
if not s:
return now
@@ -98,18 +115,7 @@ async def ingest_attempts(
)
).scalars().all()
if existing:
return {
"status": "already_ingested",
"sessions": [
{
"topic_id": s.study_topic_id,
"correct": s.correct_count,
"wrong": s.wrong_count,
"unsure": s.unsure_count,
}
for s in existing
],
}
return _already_ingested(existing)
# pub_id → source_id(내부 질문 id) 해소. deleted tombstone 제외.
pub_ids = list({a.question_pub_id for a in body.attempts})
@@ -156,73 +162,92 @@ async def ingest_attempts(
if not by_topic:
raise HTTPException(status_code=404, detail="해소된 attempt 없음")
summaries = []
for topic_id, items in by_topic.items():
qids = [q.id for (_, q) in items]
qs = StudyQuizSession(
user_id=user_id,
study_topic_id=topic_id,
question_ids=qids,
subject_distribution={},
status="done",
cursor=len(qids),
source="viewer",
client_session_uuid=body.client_session_uuid,
finished_at=now,
created_at=now,
updated_at=now,
)
session.add(qs)
await session.flush() # qs.id
try:
summaries = []
for topic_id, items in by_topic.items():
qids = [q.id for (_, q) in items]
qs = StudyQuizSession(
user_id=user_id,
study_topic_id=topic_id,
question_ids=qids,
subject_distribution={},
status="done",
cursor=len(qids),
source="viewer",
client_session_uuid=body.client_session_uuid,
finished_at=now,
created_at=now,
updated_at=now,
)
session.add(qs)
await session.flush() # qs.id
c = w = u = 0
for a, q in items:
try:
sel, is_corr, outcome = derive_outcome(a.selected_choice, a.is_unsure, q.correct_choice)
except ValueError:
skipped.append(a.question_pub_id) # 선택 없고 unsure 아님 = 무효 → skip
continue
if outcome == "correct":
c += 1
elif outcome == "wrong":
w += 1
elif outcome == "unsure":
u += 1
session.add(
StudyQuestionAttempt(
user_id=user_id,
study_question_id=q.id,
study_topic_id=topic_id,
selected_choice=sel,
correct_choice=q.correct_choice,
is_correct=is_corr,
outcome=outcome,
quiz_session_id=qs.id,
answered_at=_parse_answered_at(a.answered_at, now),
c = w = u = 0
for a, q in items:
try:
sel, is_corr, outcome = derive_outcome(a.selected_choice, a.is_unsure, q.correct_choice)
except ValueError:
skipped.append(a.question_pub_id) # 선택 없고 unsure 아님 = 무효 → skip
continue
if outcome == "correct":
c += 1
elif outcome == "wrong":
w += 1
elif outcome == "unsure":
u += 1
session.add(
StudyQuestionAttempt(
user_id=user_id,
study_question_id=q.id,
study_topic_id=topic_id,
selected_choice=sel,
correct_choice=q.correct_choice,
is_correct=is_corr,
outcome=outcome,
quiz_session_id=qs.id,
answered_at=_parse_answered_at(a.answered_at, now),
)
)
qs.correct_count, qs.wrong_count, qs.unsure_count = c, w, u
await session.flush()
# finalize 무수정 재생(progress/SR/pattern + 4-A/4-B enqueue). 그 후 멱등 마커.
summary = await finalize_session(
session, user_id=user_id, study_topic_id=topic_id, quiz_session_id=qs.id
)
qs.finalized_at = now
summaries.append(
{
"topic_id": topic_id,
"quiz_session_id": qs.id,
"correct": summary.correct,
"wrong": summary.wrong,
"unsure": summary.unsure,
"newly_correct": summary.newly_correct,
"relapsed": summary.relapsed,
"recovered": summary.recovered,
}
)
await session.commit()
except IntegrityError:
# 동시 같은 client_session_uuid 경합 — 상대가 먼저 commit → (client_session_uuid,
# study_topic_id) uq(mig376) 위반. 데이터는 안전(원자 1-tx 전체 롤백 → SR 이중 advance
# 없음). 승자 결과로 graceful 수렴(500 대신 already_ingested). uuid 경합이 아닌 진짜
# 무결성 오류면 재조회가 비어 → re-raise 로 표면화.
await session.rollback()
winner = (
await session.execute(
select(StudyQuizSession).where(
StudyQuizSession.client_session_uuid == body.client_session_uuid
)
)
qs.correct_count, qs.wrong_count, qs.unsure_count = c, w, u
await session.flush()
).scalars().all()
if not winner:
raise
logger.info("study_ingest uuid=%s 동시경합 흡수 → already_ingested", body.client_session_uuid)
return _already_ingested(winner)
# finalize 무수정 재생(progress/SR/pattern + 4-A/4-B enqueue). 그 후 멱등 마커.
summary = await finalize_session(
session, user_id=user_id, study_topic_id=topic_id, quiz_session_id=qs.id
)
qs.finalized_at = now
summaries.append(
{
"topic_id": topic_id,
"quiz_session_id": qs.id,
"correct": summary.correct,
"wrong": summary.wrong,
"unsure": summary.unsure,
"newly_correct": summary.newly_correct,
"relapsed": summary.relapsed,
"recovered": summary.recovered,
}
)
await session.commit()
logger.info(
"study_ingest uuid=%s user=%s sessions=%s skipped=%s",
body.client_session_uuid, user_id, len(summaries), len(skipped),
+7 -1
View File
@@ -15,7 +15,7 @@ from fastapi.responses import JSONResponse
from pydantic import BaseModel
from sqlalchemy.ext.asyncio import AsyncSession
from core.auth import get_current_user
from core.auth import get_current_user, get_egress_class
from core.database import get_session
from core.utils import setup_logger
from models.user import User
@@ -139,6 +139,7 @@ def _build_search_debug(pr: PipelineResult) -> SearchDebug:
async def search(
q: str,
user: Annotated[User, Depends(get_current_user)],
egress_class: Annotated[str, Depends(get_egress_class)],
session: Annotated[AsyncSession, Depends(get_session)],
background_tasks: BackgroundTasks,
mode: str = Query("hybrid", pattern="^(fts|trgm|vector|hybrid)$"),
@@ -211,6 +212,8 @@ async def search(
None, description="안전 자료실 C-1: 관할 필터 (KR/US/EU/JP/GB/INT)"),
year_from: int | None = Query(None, ge=1900, le=2100, description="published_date 연도 하한 (NULL=created_at fallback)"),
year_to: int | None = Query(None, ge=1900, le=2100, description="published_date 연도 상한"),
domain_bucket: str | None = Query(None, description="377: domain_bucket 스코프 CSV (Safety,Engineering,Law,Philosophy,Programming,General,News). domain_bucket = ANY"),
exclude_bucket: str | None = Query(None, description="377: domain_bucket 제외 CSV (예: News). 지식질의 시 News 기본제외용"),
facets: bool = Query(False, description="안전 자료실 C-1 후속: top-K 결과 분류 축 분포(material_type/jurisdiction/version_status)를 응답 facets 에 집계. 미지정=계산/노출 0"),
):
"""문서 검색 — FTS + ILIKE + 벡터 결합 (Phase 3.1 이후 run_search wrapper)"""
@@ -221,6 +224,9 @@ async def search(
jurisdiction=jurisdiction,
year_from=year_from,
year_to=year_to,
domain_buckets=[b.strip() for b in domain_bucket.split(",") if b.strip()] if domain_bucket else None,
exclude_buckets=[b.strip() for b in exclude_bucket.split(",") if b.strip()] if exclude_bucket else None,
cloud_egress=(egress_class == "cloud"),
)
pr = await run_search(
session,
+54
View File
@@ -0,0 +1,54 @@
"""study_concepts API — 이론공부 홈(오늘의 개념 · 진도 · 회독 SR). prefix = /api/study.
문제풀이 표면 무접촉. 개념문서(가스기사 태그) 읽기 집계 + 회독 SR write . 단일 토픽(가스기사=4).
경로: GET /curriculum · GET /today-concepts · POST /concepts/{doc_id}/read.
"""
from __future__ import annotations
from typing import Annotated
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession
from core.auth import get_current_user
from core.database import get_session
from models.user import User
from services.study import concept_curriculum as cc
router = APIRouter()
# 가스기사 단일 토픽 운영(현행). 다토픽 확장 시 쿼리 파라미터로 승격.
DEFAULT_TOPIC_ID = 4
@router.get("/curriculum")
async def get_curriculum(
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
topic_id: int = DEFAULT_TOPIC_ID,
):
"""과목별 회독 진도 + 개념/문항 복습 due 요약."""
return await cc.curriculum(session, user.id, topic_id)
@router.get("/today-concepts")
async def get_today_concepts(
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
topic_id: int = DEFAULT_TOPIC_ID,
limit: int = 6,
):
"""오늘 공부할 개념(재복습 → 미독 빈출순)."""
return await cc.today_concepts(session, user.id, topic_id, limit)
@router.post("/concepts/{doc_id}/read")
async def post_concept_read(
doc_id: int,
user: Annotated[User, Depends(get_current_user)],
session: Annotated[AsyncSession, Depends(get_session)],
topic_id: int = DEFAULT_TOPIC_ID,
):
"""개념 회독 처리 → 회독 플래그 + SR 입고/전진."""
return await cc.mark_read(session, user.id, topic_id, doc_id)
+11 -2
View File
@@ -31,11 +31,11 @@ def hash_password(password: str) -> str:
return bcrypt.hashpw(password.encode(), bcrypt.gensalt()).decode()
def create_access_token(subject: str, expires_minutes: int | None = None) -> str:
def create_access_token(subject: str, expires_minutes: int | None = None, egress: str = "local") -> str:
minutes = expires_minutes if expires_minutes is not None else ACCESS_TOKEN_EXPIRE_MINUTES
now = datetime.now(timezone.utc)
expire = now + timedelta(minutes=minutes)
payload = {"sub": subject, "exp": expire, "iat": int(now.timestamp()), "type": "access"}
payload = {"sub": subject, "exp": expire, "iat": int(now.timestamp()), "type": "access", "egress": egress}
return jwt.encode(payload, settings.jwt_secret, algorithm=ALGORITHM)
@@ -100,6 +100,15 @@ def verify_totp(code: str, secret: str | None = None) -> bool:
return totp.verify(code)
async def get_egress_class(
credentials: Annotated[HTTPAuthorizationCredentials, Depends(security)],
) -> str:
"""토큰 egress claim -> 'cloud'|'local' (갭2 cloud-egress allowlist). claim 부재=local
(비파괴; 기존 토큰=신뢰/로컬). 쿼리파라미터 아님 -> 호출자가 없음(우회 차단)."""
payload = decode_token(credentials.credentials)
return (payload or {}).get("egress", "local")
async def get_current_user(
credentials: Annotated[HTTPAuthorizationCredentials, Depends(security)],
session: Annotated[AsyncSession, Depends(get_session)],
+3
View File
@@ -33,6 +33,7 @@ from api.study_sessions import router as study_sessions_router
from api.study_topics import router as study_topics_router
from api.study_reminders import router as study_reminders_router
from api.study_cards import router as study_cards_router
from api.study_concepts import router as study_concepts_router
from api.video import router as video_router
from core.config import settings
from core.database import async_session, engine, init_db
@@ -249,6 +250,8 @@ app.include_router(study_reminders_router, prefix="/api/study-reminders", tags=[
app.include_router(study_cards_router, prefix="/api/study-cards", tags=["study-cards"])
# Phase 1: 학습 진행 상태 (review-complete + review-queue). prefix=/api/study-topics 안에 정의됨.
app.include_router(study_question_progress_router, prefix="/api", tags=["study-progress"])
# 이론공부 홈: 오늘의 개념·진도·회독 SR (개념문서 소비 표면, 문제풀이 무접촉).
app.include_router(study_concepts_router, prefix="/api/study", tags=["study-theory"])
# TODO: Phase 5에서 추가
# app.include_router(tasks.router, prefix="/api/tasks", tags=["tasks"])
+4
View File
@@ -56,5 +56,9 @@ class PublishOutbox(Base):
DateTime(timezone=True), default=datetime.now, nullable=False
)
processed_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
# mig378: 행별 격리 재시도/terminal. attempts=savepoint 실패 누적, failed_at=MAX 초과 terminal
# (set 시 워커 select 에서 제외 → head-of-line block 방지).
attempts: Mapped[int] = mapped_column(SmallInteger, nullable=False, default=0)
failed_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
# 미처리 부분 인덱스 idx(id) WHERE processed_at IS NULL = mig372.
+46
View File
@@ -0,0 +1,46 @@
"""study_concept_progress — 사용자 × 개념문서 단위 간격반복(SR) 진행 (이론공부 홈).
문제 SR(study_question_progress) 개념(이론). '개념문서' = documents (가스기사 태그).
회독( read) 복습 진입, 이후 회독마다 sr_schedule 산술(1·3·7·14·졸업) 공용 전진.
concept_doc_id documents.id 가리키나 FK 미설정 hot 테이블(documents) 회피(clause_study 선례).
"""
from __future__ import annotations
from datetime import datetime
from sqlalchemy import BigInteger, DateTime, ForeignKey, SmallInteger, UniqueConstraint
from sqlalchemy.orm import Mapped, mapped_column
from core.database import Base
class StudyConceptProgress(Base):
__tablename__ = "study_concept_progress"
__table_args__ = (
UniqueConstraint(
"user_id", "concept_doc_id", name="uq_concept_progress_user_doc"
),
)
id: Mapped[int] = mapped_column(BigInteger, primary_key=True)
user_id: Mapped[int] = mapped_column(
BigInteger, ForeignKey("users.id", ondelete="CASCADE"), nullable=False
)
study_topic_id: Mapped[int] = mapped_column(
BigInteger, ForeignKey("study_topics.id", ondelete="CASCADE"), nullable=False
)
# documents.id 참조 — FK 없음(락 회피). 개념문서 삭제 시 고아 행은 read 집계에서 자연 제외.
concept_doc_id: Mapped[int] = mapped_column(BigInteger, nullable=False)
# 복습 큐 (sr_schedule 공용): stage 0~3 = 1·3·7·14일, 4 = 졸업(due_at NULL)
review_stage: Mapped[int | None] = mapped_column(SmallInteger)
due_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
last_read_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
created_at: Mapped[datetime] = mapped_column(
DateTime(timezone=True), default=datetime.now, nullable=False
)
updated_at: Mapped[datetime] = mapped_column(
DateTime(timezone=True), default=datetime.now, onupdate=datetime.now, nullable=False
)
+37 -2
View File
@@ -76,10 +76,15 @@ class AxisFilter:
jurisdiction: str | None = None
year_from: int | None = None
year_to: int | None = None
domain_buckets: list[str] | None = None # 377: domain_bucket = ANY (도메인 스코프)
exclude_buckets: list[str] | None = None # 377: domain_bucket <> ALL (예: News 제외)
cloud_egress: bool = False # 갭2: 클라우드 소비자 cloud-eligibility allowlist 강제(토큰 claim 유래)
def active(self) -> bool:
return bool(self.material_types or self.jurisdiction
or self.year_from is not None or self.year_to is not None)
or self.year_from is not None or self.year_to is not None
or self.domain_buckets or self.exclude_buckets
or self.cloud_egress)
def _axis_sql(alias: str, af: "AxisFilter | None", params: dict) -> str:
@@ -104,6 +109,22 @@ def _axis_sql(alias: str, af: "AxisFilter | None", params: dict) -> str:
if af.year_to is not None:
cl.append(f"COALESCE({p}published_date, {p}created_at::date) <= make_date(:af_yt, 12, 31)")
params["af_yt"] = af.year_to
if af.domain_buckets:
cl.append(f"{p}domain_bucket = ANY(:af_db)")
params["af_db"] = af.domain_buckets
if af.exclude_buckets:
cl.append(f"{p}domain_bucket <> ALL(:af_xdb)")
params["af_xdb"] = af.exclude_buckets
if af.cloud_egress:
# 갭2 클라우드 egress allowlist(default-deny). restricted 는 _license_sql 별도 차단.
cl.append(
f"({p}data_origin = 'external' OR ("
f"{p}data_origin = 'work' "
f"AND {p}domain_bucket IN ('Engineering','Safety','Law') "
f"AND ({p}source_channel IS NULL OR {p}source_channel::text NOT IN ('voice','chat','memo')) "
f"AND {p}category::text IS DISTINCT FROM 'memo' "
f"AND ({p}user_note IS NULL OR {p}user_note = '')))"
)
return " AND " + " AND ".join(cl)
@@ -121,7 +142,21 @@ def _license_sql(alias: str) -> str:
술어 정의 = license_filter.restricted_exclude_sql 공유(digest/briefing/study 풀이와 단일 source).
"""
from services.search.license_filter import restricted_exclude_sql
return " AND " + restricted_exclude_sql(alias)
_p = (alias + ".") if alias else ""
# ASME clause-KB(379): clause docs (doc_kind='clause') = read/nav/backlink only, excluded from retrieval/digest legs.
return " AND " + restricted_exclude_sql(alias) + f" AND {_p}doc_kind = 'standard'"
def cloud_eligible_doc_sql(alias: str = "") -> str:
"""단일 문서가 cloud 소비자(예: Claude/MCP)에게 노출 가능한가 = search retrieval 과
동일한 egress allowlist(갭2) + license 제한(B-4) 결합 술어. fetch_document(cloud)
search byte-동일 게이트를 공유하도록 단일 source([[feedback_structural_integrity_over_path_discipline]]).
cloud_egress·license leg 모두 bind 파라미터 없는 리터럴 술어라 호출측 추가 params 불요.
주의: _license_sql 소유자 단건 다운로드엔 미적용(a안)이지만, cloud 노출은 구매 전자책
verbatim 누출을 막아야 하므로 여기선 항상 적용 = search 동일(local 토큰은 게이트 미발동).
반환 ' AND (egress allowlist) AND (license)' (alias='' = 컬럼 직접 참조). default-deny."""
return _axis_sql(alias, AxisFilter(cloud_egress=True), {}) + _license_sql(alias)
# 2단계 gate (R2-B1) — SQL string interpolation 직전 final allowlist.
+207
View File
@@ -0,0 +1,207 @@
"""concept_curriculum — 이론공부 홈 재료 (오늘의 개념 · 진도 · 회독 SR).
개념문서 = documents (user_tags = @library/{topic}/{과목}/... , 가스기사). is_read = 회독,
md_content 개수 = 빈출 tier(=3 / =2 / else 1). 회독 SR = study_concept_progress
+ sr_schedule(문제 SR 공용 산술). 읽기 전용 집계 + mark_read(회독+SR 입고) write. LLM 0.
문제풀이 표면 무접촉 여기서 읽는 study_question_progress '문항 due 카운트'( 표시용).
"""
from __future__ import annotations
from datetime import datetime, timezone
from sqlalchemy import func, or_, select, text
from sqlalchemy.ext.asyncio import AsyncSession
from models.document_read import DocumentRead
from models.study_concept_progress import StudyConceptProgress
from models.study_question_progress import StudyQuestionProgress
from models.study_topic import StudyTopic
from services.study.sr_schedule import advance, first_due
# 개념 행 조회 — 태그로 개념문서 필터 + 회독 진행 LEFT JOIN. md_content 는 전송 안 하고
# ★ 유무만 서버측 boolean 으로(홈이 자주 호출돼도 페이로드 최소).
# is_read = document_reads(회독 정본, is_read 컬럼 아님) EXISTS. library unread 와 동일 기준.
_CONCEPT_ROWS_SQL = text(
"""
SELECT d.id AS doc_id,
d.title AS title,
EXISTS (
SELECT 1 FROM document_reads r
WHERE r.document_id = d.id AND r.user_id = :uid
) AS is_read,
(d.md_content LIKE '%★★★%') AS f3,
(d.md_content LIKE '%★★%') AS f2,
split_part(replace(d.user_tags::text, '"', ''), '/', 3) AS subject,
p.review_stage AS review_stage,
p.due_at AS due_at,
p.last_read_at AS last_read_at
FROM documents d
LEFT JOIN study_concept_progress p
ON p.concept_doc_id = d.id AND p.user_id = :uid
WHERE d.user_tags::text LIKE :like
AND d.deleted_at IS NULL
"""
)
async def _topic_name(session: AsyncSession, topic_id: int) -> str | None:
return (
await session.execute(select(StudyTopic.name).where(StudyTopic.id == topic_id))
).scalar_one_or_none()
async def _concept_rows(session: AsyncSession, user_id: int, topic_name: str):
like = f"%@library/{topic_name}/%"
return (
await session.execute(_CONCEPT_ROWS_SQL, {"uid": user_id, "like": like})
).mappings().all()
def _freq(row) -> int:
if row["f3"]:
return 3
if row["f2"]:
return 2
return 1
def _is_due(row, now: datetime) -> bool:
return (
row["due_at"] is not None
and row["due_at"] <= now
and (row["review_stage"] or 0) < 4
)
def _item(row) -> dict:
return {
"doc_id": row["doc_id"],
"title": row["title"],
"subject": row["subject"],
"freq": _freq(row),
"review_stage": row["review_stage"],
"due_at": row["due_at"],
}
async def _question_due_count(session: AsyncSession, user_id: int, topic_id: int, now: datetime) -> int:
"""문항 복습 due (기존 study_question_progress 엔진 재사용, 홈 표시용)."""
return (
await session.execute(
select(func.count())
.select_from(StudyQuestionProgress)
.where(
StudyQuestionProgress.user_id == user_id,
StudyQuestionProgress.study_topic_id == topic_id,
StudyQuestionProgress.due_at.is_not(None),
StudyQuestionProgress.due_at <= now,
or_(
StudyQuestionProgress.review_stage.is_(None),
StudyQuestionProgress.review_stage < 4,
),
)
)
).scalar_one()
async def curriculum(session: AsyncSession, user_id: int, topic_id: int) -> dict:
"""과목별 회독 진도 + 개념/문항 복습 due 요약 (진도 대시보드)."""
name = await _topic_name(session, topic_id)
rows = await _concept_rows(session, user_id, name) if name else []
now = datetime.now(timezone.utc)
subj: dict[str, dict] = {}
for r in rows:
s = subj.setdefault(r["subject"], {"subject": r["subject"], "total": 0, "read": 0})
s["total"] += 1
if r["is_read"]:
s["read"] += 1
total = len(rows)
read = sum(1 for r in rows if r["is_read"])
concept_due = sum(1 for r in rows if _is_due(r, now))
question_due = await _question_due_count(session, user_id, topic_id, now)
return {
"topic_id": topic_id,
"topic_name": name,
"subjects": sorted(subj.values(), key=lambda x: x["subject"]),
"total": total,
"read": read,
"concept_due": concept_due,
"question_due": question_due,
}
async def today_concepts(
session: AsyncSession, user_id: int, topic_id: int, limit: int = 6
) -> dict:
"""오늘 공부할 개념 = 재복습(SR due) 먼저 → 미독(빈출 우선). 졸업/재복습대기 제외."""
name = await _topic_name(session, topic_id)
rows = await _concept_rows(session, user_id, name) if name else []
now = datetime.now(timezone.utc)
due = [r for r in rows if _is_due(r, now)]
due.sort(key=lambda r: r["due_at"])
# 미독 & 아직 SR 큐 진입 전(due_at NULL) → 빈출 높은 순
unread = [r for r in rows if not r["is_read"] and r["due_at"] is None]
unread.sort(key=lambda r: (-_freq(r), r["subject"], r["title"]))
picked = [{**_item(r), "reason": "재복습"} for r in due]
picked += [{**_item(r), "reason": "신규"} for r in unread]
return {
"concepts": picked[:limit],
"due_total": len(due),
"unread_total": len(unread),
}
async def mark_read(
session: AsyncSession, user_id: int, topic_id: int, doc_id: int, now: datetime | None = None
) -> dict:
"""개념 회독 처리 = document_reads(+1) + 회독 SR 입고/전진.
회독 정본 = document_reads(append-only), documents.is_read 컬럼 아님(library unread 정합).
회독 first_due(stage 0, 내일). 이후 회독은 'due 도래(due_at<=now)' 때만 correct 전진
(이른 재열람/다중클릭 과전진 방지). stage 4 졸업 후엔 due_at NULL 이라 전진 없음.
"""
now = now or datetime.now(timezone.utc)
# 회독 로그 append (+1) — 사용자 명시 회독. 자동 아님(엔드포인트 = 명시 POST).
session.add(DocumentRead(user_id=user_id, document_id=doc_id, read_at=now))
prog = (
await session.execute(
select(StudyConceptProgress).where(
StudyConceptProgress.user_id == user_id,
StudyConceptProgress.concept_doc_id == doc_id,
)
)
).scalar_one_or_none()
if prog is None:
stage, due = first_due(now)
prog = StudyConceptProgress(
user_id=user_id,
study_topic_id=topic_id,
concept_doc_id=doc_id,
review_stage=stage,
due_at=due,
last_read_at=now,
)
session.add(prog)
else:
# due 도래 시에만 전진 — 미래 due(재열람 이른 클릭)는 stage 불변, last_read_at 만 갱신.
if prog.due_at is not None and prog.due_at <= now:
res = advance(prog.review_stage, "correct", now)
if res is not None:
prog.review_stage, prog.due_at = res
prog.last_read_at = now
await session.commit()
await session.refresh(prog)
return {"ok": True, "review_stage": prog.review_stage, "due_at": prog.due_at}
+12 -12
View File
@@ -68,10 +68,10 @@ async def enqueue_question_publish(session: AsyncSession, q: Any) -> None:
await enqueue_publish(session, kind=KIND_EXPLANATION, source_id=q.id, payload=expl)
async def backfill_publish_questions(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> int:
async def backfill_publish_questions(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> tuple[int, int]:
"""active(미삭제) 문항을 id>after_id 부터 bounded 로 outbox 적재.
반환 = enqueue 문항 (0 이면 ). 셋은 마지막 id 페이지 반복. caller commit.
반환 = (enqueue , 마지막 처리 id). caller ==limit last_id 다음 페이지. caller commit.
"""
rows = (
await session.execute(
@@ -83,7 +83,7 @@ async def backfill_publish_questions(session: AsyncSession, *, after_id: int = 0
).scalars().all()
for q in rows:
await enqueue_question_publish(session, q)
return len(rows)
return len(rows), (rows[-1].id if rows else after_id)
async def enqueue_topic_publish(session: AsyncSession, topic: Any) -> None:
@@ -91,10 +91,10 @@ async def enqueue_topic_publish(session: AsyncSession, topic: Any) -> None:
await enqueue_publish(session, kind=KIND_TOPIC, source_id=topic.id, payload=project_topic(topic))
async def backfill_publish_topics(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> int:
async def backfill_publish_topics(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> tuple[int, int]:
"""active(미삭제) 주제를 id>after_id 부터 bounded 로 outbox 적재(S-1 초기 백필).
반환 = enqueue 주제 (0 이면 ). 셋은 마지막 id 페이지 반복. caller commit.
반환 = (enqueue , 마지막 처리 id). caller ==limit last_id 다음 페이지. caller commit.
멱등 = 발행 워커의 (payload_hash, deleted) 디둡이 no-op 재투영 흡수(중복 enqueue 무해).
"""
rows = (
@@ -107,7 +107,7 @@ async def backfill_publish_topics(session: AsyncSession, *, after_id: int = 0, l
).scalars().all()
for t in rows:
await enqueue_topic_publish(session, t)
return len(rows)
return len(rows), (rows[-1].id if rows else after_id)
async def enqueue_card_publish(session: AsyncSession, card: Any) -> None:
@@ -123,10 +123,10 @@ async def enqueue_card_publish(session: AsyncSession, card: Any) -> None:
await enqueue_publish(session, kind=KIND_CARD, source_id=card.id, payload=project_card(card))
async def backfill_publish_cards(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> int:
async def backfill_publish_cards(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> tuple[int, int]:
"""검수완료(needs_review=False)·미삭제 카드를 id>after_id 부터 bounded 로 outbox 적재(S-2 초기 백필).
반환 = enqueue 카드 (0 이면 ). 멱등 = 워커 (payload_hash, deleted) 디둡. caller commit.
반환 = (enqueue , 마지막 처리 id). caller ==limit last_id 다음 페이지. 멱등 = 워커 디둡. caller commit.
"""
rows = (
await session.execute(
@@ -142,7 +142,7 @@ async def backfill_publish_cards(session: AsyncSession, *, after_id: int = 0, li
).scalars().all()
for c in rows:
await enqueue_card_publish(session, c)
return len(rows)
return len(rows), (rows[-1].id if rows else after_id)
async def enqueue_card_progress_publish(session: AsyncSession, progress: Any) -> None:
@@ -155,11 +155,11 @@ async def enqueue_card_progress_publish(session: AsyncSession, progress: Any) ->
)
async def backfill_publish_card_progress(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> int:
async def backfill_publish_card_progress(session: AsyncSession, *, after_id: int = 0, limit: int = 200) -> tuple[int, int]:
"""모든 card progress row 를 id>after_id 부터 bounded 로 outbox 적재(S-4 초기 백필).
필터 없음 = ALL row(due_at NULL sentinel·terminal 포함) due-only 백필은 sentinel 누락.
반환 = enqueue row (0 ). 멱등 = 워커 디둡. caller commit.
반환 = (enqueue , 마지막 처리 id). caller ==limit last_id 다음 페이지. 멱등 = 워커 디둡. caller commit.
"""
rows = (
await session.execute(
@@ -171,4 +171,4 @@ async def backfill_publish_card_progress(session: AsyncSession, *, after_id: int
).scalars().all()
for p in rows:
await enqueue_card_progress_publish(session, p)
return len(rows)
return len(rows), (rows[-1].id if rows else after_id)
+3 -1
View File
@@ -608,7 +608,9 @@ async def process(
except Exception as exc:
if legacy_cfg is not None and is_deferrable_error(exc):
raise StageDeferred(f"macbook_unavailable:{type(exc).__name__}") from exc
logger.warning(f"[summary-fallback] id={document_id}: {exc}")
# ai_summary=NULL 로 완료되면 digest/briefing 이 조용히 제외 → ERROR 로 가시화
# (best-effort 강등 자체는 유지, 운영 추적성만 보강).
logger.error(f"[summary-fallback] id={document_id} ai_summary 미생성: {exc}")
finally:
await client.close()
+5 -2
View File
@@ -140,7 +140,8 @@ async def _download_pdf(url: str, dest: Path) -> int:
if len(resp.content) > _MAX_PDF_BYTES:
raise FeedError(f"PDF 크기 초과 ({len(resp.content)} bytes): {url}")
dest.parent.mkdir(parents=True, exist_ok=True)
dest.write_bytes(resp.content)
# 최대 50MB PDF write 는 동기 blocking — 이벤트루프 점유 회피 to_thread (R5 동형).
await asyncio.to_thread(dest.write_bytes, resp.content)
return len(resp.content)
@@ -190,9 +191,11 @@ async def _ingest_pdf(session, page_slug: str, pdf_url: str) -> bool:
dest = Path(settings.nas_mount_path) / rel_path
size = await _download_pdf(pdf_url, dest)
# 50MB PDF read + sha256 는 동기 blocking(I/O+CPU) — 이벤트루프 점유 회피 to_thread (R5 동형).
file_hash = await asyncio.to_thread(lambda: hashlib.sha256(dest.read_bytes()).hexdigest())
doc = Document(
file_path=rel_path,
file_hash=hashlib.sha256(dest.read_bytes()).hexdigest(),
file_hash=file_hash,
file_format="pdf",
file_size=size,
file_type="immutable",
+96 -82
View File
@@ -251,104 +251,118 @@ async def watch_inbox():
for extra_path in settings.additional_watch_targets:
targets.append((extra_path, "library"))
async with async_session() as session:
# ─── Web/ 트랙 (devonagent) — DEVONthink Smart Rule 이 떨군 .html 만 진입 ───
if web_root.exists():
# rglob NFS 디렉토리 walk(blocking stat 다발)를 off-thread 로 수집 (R5).
for file_path in await asyncio.to_thread(lambda: list(web_root.rglob("*.html"))):
if not file_path.is_file() or should_skip(file_path):
continue
rel_path = str(file_path.relative_to(nas_root))
added, _ = await _ingest_web_file(session, file_path, rel_path)
# 파일별 독립 세션+commit 으로 격리 — 한 파일 실패(예: rglob↔stat 사이 삭제로 FileNotFoundError,
# flush 오류)가 watch_inbox 전체를 raise·롤백해 그 사이클 등록분을 모두 잃거나, 결정적 poison
# 파일이 매 사이클 같은 지점에서 중단시키는 것을 차단 (news_collector/csb_collector 와 동형).
# ─── Web/ 트랙 (devonagent) — DEVONthink Smart Rule 이 떨군 .html 만 진입 ───
if web_root.exists():
# rglob NFS 디렉토리 walk(blocking stat 다발)를 off-thread 로 수집 (R5).
for file_path in await asyncio.to_thread(lambda: list(web_root.rglob("*.html"))):
if not file_path.is_file() or should_skip(file_path):
continue
rel_path = str(file_path.relative_to(nas_root))
try:
async with async_session() as session:
added, _ = await _ingest_web_file(session, file_path, rel_path)
await session.commit()
new_count += added
# ─── PKM 트랙 (기존 drive_sync) ─────────────────────────────────────────
for sub, expected_category in targets:
scan_root = pkm_root / sub
if not scan_root.exists():
except Exception as e:
logger.warning("[Web] 파일 처리 실패 skip path=%s: %s", rel_path, e)
continue
# 안전 자료실 A-2/B-4 — 타깃 폴더 기반 (material, jurisdiction, license)
target_mt, target_jur, target_license = _TARGET_AXIS.get(
Path(sub).name, (None, None, None)
# ─── PKM 트랙 (기존 drive_sync) ─────────────────────────────────────────
for sub, expected_category in targets:
scan_root = pkm_root / sub
if not scan_root.exists():
continue
# 안전 자료실 A-2/B-4 — 타깃 폴더 기반 (material, jurisdiction, license)
target_mt, target_jur, target_license = _TARGET_AXIS.get(
Path(sub).name, (None, None, None)
)
# NFS 디렉토리 walk(blocking) off-thread 수집 (R5).
for file_path in await asyncio.to_thread(lambda: list(scan_root.rglob("*"))):
if not file_path.is_file() or should_skip(file_path):
continue
category, needs_conversion, next_stage = _route_media(
file_path, expected_category
)
# NFS 디렉토리 walk(blocking) off-thread 수집 (R5).
for file_path in await asyncio.to_thread(lambda: list(scan_root.rglob("*"))):
if not file_path.is_file() or should_skip(file_path):
continue
# audio/video 폴더에 엉뚱한 확장자가 들어왔거나 Inbox 에
# audio/video 가 잘못 떨어진 경우 — 이 라운드에서 아예 skip
if category is None and next_stage is None:
continue
category, needs_conversion, next_stage = _route_media(
file_path, expected_category
)
# audio/video 폴더에 엉뚱한 확장자가 들어왔거나 Inbox 에
# audio/video 가 잘못 떨어진 경우 — 이 라운드에서 아예 skip
if category is None and next_stage is None:
continue
rel_path = str(file_path.relative_to(nas_root))
rel_path = str(file_path.relative_to(nas_root))
try:
# GB 파일 SHA-256 은 이벤트 루프를 점유 → 같은 루프의 모든 1분 주기 consumer
# + FastAPI 요청이 수십초~분 동시 정지. to_thread 오프로드. 스캔 루프가 이미
# 순차라 file_hash 는 한 번에 하나만 실행(직렬화) — 병렬 해싱 X = NFS 2.5GbE
# 대역폭·버퍼 메모리 blowup 방지 (R5).
# 대역폭·버퍼 메모리 blowup 방지 (R5). 세션 밖에서 계산(커넥션 미점유).
fhash = await asyncio.to_thread(file_hash, file_path)
result = await session.execute(
select(Document).where(Document.file_path == rel_path)
)
existing = result.scalar_one_or_none()
if existing is None:
ext = file_path.suffix.lstrip(".").lower() or "unknown"
doc = Document(
file_path=rel_path,
file_hash=fhash,
file_format=ext,
file_size=file_path.stat().st_size,
file_type="immutable",
title=file_path.stem,
source_channel="drive_sync",
category=category,
needs_conversion=needs_conversion,
# 안전 자료실 A-2/B-4 — watch 타깃 매핑 (KGS=law/KR 등, 비대상=NULL)
material_type=target_mt,
jurisdiction=target_jur,
async with async_session() as session:
result = await session.execute(
select(Document).where(Document.file_path == rel_path)
)
# B-4 — 타깃 폴더 license 주입(restricted 포함, 비대상=미주입). classify 는
# material_type IS NULL 일 때만 제안 + extract_meta 미기록이라 주입 보존.
if target_license:
doc.extract_meta = {"license": dict(target_license)}
session.add(doc)
await session.flush()
existing = result.scalar_one_or_none()
if next_stage:
await enqueue_stage(session, doc.id, next_stage)
new_count += 1
if existing is None:
ext = file_path.suffix.lstrip(".").lower() or "unknown"
doc = Document(
file_path=rel_path,
file_hash=fhash,
file_format=ext,
file_size=file_path.stat().st_size,
file_type="immutable",
title=file_path.stem,
source_channel="drive_sync",
category=category,
needs_conversion=needs_conversion,
# 안전 자료실 A-2/B-4 — watch 타깃 매핑 (KGS=law/KR 등, 비대상=NULL)
material_type=target_mt,
jurisdiction=target_jur,
)
# B-4 — 타깃 폴더 license 주입(restricted 포함, 비대상=미주입). classify 는
# material_type IS NULL 일 때만 제안 + extract_meta 미기록이라 주입 보존.
if target_license:
doc.extract_meta = {"license": dict(target_license)}
session.add(doc)
await session.flush()
elif existing.file_hash != fhash:
existing.file_hash = fhash
existing.file_size = file_path.stat().st_size
# 기존 문서에 category/quarantine flag 가 비어있으면 보정
if existing.category is None and category is not None:
existing.category = category
if needs_conversion and not getattr(existing, "needs_conversion", False):
existing.needs_conversion = True
# B-4 — 축/license 보정(B-4 이전 적재분이 재변경 시): material 미설정 시 주입,
# license 부재 시에만 merge 주입(clobber 회피 — 기존 extract_meta 키 보존).
if existing.material_type is None and target_mt is not None:
existing.material_type = target_mt
existing.jurisdiction = target_jur
if target_license and not (existing.extract_meta or {}).get("license"):
meta = dict(existing.extract_meta or {})
meta["license"] = dict(target_license)
existing.extract_meta = meta
if next_stage:
await enqueue_stage(session, doc.id, next_stage)
await session.commit()
new_count += 1
if next_stage:
await enqueue_stage(session, existing.id, next_stage)
changed_count += 1
elif existing.file_hash != fhash:
existing.file_hash = fhash
existing.file_size = file_path.stat().st_size
# 기존 문서에 category/quarantine flag 가 비어있으면 보정
if existing.category is None and category is not None:
existing.category = category
if needs_conversion and not getattr(existing, "needs_conversion", False):
existing.needs_conversion = True
# B-4 — 축/license 보정(B-4 이전 적재분이 재변경 시): material 미설정 시 주입,
# license 부재 시에만 merge 주입(clobber 회피 — 기존 extract_meta 키 보존).
if existing.material_type is None and target_mt is not None:
existing.material_type = target_mt
existing.jurisdiction = target_jur
if target_license and not (existing.extract_meta or {}).get("license"):
meta = dict(existing.extract_meta or {})
meta["license"] = dict(target_license)
existing.extract_meta = meta
await session.commit()
if next_stage:
await enqueue_stage(session, existing.id, next_stage)
await session.commit()
changed_count += 1
# else: 무변경 → 쓰기 없음 (세션 자동 닫힘, commit 불요)
except Exception as e:
logger.warning("[PKM] 파일 처리 실패 skip path=%s: %s", rel_path, e)
continue
if new_count or changed_count:
logger.info(f"[Inbox+§3] 새 파일 {new_count}건, 변경 파일 {changed_count}건 등록")
+5
View File
@@ -300,6 +300,11 @@ async def _process_single(
f"[marker] transient error id={document_id} kind={type(exc).__name__}: {exc}"
)
raise
except json.JSONDecodeError as exc:
# 200 응답의 truncated/malformed body — 연결 흔들림 등 transient. _fail(non-retryable)
# 로 박지 말고 raise → queue retry (max_attempts 까지). 진짜 손상이면 재시도 후 failed.
logger.warning(f"[marker] malformed json body (200) id={document_id}: {exc}")
raise
except Exception as exc:
logger.exception(f"[marker] unexpected error id={document_id}: {exc}")
await _fail(session, document_id, str(exc)[:1000])
+9 -3
View File
@@ -497,12 +497,18 @@ async def process(document_id: int, session: AsyncSession) -> None:
logger.warning(f"[presegment] id={document_id} file not found ({raw}) → extract")
return
import asyncio
import fitz # PyMuPDF — extract_worker/marker_worker 와 동일 의존
def _read_toc(path: str):
# fitz open/get_toc 는 동기 blocking — live 스테이지라 이벤트루프(같은 루프의 1분 consumer +
# FastAPI 요청) 점유 회피 위해 to_thread 오프로드(거대/손상 PDF 파싱 수백 ms~초).
with fitz.open(path) as pdf:
return pdf.page_count, (pdf.get_toc(simple=True) or [])
try:
with fitz.open(str(source)) as pdf:
page_count = pdf.page_count
toc = pdf.get_toc(simple=True) or []
page_count, toc = await asyncio.to_thread(_read_toc, str(source))
except Exception as exc:
# PDF 손상 등 — 분할 불가. 단일문서로 통과(extract 가 PyMuPDF/OCR 로 재시도하며 가시화).
logger.warning(
+79 -46
View File
@@ -28,6 +28,8 @@ logger = setup_logger("study_publish_worker")
BATCH_SIZE = 500
# pg_advisory_xact_lock 전역 단일 라이터 키(발행 워커 전용 임의 상수, 타 advisory 락과 비충돌).
ADVISORY_LOCK_KEY = 838201
# 행별 격리 재시도 상한 — 초과 시 failed_at 스탬프(terminal)로 select 에서 제외.
MAX_OUTBOX_ATTEMPTS = 5
async def consume_publish_outbox() -> None:
@@ -46,11 +48,15 @@ async def consume_publish_outbox() -> None:
max_rev = int(
(await session.execute(select(func.coalesce(func.max(Published.rev), 0)))).scalar() or 0
)
# 3) 미처리 outbox 를 커밋순(id)으로.
# 3) 미처리 outbox 를 커밋순(id)으로. failed_at(terminal) 은 제외 — poison 행이
# head-of-line 을 영구 점유하지 않게 함.
rows = (
await session.execute(
select(PublishOutbox)
.where(PublishOutbox.processed_at.is_(None))
.where(
PublishOutbox.processed_at.is_(None),
PublishOutbox.failed_at.is_(None),
)
.order_by(PublishOutbox.id.asc())
.limit(BATCH_SIZE)
)
@@ -60,59 +66,86 @@ async def consume_publish_outbox() -> None:
now = datetime.now(timezone.utc)
published_count = 0
failed_count = 0
for ob in rows:
existing = (
await session.execute(
select(Published).where(
Published.kind == ob.kind,
Published.source_id == ob.source_id,
)
)
).scalar_one_or_none()
try:
# 행 단위 savepoint 격리 — 한 행의 예외가 배치 전체(앞 행 processed_at 포함)를
# 롤백해 poison 행이 다음 사이클에 다시 최저 id 로 선택되는 무한 재선택을 차단.
async with session.begin_nested():
existing = (
await session.execute(
select(Published).where(
Published.kind == ob.kind,
Published.source_id == ob.source_id,
)
)
).scalar_one_or_none()
# (payload_hash, deleted) 디둡 — no-op 재투영은 rev 안 올림.
if (
existing is not None
and existing.payload_hash == ob.payload_hash
and existing.deleted == ob.deleted
):
ob.processed_at = now
# (payload_hash, deleted) 디둡 — no-op 재투영은 rev 안 올림.
is_noop = (
existing is not None
and existing.payload_hash == ob.payload_hash
and existing.deleted == ob.deleted
)
if is_noop:
ob.processed_at = now
else:
new_rev = max_rev + 1
if existing is None:
session.add(
Published(
kind=ob.kind,
source_id=ob.source_id,
pub_id=uuid.uuid4().hex,
payload=ob.payload,
payload_hash=ob.payload_hash,
schema_version=ob.schema_version,
rev=new_rev,
deleted=ob.deleted,
created_at=now,
updated_at=now,
)
)
else:
existing.payload = ob.payload
existing.payload_hash = ob.payload_hash
existing.schema_version = ob.schema_version
existing.deleted = ob.deleted
existing.rev = new_rev
existing.updated_at = now
ob.processed_at = now
# 배치 내 동일 (kind, source_id) 후속 행이 직전 반영을 보도록 flush(최신 승).
await session.flush()
except Exception as row_err:
# savepoint 롤백 = 이 행의 쓰기(processed_at 포함) 취소. attempts/failed_at 만
# 바깥 트랜잭션에 누적돼 최종 commit 으로 영속(영구 재선택 방지).
ob.attempts = (ob.attempts or 0) + 1
if ob.attempts >= MAX_OUTBOX_ATTEMPTS:
ob.failed_at = now
failed_count += 1
logger.error(
"publish_outbox_row_terminal id=%s kind=%s source_id=%s attempts=%s: %s",
ob.id, ob.kind, ob.source_id, ob.attempts, row_err,
)
else:
logger.warning(
"publish_outbox_row_retry id=%s kind=%s source_id=%s attempts=%s: %s",
ob.id, ob.kind, ob.source_id, ob.attempts, row_err,
)
continue
max_rev += 1
if existing is None:
session.add(
Published(
kind=ob.kind,
source_id=ob.source_id,
pub_id=uuid.uuid4().hex,
payload=ob.payload,
payload_hash=ob.payload_hash,
schema_version=ob.schema_version,
rev=max_rev,
deleted=ob.deleted,
created_at=now,
updated_at=now,
)
)
else:
existing.payload = ob.payload
existing.payload_hash = ob.payload_hash
existing.schema_version = ob.schema_version
existing.deleted = ob.deleted
existing.rev = max_rev
existing.updated_at = now
ob.processed_at = now
# 배치 내 동일 (kind, source_id) 후속 행이 직전 반영을 보도록 flush(최신 승).
await session.flush()
published_count += 1
# savepoint 커밋 성공 시에만 rev 카운터 전진(실패 행은 rev 미소모 → 드물게 gap,
# 단일 라이터·커밋순 부여라 viewer since-rev 증분 동기 정합엔 무해).
if not is_noop:
max_rev = new_rev
published_count += 1
await session.commit()
logger.info(
"publish_outbox_drained scanned=%s published=%s max_rev=%s",
"publish_outbox_drained scanned=%s published=%s failed=%s max_rev=%s",
len(rows),
published_count,
failed_count,
max_rev,
)
except Exception as e:
+7 -1
View File
@@ -3,7 +3,13 @@ services:
image: pgvector/pgvector:pg16
volumes:
- pgdata:/var/lib/postgresql/data
- ./migrations:/docker-entrypoint-initdb.d
# ★ 2026-06-29 fresh-DB/DR 부팅 fix: initdb.d 마운트 제거(기존 `./migrations:/docker-entrypoint-initdb.d`).
# 빈 볼륨 첫 기동 시 postgres 엔트리포인트가 migrations/*.sql(001~) 을 psql autocommit 으로 실행해
# 스키마는 만들되 schema_migrations 스탬프는 안 남김(runner 만 생성) → fastapi init_db 가 documents
# 존재로 'fresh' 를 오판해 baseline(_load_baseline_if_fresh) 로드를 건너뛰고, 빈 schema_migrations
# 로 001 부터 재replay → `CREATE TABLE users`(IF NOT EXISTS 없음) 충돌 → 부팅 크래시(DR/신규환경).
# fresh-boot 은 init_db 의 baseline 적재 + migration runner 단일 경로로 일원화(설계 의도). 기존 prod
# 볼륨은 비어있지 않아 init scripts 가 애초에 미발동 → 무영향.
environment:
POSTGRES_DB: pkm
POSTGRES_USER: pkm
+1 -1
View File
@@ -1094,7 +1094,7 @@ services:
image: pgvector/pgvector:pg16
volumes:
- pgdata:/var/lib/postgresql/data
- ./migrations:/docker-entrypoint-initdb.d
# initdb.d 마운트 제거(2026-06-29): fresh-boot 은 fastapi init_db+baseline 단일 경로.
environment:
POSTGRES_DB: pkm
POSTGRES_USER: pkm
+2 -2
View File
@@ -71,7 +71,7 @@ GPU 서버의 NFS mount (`/proc/mounts` 실측):
| 컨테이너 | 마운트 | 모드 | 비고 |
|---|---|---|---|
| postgres | `pgdata:/var/lib/postgresql/data` + `./migrations:/docker-entrypoint-initdb.d` | rw | DB 본체 named volume |
| postgres | `pgdata:/var/lib/postgresql/data` | rw | DB 본체 named volume (initdb.d 마운트는 2026-06-29 제거 — 아래 관찰) |
| kordoc-service | `${NAS}/Document_Server:/documents` | **ro** | PDF/HWP parse |
| ocr-service | `${NAS}/Document_Server:/documents` + `ocr_models:/root/.cache` | **ro** + rw | |
| marker-service | `${NAS}/Document_Server:/documents` + `marker_models:/models` | **ro** + rw | PDF→markdown |
@@ -84,7 +84,7 @@ GPU 서버의 NFS mount (`/proc/mounts` 실측):
**관찰**:
- worker 컨테이너 (kordoc/ocr/marker/stt) 는 모두 NAS **read-only** 마운트 → 원본 안전.
- fastapi 만 NAS **rw** → 업로드/preview/extracted_images 쓰기 단일 책임.
- `./migrations` 이 postgres 의 `docker-entrypoint-initdb.d` fastapi 의 `/app/migrations` 양쪽에 마운트. 단 실제 migration runner 는 fastapi `init_db()` 만 사용 (postgres init scripts 는 첫 생성 시만 실행 → 효과 X, 안전).
- `./migrations` fastapi 의 `/app/migrations` 마운트. migration runner 는 fastapi `init_db()` 단일 경로. (~2026-06-29: postgres `docker-entrypoint-initdb.d` 마운트 제거. 기존엔 "첫 생성 시만 실행 → 효과 X" 로 봤으나, 빈 볼륨 첫 기동 시 postgres 가 migrations/*.sql 을 실제 실행해 스키마는 만들되 schema_migrations 스탬프를 안 남겨 → init_db 의 baseline fresh 판정을 깨고 부팅 크래시 유발. fresh-DB/DR 부팅을 init_db+baseline 단일 경로로 일원화.)
## 정책 정리
@@ -0,0 +1,45 @@
<script>
// 관련 문서 (유사도) — 문서 레벨 임베딩 KNN. 자기완결: docId 받아 /related 조회.
import { onMount } from 'svelte';
import { api } from '$lib/api';
let { documentId } = $props();
let items = $state([]);
let loaded = $state(false);
const KIND = { law: '법령', guide: '지침', paper: '논문', standard: '표준', incident: '사례' };
onMount(async () => {
try {
const r = await api(`/documents/${documentId}/related?limit=6`);
items = r?.related ?? [];
} catch (e) { /* silent */ }
finally { loaded = true; }
});
</script>
{#if items.length}
<div class="rel">
<div class="lab">관련 문서</div>
{#each items as it (it.id)}
<a class="ri" href={`/documents/${it.id}`}>
<span class="rt">{it.title}</span>
<span class="rm">
{#if it.material_type && KIND[it.material_type]}<span class="kind">{KIND[it.material_type]}</span>{/if}
<span class="rs">{Math.round((it.sim ?? 0) * 100)}</span>
</span>
</a>
{/each}
</div>
{/if}
<style>
.rel { background: var(--surface); border: 1px solid var(--border); border-radius: 14px; padding: 13px; }
.lab { font-size: 10.5px; font-weight: 700; color: var(--text-dim); letter-spacing: .4px; margin-bottom: 8px; }
.ri { display: flex; align-items: baseline; gap: 8px; padding: 5px 6px; border-radius: 7px; text-decoration: none; }
.ri:hover { background: var(--surface-hover, #ecf0e8); }
.rt { flex: 1; font-size: 12px; line-height: 1.4; color: var(--text); overflow: hidden; display: -webkit-box; -webkit-line-clamp: 2; -webkit-box-orient: vertical; }
.rm { flex-shrink: 0; display: flex; align-items: center; gap: 5px; }
.kind { font-size: 9px; font-weight: 700; color: var(--accent-hover, #3d7256); background: #e3efe2; border: 1px solid #cfe3cd; border-radius: 4px; padding: 0 4px; }
.rs { font-size: 10.5px; font-family: ui-monospace, Menlo, monospace; color: var(--faint, #9aa090); }
</style>
@@ -0,0 +1,33 @@
<script>
// 시안 B — 글로벌 네비 슬림 아이콘 레일 (분류 사이드바 접힘 상태). 앱 토큰 사용.
import { page } from '$app/stores';
import { Home, FolderTree, Newspaper, StickyNote, Hash, GraduationCap, MessageCircle, Inbox, CalendarCheck } from 'lucide-svelte';
const items = [
{ href: '/', icon: Home, label: '홈', exact: true },
{ href: '/library', icon: FolderTree, label: '문서' },
{ href: '/news', icon: Newspaper, label: '뉴스' },
{ href: '/memos', icon: StickyNote, label: '메모' },
{ href: '/clause', icon: Hash, label: '절' },
{ href: '/events', icon: CalendarCheck, label: '일정' },
{ href: '/study', icon: GraduationCap, label: '공부' },
{ href: '/chat', icon: MessageCircle, label: '이드' },
{ href: '/inbox', icon: Inbox, label: '편지함' },
];
let path = $derived($page.url.pathname);
const active = (it) => (it.exact ? path === it.href : path.startsWith(it.href));
</script>
<nav class="flex flex-col items-center gap-1 py-2 h-full overflow-y-auto bg-sidebar">
{#each items as it (it.href)}
{@const Icon = it.icon}
<a
href={it.href}
title={it.label}
class="flex flex-col items-center justify-center gap-0.5 w-12 h-[46px] rounded-lg text-dim hover:bg-surface-hover hover:text-accent transition-colors {active(it) ? 'bg-surface-active text-accent font-semibold' : ''}"
>
<Icon size={17} strokeWidth={1.75} />
<span class="text-[8.5px] leading-none tracking-tight">{it.label}</span>
</a>
{/each}
</nav>
+4 -3
View File
@@ -11,6 +11,7 @@
import { queueOverview } from '$lib/stores/queueOverview';
import { MACHINE_STATE_LABEL, machineChipClass } from '$lib/utils/queueDisplay';
import Sidebar from '$lib/components/Sidebar.svelte';
import SlimRail from '$lib/components/SlimRail.svelte';
import SystemStatusDot from '$lib/components/SystemStatusDot.svelte';
import QueueDrawer from '$lib/components/QueueDrawer.svelte';
import QuickMemoButton from '$lib/components/QuickMemoButton.svelte';
@@ -21,7 +22,7 @@
const PUBLIC_PATHS = ['/login', '/setup', '/__styleguide'];
const NO_CHROME_PATHS = ['/login', '/setup', '/__styleguide'];
// /news = 풀스크린 브리핑 → 데스크탑 상시 사이드바 없음
const NO_SIDEBAR_PATHS = ['/news'];
const NO_SIDEBAR_PATHS = ['/news', '/book']; // /book = 책 몰입(글로벌 분류 트리 숨김, 상단 네비 유지)
// toast 의미 토큰 매핑 (A-8 B3)
const TOAST_CLASS = {
@@ -198,8 +199,8 @@
<!-- 메인: 데스크탑 상시 사이드바 + 콘텐츠 -->
<div class="flex-1 min-h-0 flex">
{#if showSidebar}
<aside class="hidden lg:block shrink-0 overflow-hidden transition-[width] duration-200 ease-out {sidebarCollapsed ? 'w-0 border-r-0' : 'w-sidebar border-r border-default'}">
<Sidebar />
<aside class="hidden lg:block shrink-0 overflow-hidden transition-[width] duration-200 ease-out {sidebarCollapsed ? 'w-14 border-r border-default' : 'w-sidebar border-r border-default'}">
{#if sidebarCollapsed}<SlimRail />{:else}<Sidebar />{/if}
</aside>
{/if}
<main class="flex-1 min-w-0 overflow-auto">
+281
View File
@@ -0,0 +1,281 @@
<script>
// ASME/법령 절-KB — 코드북·공부 리더 (r2). parent 표준/법령을 한 권의 책처럼.
// 좌 인덱스(Part/章→절/조) · 중 본문(MarkdownDoc=공식·표·이미지) · breadcrumb·이전다음·양방향 백링크.
import { onMount, tick } from 'svelte';
import { page } from '$app/stores';
import { goto } from '$app/navigation';
import { api } from '$lib/api';
import MarkdownDoc from '$lib/components/MarkdownDoc.svelte';
let parentId = $state(null);
let parentTitle = $state('');
let clauses = $state([]);
let selectedId = $state(null);
let clauseDoc = $state(null);
let links = $state(null);
let expanded = $state({});
let loading = $state(false);
let q = $state('');
// 공부도구 (노트/형광펜/암기카드) — clause_study
let studyItems = $state([]);
let studyOpen = $state(false);
let noteDraft = $state('');
const KLABEL = { note: '노트', highlight: '형광펜', card: '암기카드' };
async function loadStudy(id) {
try { const r = await api(`/documents/${id}/study`); studyItems = r?.items ?? []; }
catch { studyItems = []; }
}
async function addStudy(kind, payload) {
if (!selectedId) return;
try { await api(`/documents/${selectedId}/study`, { method: 'POST', body: JSON.stringify({ kind, payload }) }); await loadStudy(selectedId); }
catch (e) { console.warn(e); }
}
function selText() { return (typeof window !== 'undefined' && window.getSelection ? window.getSelection().toString() : '').trim(); }
function addNote() { const t = noteDraft.trim(); if (!t) return; addStudy('note', { text: t }); noteDraft = ''; }
function addHighlight() { const s = selText(); if (!s) { studyOpen = true; alert('본문에서 형광펜 칠할 부분을 먼저 드래그하세요'); return; } addStudy('highlight', { text: s }); studyOpen = true; }
function addCard() {
const s = selText();
const code = links?.clause_code ?? selMeta?.clause_code ?? '';
addStudy('card', { cue: `${code} ${strip(clauseDoc?.title, code)}`.trim(), fact: s || (clauseDoc?.md_content ?? clauseDoc?.extracted_text ?? '').replace(/[#*>]/g, '').slice(0, 280).trim() });
studyOpen = true;
}
async function delStudy(id) {
try { await api(`/documents/${selectedId}/study/${id}`, { method: 'DELETE' }); await loadStudy(selectedId); } catch {}
}
let parts = $derived.by(() => {
const out = [], idx = {};
for (const c of clauses) {
const p = c.clause_part || '·';
if (!(p in idx)) { idx[p] = out.length; out.push({ part: p, items: [] }); }
out[idx[p]].items.push(c);
}
return out;
});
let visibleParts = $derived.by(() => {
const term = q.trim().toLowerCase();
if (!term) return parts;
return parts.map(g => ({ part: g.part, items: g.items.filter(c =>
(c.clause_code || '').toLowerCase().includes(term) || (c.title || '').toLowerCase().includes(term)) }))
.filter(g => g.items.length);
});
let selMeta = $derived(clauses.find((c) => c.id === selectedId) || null);
const strip = (t, c) => (t || '').replace(c || '', '').replace(/^[(\s)]+|[(\s)]+$/g, '').trim();
async function loadBook() {
const r = await api(`/documents/${parentId}/clauses`);
parentTitle = r?.parent_title ?? '';
clauses = r?.clauses ?? [];
const e = {};
for (const c of clauses) e[c.clause_part || '·'] = true;
expanded = e;
}
async function loadClause(id) {
if (!id) return;
loading = true;
selectedId = id;
try {
const [d, l] = await Promise.all([api(`/documents/${id}`), api(`/documents/${id}/backlinks`)]);
clauseDoc = d; links = l;
loadStudy(id);
const sel = clauses.find((c) => c.id === id);
if (sel) expanded = { ...expanded, [sel.clause_part || '·']: true };
goto(`/book/${parentId}?c=${id}`, { replaceState: true, keepFocus: true, noScroll: true });
await tick(); window.scrollTo({ top: 0 });
} finally { loading = false; }
}
onMount(async () => {
parentId = Number($page.params.id);
await loadBook();
const c = Number($page.url.searchParams.get('c'));
await loadClause(c && clauses.find((x) => x.id === c) ? c : clauses[0]?.id);
});
</script>
<div class="book">
<!-- top bar -->
<div class="bar">
<span class="brand">절-KB</span>
<span class="crumbs">{parentTitle} {#if selMeta}<b class="sep"></b> {selMeta.clause_part} <b class="sep"></b> <b>{links?.clause_code ?? selMeta.clause_code}</b>{/if}</span>
<div class="search"><input placeholder="절·조 번호 또는 키워드" bind:value={q} /></div>
<div class="tools"><span class="tool on">읽기</span><span class="tool">형광펜</span><span class="tool">노트</span><span class="tool">암기카드</span></div>
</div>
<div class="main">
<!-- left index -->
<aside class="idx">
<a class="btitle" href={`/documents/${parentId}`}>{parentTitle || '표준'}</a>
<div class="bmeta">{clauses.length} · 한 권의 책처럼 탐색</div>
{#each visibleParts as g (g.part)}
<div class="parttab" role="button" tabindex="0" onclick={() => (expanded = { ...expanded, [g.part]: !expanded[g.part] })}>
<span class="bar2"></span><span class="pname">{g.part}</span><span class="ct">{g.items.length}</span>
</div>
{#if expanded[g.part] || q.trim()}
{#each g.items as c (c.id)}
<div class="ci" class:on={c.id === selectedId} role="button" tabindex="0" onclick={() => loadClause(c.id)}>
<span class="no">{c.clause_code}</span><span class="tt">{strip(c.title, c.clause_code)}</span>
</div>
{/each}
{/if}
{/each}
</aside>
<!-- reader -->
<section class="read">
<div class="col">
{#if clauseDoc}
<div class="studybar">
<button class="sbtn" title="선택 형광펜" onclick={addHighlight}>▰</button>
<button class="sbtn" class:on={studyOpen} title="노트/공부" onclick={() => (studyOpen = !studyOpen)}>✎</button>
<button class="sbtn" title="암기카드 추가" onclick={addCard}></button>
{#if studyItems.length}<span class="scount">{studyItems.length}</span>{/if}
</div>
<div class="kicker"><span class="pth">{selMeta?.clause_part}</span></div>
<div class="h-no">{links?.clause_code ?? selMeta?.clause_code}</div>
<h1 class="h-title">{strip(clauseDoc.title, links?.clause_code ?? '')}</h1>
<div class="flow">
<button class="fl" disabled={!links?.prev} onclick={() => loadClause(links?.prev?.id)}>{links?.prev?.clause_code ?? ''}</button>
<button class="fl next" disabled={!links?.next} onclick={() => loadClause(links?.next?.id)}>{links?.next?.clause_code ?? ''}</button>
</div>
{#key clauseDoc.id}
<div class="docbody">
<MarkdownDoc documentId={clauseDoc.id} mdContent={clauseDoc.md_content ?? clauseDoc.extracted_text} mdStatus={null} class="prose prose-base max-w-none" />
</div>
{/key}
{#if links && (links.forward.length || links.back.length)}
<section class="conn">
{#if links.forward.length}
<div><h4>이 절이 참조 <span>{links.forward.length}</span></h4>
<div class="chiprow">{#each links.forward as f}
{#if f.doc_id}<button class="ref" onclick={() => loadClause(f.doc_id)}>{f.code}</button>
{:else}<span class="ref dg" title="외부/미분해">{f.code}</span>{/if}
{/each}</div></div>
{/if}
{#if links.back.length}
<div><h4>이 절을 참조 <span>{links.back.length}</span></h4>
<div class="chiprow">{#each links.back as b}<button class="ref" onclick={() => loadClause(b.doc_id)}>{b.code}</button>{/each}</div></div>
{/if}
</section>
{/if}
{#if studyOpen}
<section class="study">
<div class="slab">공부 — 노트 · 형광펜 · 암기카드{#if studyItems.length} <span>{studyItems.length}</span>{/if}</div>
<div class="noteadd">
<textarea bind:value={noteDraft} placeholder="이 절에 노트…" rows="2"></textarea>
<button class="nbtn" onclick={addNote}>노트 저장</button>
</div>
{#if studyItems.length}
<ul class="slist">
{#each studyItems as it (it.id)}
<li class="sitem">
<span class="skind k-{it.kind}">{KLABEL[it.kind] ?? it.kind}</span>
<span class="stext">{it.payload?.text ?? it.payload?.cue ?? ''}</span>
<button class="sdel" title="삭제" onclick={() => delStudy(it.id)}>×</button>
</li>
{/each}
</ul>
{:else}
<p class="shint">본문을 드래그한 뒤 형광펜(▰)/암기카드(+), 또는 위에 노트를 적으세요.</p>
{/if}
</section>
{/if}
<div class="pager">
<button class="pg" disabled={!links?.prev} onclick={() => loadClause(links?.prev?.id)}>
<div class="d">← 이전</div><div class="t"><span class="pno">{links?.prev?.clause_code ?? '—'}</span> {strip(links?.prev?.title, links?.prev?.clause_code)}</div></button>
<button class="pg next" disabled={!links?.next} onclick={() => loadClause(links?.next?.id)}>
<div class="d">다음 →</div><div class="t"><span class="pno">{links?.next?.clause_code ?? '—'}</span> {strip(links?.next?.title, links?.next?.clause_code)}</div></button>
</div>
{:else}
<p class="empty">{loading ? '불러오는 중…' : '왼쪽에서 절을 선택하세요'}</p>
{/if}
</div>
</section>
</div>
</div>
<style>
:global(body) { background: var(--bg); }
.book { --paper:#fbfcf9; --serif:"Iowan Old Style","Palatino Linotype","Noto Serif KR",Georgia,serif;
display:flex; flex-direction:column; min-height:100vh; }
.bar { display:flex; align-items:center; gap:14px; height:50px; padding:0 18px; background:var(--paper); border-bottom:1px solid var(--border); }
.brand { font-weight:700; font-size:13.5px; color:var(--text); }
.crumbs { color:var(--text-dim); font-size:12.5px; overflow:hidden; text-overflow:ellipsis; white-space:nowrap; max-width:46%; }
.crumbs b { color:var(--text); font-weight:600; } .crumbs .sep { color:#c8d6c0; margin:0 5px; }
.search { margin-left:auto; }
.search input { width:280px; background:var(--surface); border:1px solid var(--border); border-radius:9px; padding:7px 12px; font-size:13px; color:var(--text); outline:none; }
.search input:focus { border-color:var(--accent); }
.tools { display:flex; gap:2px; }
.tool { font-size:12px; color:var(--text-dim); padding:6px 10px; border-radius:8px; border:1px solid transparent; cursor:pointer; }
.tool:hover { background:var(--surface); } .tool.on { background:#ecf0e8; border-color:var(--border); color:var(--accent-hover); font-weight:600; }
.main { display:flex; align-items:flex-start; flex:1; }
.idx { width:264px; flex-shrink:0; align-self:stretch; border-right:1px solid var(--border);
background:linear-gradient(180deg,#f6f8f3,#f1f4ec); padding:16px 10px 30px 16px; position:sticky; top:0; max-height:100vh; overflow:auto; }
.btitle { display:block; font-family:var(--serif); font-size:15.5px; font-weight:600; color:var(--text); text-decoration:none; line-height:1.32; }
.btitle:hover { text-decoration:underline; }
.bmeta { font-size:11px; color:#9aa090; margin:3px 0 14px; }
.parttab { display:flex; align-items:center; gap:8px; margin:11px 0 4px; padding:3px 4px; border-radius:6px; cursor:pointer;
font-size:11px; font-weight:700; letter-spacing:.5px; color:var(--text-dim); text-transform:uppercase; }
.parttab:hover { background:#fff; } .parttab .bar2 { width:3px; height:12px; border-radius:2px; background:var(--domain-engineering); }
.parttab .pname { flex:1; overflow:hidden; text-overflow:ellipsis; white-space:nowrap; } .parttab .ct { color:#9aa090; font-weight:600; letter-spacing:0; }
.ci { display:flex; gap:9px; align-items:baseline; padding:4px 9px; border-radius:7px; cursor:pointer; line-height:1.4; }
.ci .no { font-family:ui-monospace,Menlo,monospace; font-size:11px; color:var(--accent); font-weight:600; min-width:52px; white-space:nowrap; }
.ci .tt { font-size:12.5px; color:var(--text-dim); overflow:hidden; text-overflow:ellipsis; }
.ci:hover { background:#fff; }
.ci.on { background:#fff; box-shadow:inset 3px 0 0 var(--accent), 0 1px 2px rgba(35,41,31,.05); }
.ci.on .no { color:var(--accent-hover); font-weight:700; } .ci.on .tt { color:var(--text); font-weight:600; }
.read { flex:1; min-width:0; padding:34px 40px 80px; }
.col { max-width:680px; margin:0 auto; position:relative; }
.studybar { position:absolute; right:-30px; top:4px; display:flex; flex-direction:column; gap:6px; }
.sbtn { width:34px; height:34px; border-radius:9px; border:1px solid var(--border); background:var(--paper); color:var(--text-dim); font-size:13px; cursor:pointer; }
.sbtn:hover { background:var(--surface); color:var(--accent-hover); }
.kicker { margin-bottom:5px; } .kicker .pth { font-size:11.5px; color:#9aa090; font-weight:600; letter-spacing:.3px; }
.h-no { font-family:ui-monospace,Menlo,monospace; font-size:13px; color:var(--accent); font-weight:700; letter-spacing:.5px; }
.h-title { font-family:var(--serif); font-size:26px; line-height:1.24; font-weight:600; margin:2px 0 14px; letter-spacing:-.2px; color:var(--text); }
.flow { display:flex; justify-content:space-between; gap:8px; margin-bottom:18px; }
.flow .fl { font-size:11.5px; color:var(--text-dim); background:var(--surface); border:1px solid var(--border); border-radius:8px; padding:5px 11px; cursor:pointer; }
.flow .fl:hover:not(:disabled) { background:#ecf0e8; } .flow .fl:disabled { opacity:.35; cursor:default; }
.docbody { font-size:15.5px; }
.docbody :global(.prose) { color:#2a3024; line-height:1.78; }
.docbody :global(.prose h1), .docbody :global(.prose h2), .docbody :global(.prose h3) { font-family:var(--serif); }
.docbody :global(a) { color:var(--accent-hover); }
.conn { margin-top:34px; padding-top:18px; border-top:1px solid var(--border); display:grid; grid-template-columns:1fr 1fr; gap:22px; }
.conn h4 { font-size:11px; font-weight:700; color:var(--text-dim); letter-spacing:.4px; margin:0 0 9px; } .conn h4 span { color:#9aa090; font-weight:500; }
.chiprow { display:flex; flex-wrap:wrap; gap:5px; }
.ref { font-family:ui-monospace,Menlo,monospace; font-size:11.5px; font-weight:600; color:var(--accent-hover); background:#eef4ec; border:1px solid #d9e6d8; border-radius:6px; padding:2px 8px; cursor:pointer; }
.ref:hover { background:#e2efe0; } .ref.dg { color:#9aa090; background:var(--surface); border-color:var(--border); cursor:default; }
.pager { display:flex; gap:10px; margin-top:30px; }
.pg { flex:1; text-align:left; border:1px solid var(--border); border-radius:11px; padding:11px 14px; background:var(--paper); cursor:pointer; }
.pg.next { text-align:right; } .pg:hover:not(:disabled) { border-color:#cfd7c6; background:#fff; } .pg:disabled { opacity:.4; cursor:default; }
.pg .d { font-size:10.5px; color:#9aa090; } .pg .t { font-size:12.5px; color:var(--text-dim); font-weight:600; margin-top:1px; overflow:hidden; text-overflow:ellipsis; white-space:nowrap; }
.pg .pno { font-family:ui-monospace,Menlo,monospace; color:var(--accent); }
.empty { color:#9aa090; text-align:center; padding:80px 0; }
.sbtn.on { background:#ecf0e8; color:var(--accent-hover,#3d7256); border-color:var(--border); }
.scount { font-size:9px; font-weight:700; color:#fff; background:var(--accent,#4f8a6b); border-radius:8px; padding:1px 5px; text-align:center; }
.study { margin-top:24px; padding:14px; border:1px solid var(--border); border-radius:12px; background:var(--surface); }
.slab { font-size:11px; font-weight:700; color:var(--text-dim); letter-spacing:.3px; margin-bottom:9px; }
.slab span { color:var(--accent-hover,#3d7256); }
.noteadd { display:flex; gap:8px; align-items:flex-end; margin-bottom:10px; }
.noteadd textarea { flex:1; resize:vertical; border:1px solid var(--border); border-radius:8px; padding:7px 9px; font-size:12.5px; font-family:inherit; color:var(--text); background:var(--paper,#fbfcf9); outline:none; }
.noteadd textarea:focus { border-color:var(--accent); }
.nbtn { flex-shrink:0; font-size:12px; color:#fff; background:var(--accent,#4f8a6b); border:0; border-radius:8px; padding:8px 12px; cursor:pointer; }
.nbtn:hover { background:var(--accent-hover,#3d7256); }
.slist { list-style:none; margin:0; padding:0; display:flex; flex-direction:column; gap:5px; }
.sitem { display:flex; align-items:baseline; gap:8px; padding:6px 8px; border-radius:8px; background:var(--paper,#fbfcf9); border:1px solid var(--border); }
.skind { flex-shrink:0; font-size:9.5px; font-weight:700; border-radius:4px; padding:1px 6px; }
.k-note { color:#3d7256; background:#e3efe2; border:1px solid #cfe3cd; }
.k-highlight { color:#8a6306; background:#faf3e2; border:1px solid #ecdca3; }
.k-card { color:#1d4ed8; background:#eef4fc; border:1px solid #d7e4f7; }
.stext { flex:1; font-size:12px; line-height:1.5; color:var(--text); white-space:pre-wrap; word-break:break-word; }
.sdel { flex-shrink:0; background:none; border:0; color:var(--faint,#9aa090); cursor:pointer; font-size:14px; }
.sdel:hover { color:var(--error,#c0392b); }
.shint { font-size:11.5px; color:var(--faint,#9aa090); margin:0; }
@media(max-width:820px){ .idx{display:none} .read{padding:24px 18px} .conn{grid-template-columns:1fr} .studybar{position:static;flex-direction:row} .crumbs{max-width:30%} .search input{width:150px} }
</style>
@@ -16,6 +16,7 @@
import Skeleton from '$lib/components/ui/Skeleton.svelte';
import HandwriteCanvas from '$lib/components/HandwriteCanvas.svelte';
import MarkdownDoc from '$lib/components/MarkdownDoc.svelte';
import RelatedDocs from '$lib/components/RelatedDocs.svelte';
import { renderDocMarkdown } from '$lib/utils/docMarkdown';
import MarkdownStatusBadge from '$lib/components/MarkdownStatusBadge.svelte';
import NoteEditor from '$lib/components/editors/NoteEditor.svelte';
@@ -321,6 +322,7 @@
<!-- ════ 우 슬림 레일 (시안 카드 스타일) ════ -->
{#snippet rail()}
<div style="display:flex;flex-direction:column;gap:11px;font-size:14px;">
<RelatedDocs documentId={doc.id} />
{#if doc.ai_tldr || doc.ai_summary}
<div style="background:#f4f7f1;border:1px solid #dde3d6;border-radius:14px;padding:13px;">
<div style="font-size:10.5px;font-weight:700;color:#697061;letter-spacing:.4px;margin-bottom:7px;">TL;DR</div>
+102 -4
View File
@@ -1,13 +1,52 @@
<script>
// /study — 학습 hub.
// 주제로 보기(퀴즈·복습·통계) / 자료 학습 / 필사 세션 / 암기카드 검수.
// /study — 학습 hub + 데일리 랜딩('오늘의 공부' 대시보드).
// 상단 = 이론 홈(진도·오늘의 개념·복습 due, 재노출 트리거). 하단 = 기존 모드 진입.
import { onMount } from 'svelte';
import { api } from '$lib/api';
import { BookOpen, PenLine, GraduationCap, FolderKanban, Layers, Repeat, Flag, Inbox, Activity } from 'lucide-svelte';
import { addToast } from '$lib/stores/toast';
import { BookOpen, PenLine, GraduationCap, FolderKanban, Layers, Repeat, Flag, Inbox, Activity, CalendarCheck } from 'lucide-svelte';
let cardReviewCount = $state(0);
let questionFlagCount = $state(0);
// 오늘의 공부 (이론 홈)
let curriculum = $state(null);
let todayConcepts = $state([]);
let dashLoading = $state(true);
let readPct = $derived(
curriculum && curriculum.total ? Math.round((curriculum.read / curriculum.total) * 100) : 0
);
async function loadDashboard() {
dashLoading = true;
try {
const [cur, today] = await Promise.all([
api('/study/curriculum'),
api('/study/today-concepts?limit=6'),
]);
curriculum = cur;
todayConcepts = today?.concepts ?? [];
} catch {
// 대시보드 실패해도 허브 나머지는 동작 (조용히)
} finally {
dashLoading = false;
}
}
async function markRead(doc) {
try {
await api(`/study/concepts/${doc.doc_id}/read`, { method: 'POST' });
todayConcepts = todayConcepts.filter((c) => c.doc_id !== doc.doc_id);
addToast('success', `회독: ${doc.title}`);
loadDashboard(); // 진도 갱신
} catch {
addToast('error', '회독 처리 실패');
}
}
onMount(async () => {
loadDashboard();
try {
const r = await api('/study-cards/needs-review/count');
cardReviewCount = r?.count ?? 0;
@@ -27,6 +66,64 @@
<p class="text-sm text-dim mt-1">주제별 퀴즈·복습(SRS)·통계 / 학습 자료 회독 / 손글씨 필사 세션.</p>
</header>
<!-- 오늘의 공부 (이론 홈 대시보드 = 데일리 트리거) -->
<section class="mb-5 rounded-lg border border-default bg-surface p-4 md:p-5">
<div class="flex items-center gap-2 mb-3">
<CalendarCheck size={18} class="text-accent" />
<h2 class="text-base font-semibold text-text">오늘의 공부</h2>
{#if curriculum}
<span class="ml-auto text-xs text-dim">이론 회독 <span class="text-text font-medium">{curriculum.read}</span> / {curriculum.total} ({readPct}%)</span>
{/if}
</div>
{#if dashLoading}
<p class="text-xs text-dim">불러오는 중…</p>
{:else}
{#if curriculum}
<div class="h-2 rounded-full bg-bg overflow-hidden mb-3">
<div class="h-full bg-accent" style="width: {readPct}%"></div>
</div>
<div class="flex flex-wrap gap-x-4 gap-y-1 mb-4 text-xs text-dim">
{#each curriculum.subjects as s}
<span>{s.subject} <span class="text-text">{s.read}/{s.total}</span></span>
{/each}
</div>
<div class="flex flex-wrap gap-2 mb-4">
<a
href="/study/topics/{curriculum.topic_id}/review-queue"
class="flex items-center gap-1.5 rounded border border-default px-3 py-1.5 text-xs text-dim hover:border-accent hover:text-text transition-colors"
>
<Repeat size={13} /> 문항 복습 <span class="font-semibold text-text">{curriculum.question_due}</span>
</a>
<span class="flex items-center gap-1.5 rounded border border-default px-3 py-1.5 text-xs text-dim">
<BookOpen size={13} /> 개념 재복습 <span class="font-semibold text-text">{curriculum.concept_due}</span>
</span>
</div>
{/if}
<div class="text-xs text-dim mb-2">오늘의 개념</div>
{#if todayConcepts.length === 0}
<p class="text-xs text-dim">오늘 볼 개념이 없습니다. 잘 하고 있어요.</p>
{:else}
<ul class="space-y-1.5">
{#each todayConcepts as c (c.doc_id)}
<li class="flex items-center gap-2 rounded border border-default px-3 py-2">
<span class="text-accent shrink-0 text-xs" title="빈출">{#each Array(c.freq) as _}{/each}</span>
<a href="/documents/{c.doc_id}" class="text-sm text-text hover:text-accent truncate flex-1">{c.title}</a>
<span class="shrink-0 text-[10px] rounded-full px-2 py-0.5 {c.reason === '재복습' ? 'bg-accent/15 text-accent' : 'bg-surface border border-default text-dim'}">{c.reason}</span>
<button
type="button"
onclick={() => markRead(c)}
class="shrink-0 text-xs rounded border border-default px-2 py-1 text-dim hover:border-accent hover:text-accent transition-colors"
>읽음</button>
</li>
{/each}
</ul>
{/if}
{/if}
</section>
<a
href="/study/topics"
class="block mb-3 p-5 rounded-lg border border-default bg-surface hover:border-accent hover:bg-accent/5 transition-colors"
@@ -126,7 +223,8 @@
<div class="mt-6 p-4 rounded-lg border border-dashed border-default/60 text-xs text-dim">
<div class="font-medium text-dim mb-1">예정</div>
<ul class="list-disc list-inside space-y-0.5">
<li>애플워치 빠른복습 + 공부 알람(push)</li>
<li>개념 학습 리더 (가리고 떠올리기 · 빈출★ · 관련개념 백링크)</li>
<li>이론↔문제 연결 (개념별 정답률 · 약점 개념 지도)</li>
</ul>
</div>
</div>
+23
View File
@@ -0,0 +1,23 @@
-- 377_domain_bucket.sql
-- ai_domain(반자유 AI 분류, 드리프트 존재)을 검색 스코프용 7버킷으로 결정적 롤업.
-- 축: ai_domain(routing/해석 축)의 coarsening — category(UI축) 아님 (feedback_category_vs_ai_domain_axis 준수).
-- 버킷: News / Safety / Law / Engineering / General / Philosophy / Programming.
-- STORED generated → 신규/재분류 문서도 ai_domain 붙으면 자동 버킷. ai_domain 원본 보존(하위 검색 유지).
-- 롤백: ALTER TABLE documents DROP COLUMN domain_bucket;
ALTER TABLE documents ADD COLUMN IF NOT EXISTS domain_bucket text
GENERATED ALWAYS AS (
CASE
WHEN ai_domain LIKE 'News%' THEN 'News'
WHEN ai_domain = '법령' OR ai_domain LIKE 'Industrial_Safety/Legislation%' THEN 'Law'
WHEN ai_domain = 'Safety' OR ai_domain LIKE 'Safety/%'
OR ai_domain LIKE 'Industrial_Safety%'
OR ai_domain = 'Knowledge/Industrial_Safety' THEN 'Safety'
WHEN ai_domain LIKE 'Engineering%' OR ai_domain = 'Knowledge/Engineering' THEN 'Engineering'
WHEN ai_domain LIKE 'Philosophy%' THEN 'Philosophy'
WHEN ai_domain LIKE 'Programming%' THEN 'Programming'
ELSE 'General'
END
) STORED;
CREATE INDEX IF NOT EXISTS documents_domain_bucket_idx
ON documents (domain_bucket) WHERE deleted_at IS NULL;
@@ -0,0 +1,9 @@
-- 378_publish_outbox_attempts_failed.sql
-- (번호: 멀티세션 중 prod 가 377_domain_bucket 을 선점 → 378 로 리넘버.)
-- publish_outbox poison row head-of-line block 차단. 발행 워커가 행별 savepoint 격리 후
-- 예외 시 attempts++ 하고 MAX 초과 시 failed_at 스탬프(terminal) → 그 행을 select 에서 제외해
-- 후속 발행이 막히지 않게 함. 기존 미처리 행은 attempts=0 / failed_at=NULL 로 정상 재처리.
-- (단일 ALTER = 1 statement = asyncpg prepared 호환.)
ALTER TABLE publish_outbox
ADD COLUMN IF NOT EXISTS attempts SMALLINT NOT NULL DEFAULT 0,
ADD COLUMN IF NOT EXISTS failed_at TIMESTAMPTZ;
+37
View File
@@ -0,0 +1,37 @@
-- 379_asme_clause_kb.sql
-- ASME 절-지식베이스: 절 = 개별 documents 행(parent_id) + 절↔절 백링크 + 태깅 (additive, idempotent)
-- 검색 무접촉: 절 doc 은 embedding NULL(벡터 제외) + doc_kind='clause'(retrieval doc-leg 필터로 제외).
ALTER TABLE documents
ADD COLUMN IF NOT EXISTS parent_id bigint REFERENCES documents(id),
ADD COLUMN IF NOT EXISTS doc_kind text NOT NULL DEFAULT 'standard',
ADD COLUMN IF NOT EXISTS clause_code text,
ADD COLUMN IF NOT EXISTS clause_part text,
ADD COLUMN IF NOT EXISTS clause_order int;
CREATE INDEX IF NOT EXISTS idx_documents_parent_id ON documents(parent_id) WHERE parent_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_documents_doc_kind ON documents(doc_kind);
CREATE INDEX IF NOT EXISTS idx_documents_clause_code ON documents(clause_code) WHERE clause_code IS NOT NULL;
-- 절↔절 백링크 (dangling 허용: dst_doc_id nullable)
CREATE TABLE IF NOT EXISTS clause_links (
id bigserial PRIMARY KEY,
src_doc_id bigint NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
dst_code text NOT NULL,
dst_doc_id bigint REFERENCES documents(id) ON DELETE SET NULL,
anchor text,
ctx text,
char_off int
);
CREATE INDEX IF NOT EXISTS idx_clause_links_src ON clause_links(src_doc_id);
CREATE INDEX IF NOT EXISTS idx_clause_links_dst ON clause_links(dst_doc_id) WHERE dst_doc_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_clause_links_dstcode ON clause_links(dst_code);
-- 태깅 (Part 자동 + 주제)
CREATE TABLE IF NOT EXISTS document_tags (
doc_id bigint NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
tag text NOT NULL,
tag_kind text NOT NULL DEFAULT 'topic',
PRIMARY KEY (doc_id, tag)
);
CREATE INDEX IF NOT EXISTS idx_document_tags_tag ON document_tags(tag);
+9
View File
@@ -0,0 +1,9 @@
-- 380_clause_study.sql — 절-문서 공부도구(노트/형광펜/암기카드) 저장. FK 없음(documents 락 회피).
CREATE TABLE IF NOT EXISTS clause_study (
id bigserial PRIMARY KEY,
doc_id bigint NOT NULL,
kind text NOT NULL, -- 'note' | 'highlight' | 'card'
payload jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX IF NOT EXISTS idx_clause_study_doc ON clause_study(doc_id, kind);
+16
View File
@@ -0,0 +1,16 @@
-- 381_study_concept_progress.sql — 이론 개념(문서) 간격반복(SR) 진행. 이론공부 홈 트리거.
-- concept_doc_id 는 documents.id 를 가리키나 FK 미설정(hot 테이블 락 회피, clause_study 380 선례).
-- SR 산술은 study_question_progress 와 동일(sr_schedule 공용): stage 0→1→2→3(1·3·7·14일)→4 졸업.
CREATE TABLE IF NOT EXISTS study_concept_progress (
id bigserial PRIMARY KEY,
user_id bigint NOT NULL REFERENCES users(id) ON DELETE CASCADE,
study_topic_id bigint NOT NULL REFERENCES study_topics(id) ON DELETE CASCADE,
concept_doc_id bigint NOT NULL,
review_stage smallint,
due_at timestamptz,
last_read_at timestamptz,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
CONSTRAINT uq_concept_progress_user_doc UNIQUE (user_id, concept_doc_id)
);
CREATE INDEX IF NOT EXISTS idx_concept_progress_due ON study_concept_progress(user_id, due_at) WHERE due_at IS NOT NULL;
+53
View File
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
"""ASME clause-KB backlinks: resolve clause-id mentions in each clause doc -> clause_links.
dst resolved to the clause doc of the same parent (top-level code); sub-code mention -> anchor;
unresolved (cross-standard / material spec not split) -> dangling (dst_doc_id NULL).
Idempotent per parent. Usage: python3 asme_backlinks_persist.py <parent_id> [--commit]
"""
import asyncio, os, re, sys
MENTION_RE = re.compile(r'(?<![A-Za-z0-9])([A-Z]{1,4}-\d+(?:\.\d+)*[A-Za-z]?)(?![A-Za-z0-9])')
def top(code): return re.match(r'^[A-Z]{1,4}-\d+', code).group(0)
async def main():
parent = int(sys.argv[1]); commit = '--commit' in sys.argv
import asyncpg
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
docs = await conn.fetch("SELECT id, clause_code, md_content FROM documents "
"WHERE parent_id=$1 AND doc_kind='clause' ORDER BY clause_order", parent)
code2id = {d['clause_code']: d['id'] for d in docs}
edges = [] # (src_id, dst_code, dst_doc_id, anchor, ctx, char_off)
resolved = dangling = 0
for d in docs:
body = d['md_content']; src_top = d['clause_code']
seen = set()
for m in MENTION_RE.finditer(body):
code = m.group(1); t = top(code)
if t == src_top: continue # self-reference
if (d['id'], code) in seen: continue # dedup per (src,dst_code)
seen.add((d['id'], code))
dst_id = code2id.get(t) # resolve to same-parent clause doc
anchor = code.lower().replace('.', '-') if code != t else None
off = m.start()
ctx = re.sub(r'\s+', ' ', body[max(0, off-50):off+50]).strip()
edges.append((d['id'], code, dst_id, anchor, ctx, off))
if dst_id: resolved += 1
else: dangling += 1
print(f"parent={parent} clause_docs={len(docs)} edges={len(edges)} resolved={resolved} dangling={dangling}")
# top referenced clauses
from collections import Counter
tgt = Counter(top(e[1]) for e in edges if e[2])
print("most-referenced:", tgt.most_common(8))
if not commit:
print("DRY-RUN. pass --commit to persist."); await conn.close(); return
async with conn.transaction():
ids = [d['id'] for d in docs]
await conn.execute("DELETE FROM clause_links WHERE src_doc_id = ANY($1::bigint[])", ids)
await conn.executemany(
"INSERT INTO clause_links(src_doc_id,dst_code,dst_doc_id,anchor,ctx,char_off) "
"VALUES ($1,$2,$3,$4,$5,$6)", edges)
n = await conn.fetchval("SELECT count(*) FROM clause_links WHERE src_doc_id = ANY($1::bigint[])", ids)
print(f"COMMITTED: {n} clause_links for parent {parent}")
await conn.close()
asyncio.run(main())
+118
View File
@@ -0,0 +1,118 @@
#!/usr/bin/env python3
"""ASME clause-KB persist (v2: over-CAP pagination). Split a parent standard into per-clause
documents (A-granularity); over-CAP clause bodies are paginated into readable page-docs.
Idempotent per parent. doc_kind='clause', embedding NULL (search-excluded), parent_id=<parent>.
Usage: python3 asme_clause_persist.py <parent_id> [--commit]
"""
import asyncio, os, re, sys, hashlib, statistics
CAP = 12000; PAGE_TOK = 11000
EN, KO = 0.217, 0.529
LINE_RE = re.compile(r'^([ \t#>*]{0,8})([A-Z]{2,4}-\d+(?:\.\d+)*[A-Za-z]?)(.*)$')
MENTION_RE = re.compile(r'(?<![A-Za-z0-9])([A-Z]{1,4}-\d+(?:\.\d+)*[A-Za-z]?)(?![A-Za-z0-9])')
EXACT_TOP = re.compile(r'^[A-Z]{2,4}-\d+$')
TITLE_AFTER = re.compile(r'^[\s.]*[A-Z(]')
REF_LEAD = re.compile(r'^[\s.]*(and|or|to|of|in|on|the|as|is|are|shall|through|per|see|with|'
r'for|by|that|which|such|또는|및|등|의|은|는|에|을|를|과|와)\b', re.I)
def tok(s):
ko = sum(1 for c in s if '' <= c <= ''); return int((len(s)-ko)*EN + ko*KO)
def clean_title(rest):
t = re.sub(r'<sup>ð</sup>\s*\**\d*\**\s*<sup>Þ</sup>', '', rest)
t = re.sub(r'ð\**\d*\**Þ', '', t)
t = t.replace('**', '').replace('#', '')
return re.sub(r'\s+', ' ', t).strip(' *:—-')
def is_header(markup, rest):
if '#' in markup or '*' in markup: return True
rs = rest.strip()
if rs == '': return True
if REF_LEAD.match(rest): return False
if rs[0] in ',;.)': return False
if '' <= rs[0] <= '': return False
if rs[0].islower(): return False
return bool(TITLE_AFTER.match(rs))
def paginate(body):
"""split an over-CAP body into <=MAX_PAGES line-aligned pages of ~PAGE_TOK tokens."""
pages, cur, ct = [], [], 0
for ln in body.split('\n'):
lt = tok(ln) + 1
if ct + lt > PAGE_TOK and cur:
pages.append('\n'.join(cur)); cur, ct = [ln], lt
else:
cur.append(ln); ct += lt
if cur: pages.append('\n'.join(cur))
return pages
def build_clauses(text):
lines = text.split('\n'); off = []; a = 0
for ln in lines: off.append(a); a += len(ln) + 1
bounds = []; seen = set()
for i, ln in enumerate(lines):
m = LINE_RE.match(ln)
if not m: continue
markup, code, rest = m.group(1), m.group(2), m.group(3)
if not EXACT_TOP.match(code): continue
if not is_header(markup, rest): continue
if code in seen: continue
seen.add(code); bounds.append((off[i], code, clean_title(rest)))
raw = []
for idx, (start, code, title) in enumerate(bounds):
end = bounds[idx+1][0] if idx+1 < len(bounds) else len(text)
body = text[start:end]
part = re.match(r'^[A-Z]{2,4}', code).group(0)
links = sorted(set(re.match(r'^[A-Z]{1,4}-\d+', mm).group(0)
for mm in MENTION_RE.findall(body)) - {code})
raw.append(dict(code=code, part=part, title=(code + (' ' + title if title else '')),
body=body, tok=tok(body), links=links))
# expand over-CAP into pages; assign running clause_order
final, order = [], 0
for c in raw:
if c['tok'] <= CAP:
final.append({**c, 'order': order}); order += 1; continue
pages = paginate(c['body'])
for pi, pb in enumerate(pages):
code = c['code'] if pi == 0 else f"{c['code']}·p{pi+1}"
title = c['title'] if pi == 0 else f"{c['title']} (페이지 {pi+1}/{len(pages)})"
final.append(dict(code=code, part=c['part'], order=order, title=title,
body=pb, tok=tok(pb), links=c['links'] if pi == 0 else []))
order += 1
return final
async def main():
parent = int(sys.argv[1]); commit = '--commit' in sys.argv
import asyncpg
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
row = await conn.fetchrow("SELECT md_content, ai_domain, data_origin FROM documents WHERE id=$1", parent)
if not row: print(f"parent {parent} not found"); return
clauses = build_clauses(row['md_content'])
toks = [c['tok'] for c in clauses]
over = [c for c in clauses if c['tok'] > CAP]
print(f"parent={parent} clause_docs={len(clauses)} median_tok={int(statistics.median(toks))} "
f"max_tok={max(toks)} over_cap_remaining={len(over)}")
if over: print("still over-CAP:", [f"{c['code']}:{c['tok']}t" for c in over])
if not commit:
print("DRY-RUN. pass --commit to persist."); await conn.close(); return
async with conn.transaction():
deld = await conn.execute("DELETE FROM documents WHERE parent_id=$1 AND doc_kind='clause'", parent)
print("deleted prior:", deld)
for c in clauses:
fh = hashlib.sha256(f"{parent}:{c['code']}:{c['body']}".encode()).hexdigest()
cid = await conn.fetchval("""
INSERT INTO documents
(file_format, file_hash, title, md_content, parent_id, doc_kind,
clause_code, clause_part, clause_order, ai_domain, data_origin,
md_status, review_status, conversion_status, preview_status)
VALUES ('md',$1,$2,$3,$4,'clause',$5,$6,$7,$8,$9,'success','approved','none','none')
RETURNING id
""", fh, c['title'], c['body'], parent, c['code'], c['part'], c['order'],
row['ai_domain'], row['data_origin'] or 'external')
await conn.execute("INSERT INTO document_tags(doc_id,tag,tag_kind) VALUES ($1,$2,'part') "
"ON CONFLICT DO NOTHING", cid, c['part'])
n = await conn.fetchval("SELECT count(*) FROM documents WHERE parent_id=$1 AND doc_kind='clause'", parent)
print(f"COMMITTED: {n} clause docs for parent {parent}")
await conn.close()
asyncio.run(main())
+12 -8
View File
@@ -31,8 +31,8 @@ from core.database import async_session
from models.study_memo_card_progress import StudyMemoCardProgress
from services.study.publish_enqueue import backfill_publish_card_progress
# 개인 학습툴 progress row 대비 넉넉. 도달 시 가드 경보.
PAGE = 100000
# 페이지 배치 크기 — after_id 루프로 전량 처리(bounded tx).
PAGE = 5000
async def run(dry_run: bool) -> None:
@@ -50,13 +50,17 @@ async def run(dry_run: bool) -> None:
print("[dry-run] 적재 안 함. 실제 실행은 --dry-run 제거.")
return
async with async_session() as session:
n = await backfill_publish_card_progress(session, after_id=0, limit=PAGE)
await session.commit()
total = 0
after = 0
while True:
async with async_session() as session:
n, after = await backfill_publish_card_progress(session, after_id=after, limit=PAGE)
await session.commit()
total += n
if n < PAGE:
break
print(f"\n[ok] outbox 적재 {n}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
if n >= PAGE:
print(f"[warn] PAGE({PAGE}) 도달 — progress 가 더 있을 수 있음. after_id 페이징 추가 필요.")
print(f"\n[ok] outbox 적재 {total}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
def main() -> None:
+12 -8
View File
@@ -31,8 +31,8 @@ from core.database import async_session
from models.study_memo_card import StudyMemoCard
from services.study.publish_enqueue import backfill_publish_cards
# 개인 학습툴 카드 수 대비 넉넉(단일 outbox 적재 tx, 워커는 BATCH_SIZE 로 drain). 도달 시 가드 경보.
PAGE = 100000
# 페이지 배치 크기 — after_id 루프로 전량 처리(bounded tx). 워커는 BATCH_SIZE 로 drain.
PAGE = 5000
async def run(dry_run: bool) -> None:
@@ -55,13 +55,17 @@ async def run(dry_run: bool) -> None:
print("[dry-run] 적재 안 함. 실제 실행은 --dry-run 제거.")
return
async with async_session() as session:
n = await backfill_publish_cards(session, after_id=0, limit=PAGE)
await session.commit()
total = 0
after = 0
while True:
async with async_session() as session:
n, after = await backfill_publish_cards(session, after_id=after, limit=PAGE)
await session.commit()
total += n
if n < PAGE:
break
print(f"\n[ok] outbox 적재 {n}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
if n >= PAGE:
print(f"[warn] PAGE({PAGE}) 도달 — 카드가 더 있을 수 있음. after_id 페이징 추가 필요.")
print(f"\n[ok] outbox 적재 {total}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
def main() -> None:
+11 -7
View File
@@ -37,7 +37,7 @@ from core.database import async_session
from models.study_topic import StudyTopic
from services.study.publish_enqueue import backfill_publish_topics
# 개인 학습툴 주제 수 대비 넉넉. 도달 시 overflow 가드가 경보.
# 페이지 배치 크기 — after_id 루프로 전량 처리(bounded tx).
PAGE = 5000
@@ -58,13 +58,17 @@ async def run(dry_run: bool) -> None:
print("[dry-run] 적재 안 함. 실제 실행은 --dry-run 제거.")
return
async with async_session() as session:
n = await backfill_publish_topics(session, after_id=0, limit=PAGE)
await session.commit()
total = 0
after = 0
while True:
async with async_session() as session:
n, after = await backfill_publish_topics(session, after_id=after, limit=PAGE)
await session.commit()
total += n
if n < PAGE:
break
print(f"\n[ok] outbox 적재 {n}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
if n >= PAGE:
print(f"[warn] PAGE({PAGE}) 도달 — 주제가 더 있을 수 있음. after_id 페이징 추가 필요.")
print(f"\n[ok] outbox 적재 {total}건 — 발행 워커가 drain(flag on 시) 하며 rev 부여.")
def main() -> None:
+1 -1
View File
@@ -106,7 +106,7 @@ async def main() -> None:
"SELECT count(*) FROM pg_indexes WHERE indexname='uq_attempt_session_question'"))).scalar()
mx = (await s.execute(text("SELECT max(version) FROM schema_migrations"))).scalar()
print(f"SCHEMA OK — max_migration={mx} documents={docs} purge_col={purge} cand_qwen={cand} attempt_uq={uq}")
assert docs and purge == 1 and cand == 0 and uq == 1 and mx == 361, "FAIL: 기대 스키마 상태 불일치"
assert docs and purge == 1 and cand == 0 and uq == 1 and mx == 378, "FAIL: 기대 스키마 상태 불일치"
# ── 5) /health 직접 호출 ──────────────────────────────────────────────
health = await main.health_check()
+100
View File
@@ -0,0 +1,100 @@
#!/usr/bin/env python3
"""기술지침(KOSHA guide) 절-KB persist: 번호섹션(# 1. 목적 / ## 4.1) 단위 분해 + 제본.
ASME/법령과 동일 clause-KB 모델(doc_kind='clause', parent_id=지침, 검색제외, /book 리더 공용).
Usage: python3 guide_clause_persist.py <id|all> [--commit]
"""
import asyncio, os, re, sys, hashlib, statistics
CAP = 12000; PAGE_TOK = 11000
EN, KO = 0.217, 0.529
# 번호섹션 헤더: '# 1. 목 적', '## 4.1 누출...' (번호 1~3자리=연도(4자리) 배제)
ART_RE = re.compile(r'^#{1,6}\s*(\d{1,3}(?:\.\d{1,3})*)\.?\s+(\S.*)$')
TOP_RE = re.compile(r'^\d{1,3}$')
# 외부 표준/법규 참조(대부분 dangling): ASME B16.5 · KS B 1501 · 규칙 제N조
EXT_RE = re.compile(r'(ASME\s+[A-Z][0-9.]+|KS\s+[A-Z]\s*[0-9]+|ISO\s+[0-9]+|제\d+조)')
def tok(s):
ko = sum(1 for c in s if '' <= c <= ''); return int((len(s)-ko)*EN + ko*KO)
def build_sections(text):
lines = text.split('\n'); off = []; a = 0
for ln in lines: off.append(a); a += len(ln) + 1
bounds = []; seen = set()
for i, ln in enumerate(lines):
m = ART_RE.match(ln)
if not m: continue
code, name = m.group(1), m.group(2).strip()
if not TOP_RE.match(code): continue # top-level 번호섹션만 경계
if code in seen: continue
if len(name) < 1: continue
seen.add(code); bounds.append((off[i], code, name))
out = []
for idx, (start, code, name) in enumerate(bounds):
end = bounds[idx+1][0] if idx+1 < len(bounds) else len(text)
body = text[start:end].strip()
ext = sorted(set(EXT_RE.findall(body)))[:8]
out.append(dict(code=code, part='본문', order=0, title=f"{code}. {name}"[:120],
body=body, tok=tok(body), links=[], ext=ext))
# over-CAP 페이지네이션 + 순번
final, order = [], 0
for c in out:
if c['tok'] <= CAP:
final.append({**c, 'order': order}); order += 1; continue
pages, cur, ct = [], [], 0
for ln in c['body'].split('\n'):
lt = tok(ln)+1
if ct+lt > PAGE_TOK and cur: pages.append('\n'.join(cur)); cur=[ln]; ct=lt
else: cur.append(ln); ct+=lt
if cur: pages.append('\n'.join(cur))
for pi, pb in enumerate(pages):
final.append(dict(code=c['code'] if pi==0 else f"{c['code']}·p{pi+1}", part='본문',
order=order, title=c['title'] if pi==0 else f"{c['title']} (p{pi+1})",
body=pb, tok=tok(pb), links=[], ext=[]))
order += 1
return final
async def process_one(conn, gid, commit, verbose=True):
row = await conn.fetchrow("SELECT title, md_content, ai_domain, data_origin FROM documents WHERE id=$1", gid)
if not row: return ('notfound', 0)
if not row['md_content']: return ('nullmd', 0)
secs = build_sections(row['md_content'])
if len(secs) < 2: return ('few', len(secs)) # 섹션 2 미만 = 번호구조 아님
toks = [c['tok'] for c in secs]
if verbose:
print(f"guide={gid} «{(row['title'] or '')[:40]}» 섹션={len(secs)} median={int(statistics.median(toks))} max={max(toks)}")
print(" 샘플:", [c['title'][:26] for c in secs[:7]])
if not commit: return ('dry', len(secs))
async with conn.transaction():
await conn.execute("DELETE FROM clause_links WHERE src_doc_id IN (SELECT id FROM documents WHERE parent_id=$1 AND doc_kind='clause')", gid)
await conn.execute("DELETE FROM documents WHERE parent_id=$1 AND doc_kind='clause'", gid)
for c in secs:
fh = hashlib.sha256(f"{gid}:{c['code']}:{c['body']}".encode()).hexdigest()
cid = await conn.fetchval("""
INSERT INTO documents (file_format,file_hash,title,md_content,parent_id,doc_kind,
clause_code,clause_part,clause_order,ai_domain,data_origin,
md_status,review_status,conversion_status,preview_status)
VALUES ('md',$1,$2,$3,$4,'clause',$5,$6,$7,$8,$9,'success','approved','none','none') RETURNING id
""", fh, c['title'], c['body'], gid, c['code'], c['part'], c['order'], row['ai_domain'], row['data_origin'] or 'external')
await conn.execute("INSERT INTO document_tags(doc_id,tag,tag_kind) VALUES ($1,'기술지침','kind') ON CONFLICT DO NOTHING", cid)
n = await conn.fetchval("SELECT count(*) FROM documents WHERE parent_id=$1 AND doc_kind='clause'", gid)
print(f" COMMITTED: {n} 섹션 for guide {gid}")
return ('committed', len(secs))
async def main():
import asyncpg
arg = sys.argv[1]; commit = '--commit' in sys.argv
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
if arg == 'all':
gs = await conn.fetch("SELECT id FROM documents WHERE material_type='guide' AND doc_kind='standard' "
"AND deleted_at IS NULL AND md_content IS NOT NULL ORDER BY id")
agg = {}; tot = 0
for i, r in enumerate(gs):
st, n = await process_one(conn, r['id'], commit, verbose=False)
agg[st] = agg.get(st, 0)+1; tot += n if st in ('dry','committed') else 0
if commit and (i+1) % 40 == 0: print(f"{i+1}/{len(gs)} (누적섹션 {tot})")
print(f"BATCH {'COMMIT' if commit else 'DRY'} guides={len(gs)} status={agg} 총섹션={tot}")
else:
await process_one(conn, int(arg), commit, verbose=True)
await conn.close()
asyncio.run(main())
+146
View File
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""법령 조-KB persist: 법령을 조(條) 단위 개별 문서로 분해 + 조↔조 백링크 + 장(章) 태그.
ASME clause-KB와 동일 모델(doc_kind='clause', parent_id=법령, embedding NULL, 검색제외).
법령 추출 노이즈( ### 메타 반복) 트림. Usage: python3 law_clause_persist.py <law_id> [--commit]
"""
import asyncio, os, re, sys, hashlib, statistics
CAP = 12000; PAGE_TOK = 11000
EN, KO = 0.217, 0.529
# 조 헤더: '### 제3조의2(가스안전관리...) 본문'
ART_RE = re.compile(r'^#{0,6}\s*(제\d+조(?:의\d+)?)\s*\(([^)]*)\)\s*(.*)$')
CHAP_RE = re.compile(r'^#{1,6}\s*(제\d+장(?:의\d+)?)\s*(.*)$') # 장 = part
# 같은-법 조 멘션(백링크)
MENTION_RE = re.compile(r'\d+조(?:의\d+)?')
# 타법 참조: 「법명」 ... 제N조
EXTLAW_RE = re.compile(r'「([^」]+)」')
def tok(s):
ko = sum(1 for c in s if '' <= c <= ''); return int((len(s)-ko)*EN + ko*KO)
def art_code(c): return c # '제3조의2'
def build_articles(text):
lines = text.split('\n'); off = []; a = 0
for ln in lines: off.append(a); a += len(ln) + 1
arts = [] # (line_idx, code, name, part)
cur_part = None
for i, ln in enumerate(lines):
ch = CHAP_RE.match(ln)
if ch and not ART_RE.match(ln):
cur_part = (ch.group(1) + (' ' + ch.group(2).strip() if ch.group(2).strip() else '')).strip()
continue
m = ART_RE.match(ln)
if m:
arts.append((i, m.group(1), m.group(2).strip(), cur_part))
# 본문 슬라이스 + 다음 조 앞 메타 노이즈 트림
out = []
for idx, (li, code, name, part) in enumerate(arts):
end_li = arts[idx+1][0] if idx+1 < len(arts) else len(lines)
body_lines = lines[li:end_li]
# 트림: 끝에서부터 '### {짧은 메타}' (조번호/조문/날짜/제목, [개정] 제N조 아님) 제거
while len(body_lines) > 1:
last = body_lines[-1].strip()
if last == '':
body_lines.pop(); continue
mh = re.match(r'^#{1,6}\s+(.*)$', last)
if mh:
c = mh.group(1).strip()
if not c.startswith('[') and not c.startswith('') and (
c in ('조문', 'N') or re.fullmatch(r'\d+', c) or re.fullmatch(r'\d{8}', c) or len(c) <= 30):
body_lines.pop(); continue
break
body = '\n'.join(body_lines).strip()
links = sorted(set(MENTION_RE.findall(body)) - {code})
ext = sorted(set(EXTLAW_RE.findall(body)))[:6]
out.append(dict(code=code, part=part or '본칙', order=0,
title=f"{code}({name})" if name else code,
body=body, tok=tok(body), links=links, ext=ext))
# 페이지네이션(over-CAP) + 순번
final, order = [], 0
for c in out:
if c['tok'] <= CAP:
final.append({**c, 'order': order}); order += 1; continue
# 11K 토큰 라인 단위 분할
pages, cur, ct = [], [], 0
for ln in c['body'].split('\n'):
lt = tok(ln)+1
if ct+lt > PAGE_TOK and cur: pages.append('\n'.join(cur)); cur=[ln]; ct=lt
else: cur.append(ln); ct+=lt
if cur: pages.append('\n'.join(cur))
for pi, pb in enumerate(pages):
final.append(dict(code=c['code'] if pi==0 else f"{c['code']}·p{pi+1}", part=c['part'],
order=order, title=c['title'] if pi==0 else f"{c['title']} (p{pi+1}/{len(pages)})",
body=pb, tok=tok(pb), links=c['links'] if pi==0 else [], ext=[]))
order += 1
return final
async def process_one(conn, law, commit, verbose=True):
row = await conn.fetchrow("SELECT title, coalesce(md_content, extracted_text) AS md_content, ai_domain, data_origin FROM documents WHERE id=$1", law)
if not row: return ('notfound', 0, 0)
if not row['md_content']: return ('nullmd', 0, 0)
arts = build_articles(row['md_content'])
if not arts: return ('noart', 0, 0)
toks = [c['tok'] for c in arts]
nlink = sum(len(c['links']) for c in arts)
if verbose:
parts = {}
for c in arts: parts[c['part']] = parts.get(c['part'], 0)+1
print(f"law={law} «{(row['title'] or '')[:34]}» 조문={len(arts)} median={int(statistics.median(toks))} "
f"max={max(toks)} 장={len(parts)} 백링크={nlink}")
print(" 샘플:", [c['title'][:22] for c in arts[:6]])
if not commit:
return ('dry', len(arts), nlink)
async with conn.transaction():
await conn.execute(
"DELETE FROM clause_links WHERE src_doc_id IN (SELECT id FROM documents WHERE parent_id=$1 AND doc_kind='clause')", law)
await conn.execute("DELETE FROM documents WHERE parent_id=$1 AND doc_kind='clause'", law)
code2id = {}
for c in arts:
fh = hashlib.sha256(f"{law}:{c['code']}:{c['body']}".encode()).hexdigest()
cid = await conn.fetchval("""
INSERT INTO documents (file_format,file_hash,title,md_content,parent_id,doc_kind,
clause_code,clause_part,clause_order,ai_domain,data_origin,
md_status,review_status,conversion_status,preview_status)
VALUES ('md',$1,$2,$3,$4,'clause',$5,$6,$7,$8,$9,'success','approved','none','none') RETURNING id
""", fh, c['title'], c['body'], law, c['code'], c['part'], c['order'],
row['ai_domain'], row['data_origin'] or 'external')
code2id[c['code']] = cid
await conn.execute("INSERT INTO document_tags(doc_id,tag,tag_kind) VALUES ($1,$2,'chapter') ON CONFLICT DO NOTHING", cid, c['part'])
# 조↔조 백링크 (같은 법 내부; 타법 참조는 dangling)
edges = []
for c in arts:
src = code2id[c['code']]
for dst in c['links']:
edges.append((src, dst, code2id.get(dst), None, None, None))
if edges:
await conn.executemany(
"INSERT INTO clause_links(src_doc_id,dst_code,dst_doc_id,anchor,ctx,char_off) VALUES ($1,$2,$3,$4,$5,$6)", edges)
n = await conn.fetchval("SELECT count(*) FROM documents WHERE parent_id=$1 AND doc_kind='clause'", law)
print(f" COMMITTED: {n} 조문 + {len(edges)} 백링크 for law {law}")
return ('committed', n, len(edges))
async def main():
import asyncpg
arg = sys.argv[1]; commit = '--commit' in sys.argv
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
if arg == 'all':
laws = await conn.fetch("SELECT lm.document_id AS id FROM legal_meta lm "
"JOIN documents d ON d.id=lm.document_id "
"WHERE lm.law_doc_kind='primary' AND lm.version_status='current' "
"AND coalesce(d.md_content, d.extracted_text) IS NOT NULL "
"ORDER BY lm.document_id")
agg = {}; tot_art = tot_link = 0; zero = []
for i, r in enumerate(laws):
st, na, nl = await process_one(conn, r['id'], commit, verbose=False)
agg[st] = agg.get(st, 0) + 1
tot_art += na; tot_link += nl
if st == 'noart': zero.append(r['id'])
if commit and (i + 1) % 30 == 0: print(f"{i+1}/{len(laws)} (누적 조 {tot_art})")
print(f"BATCH {'COMMIT' if commit else 'DRY'} laws={len(laws)} status={agg} 총조문={tot_art} 총백링크={tot_link}")
if zero: print(f" 0-조(추출구조 이질) {len(zero)}건: {zero[:20]}")
else:
await process_one(conn, int(arg), commit, verbose=True)
await conn.close()
asyncio.run(main())
+51
View File
@@ -0,0 +1,51 @@
#!/usr/bin/env python3
"""논문 인용그래프 가능성 측정(read-only) — 본문 DOI로 코퍼스내 인용 엣지 추정.
own_doi = 헤더( 2500) DOI / cited = References 이후(또는 전체) DOI. owner 엣지.
"""
import asyncio, os, re, sys
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>)\]\},;]+')
REF_RE = re.compile(r'(references|참고문헌|bibliography|reference\s*list)', re.I)
def norm(d): return d.rstrip('.').lower()
async def main():
import asyncpg
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
rows = await conn.fetch("SELECT id, title, coalesce(md_content, extracted_text) AS txt FROM documents "
"WHERE material_type='paper' AND doc_kind='standard' AND deleted_at IS NULL "
"AND coalesce(md_content, extracted_text) IS NOT NULL")
owner = {} # doi -> paper id (헤더 DOI = 그 논문 소유)
cited = {} # paper id -> set(cited doi)
n_own = n_refsec = 0
for r in rows:
txt = r['txt']
head = txt[:2500]
hdois = [norm(d) for d in DOI_RE.findall(head)]
if hdois:
owner.setdefault(hdois[0], r['id']); n_own += 1
m = REF_RE.search(txt)
body = txt[m.start():] if m else ''
if m: n_refsec += 1
cds = set(norm(d) for d in DOI_RE.findall(body))
if cds: cited[r['id']] = cds
# 엣지: paper -> owner(cited doi)
edges = []
for pid, cds in cited.items():
for d in cds:
o = owner.get(d)
if o and o != pid: edges.append((pid, o, d))
cited_papers = set(e[0] for e in edges)
target_papers = set(e[1] for e in edges)
print(f"papers={len(rows)} 헤더DOI보유={n_own} References보유={n_refsec} owner_map={len(owner)}")
print(f"인용엣지(코퍼스내)={len(edges)} 인용하는논문={len(cited_papers)} 피인용논문={len(target_papers)}")
# 피인용 top
from collections import Counter
top = Counter(e[1] for e in edges).most_common(6)
if top:
idmap = {r['id']: r['title'] for r in rows}
print("피인용 top:")
for pid, c in top: print(f" {c}회 ← {(idmap.get(pid) or '')[:48]}")
await conn.close()
asyncio.run(main())
+39
View File
@@ -0,0 +1,39 @@
#!/usr/bin/env python3
"""OpenAlex 고신뢰 매치율 측정 — References 보유 논문(학술 추정) 표본."""
import asyncio, os, re
def toks(s):
return set(re.findall(r'[a-z0-9]+', (s or '').lower()))
def sim(a, b):
ta, tb = toks(a), toks(b)
if not ta or not tb: return 0.0
return len(ta & tb) / len(ta | tb)
async def main():
import asyncpg, httpx
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
rows = await conn.fetch("SELECT id, title FROM documents WHERE material_type='paper' "
"AND doc_kind='standard' AND deleted_at IS NULL AND title IS NOT NULL "
"AND coalesce(md_content,extracted_text) ~* 'references|참고문헌' "
"ORDER BY id LIMIT 40")
hi = mid = lo = 0; hits = []
async with httpx.AsyncClient(timeout=20) as client:
for r in rows:
title = re.sub(r'\s+', ' ', r['title']).strip()
try:
resp = await client.get("https://api.openalex.org/works",
params={"search": title[:200], "per_page": 1, "mailto": "hyun49196@gmail.com"})
res = (resp.json().get("results") or [])
if not res: lo += 1; continue
s = sim(title, res[0].get("title"))
if s >= 0.6: hi += 1; hits.append((s, title[:40], (res[0].get('title') or '')[:40], res[0].get('cited_by_count'), len(res[0].get('referenced_works') or [])))
elif s >= 0.4: mid += 1
else: lo += 1
except Exception: lo += 1
print(f"표본={len(rows)} 고신뢰(≥0.6)={hi} 중간(0.4~0.6)={mid} 저신뢰/무매치={lo}")
print("고신뢰 매치 샘플:")
for s, a, b, cb, rf in hits[:8]:
print(f" sim={s:.2f} cited={cb} refs={rf} | {a}{b}")
await conn.close()
asyncio.run(main())
+30
View File
@@ -0,0 +1,30 @@
#!/usr/bin/env python3
"""OpenAlex 보강 타당성 테스트 — 소수 논문 제목으로 매칭/메타 확인 (외부 API)."""
import asyncio, os, re
async def main():
import asyncpg, httpx
conn = await asyncpg.connect(os.environ['DATABASE_URL'].replace('+asyncpg', ''))
rows = await conn.fetch("SELECT id, title FROM documents WHERE material_type='paper' "
"AND doc_kind='standard' AND deleted_at IS NULL AND title IS NOT NULL "
"AND length(title) > 15 ORDER BY id LIMIT 6")
async with httpx.AsyncClient(timeout=20) as client:
for r in rows:
title = re.sub(r'\s+', ' ', r['title']).strip()
try:
resp = await client.get("https://api.openalex.org/works",
params={"search": title[:200], "per_page": 1, "mailto": "hyun49196@gmail.com"})
js = resp.json()
res = (js.get("results") or [])
if not res:
print(f"[{r['id']}] NO MATCH | {title[:50]}"); continue
w = res[0]
oid = (w.get("id") or "").split("/")[-1]
print(f"[{r['id']}] {title[:46]}")
print(f" → OA {oid} | {(w.get('title') or '')[:46]} | {w.get('publication_year')} | "
f"cited_by={w.get('cited_by_count')} | refs={len(w.get('referenced_works') or [])} | doi={w.get('doi')}")
except Exception as e:
print(f"[{r['id']}] ERROR {type(e).__name__}: {e}")
await conn.close()
asyncio.run(main())
+25 -2
View File
@@ -59,6 +59,11 @@ MAX_IMAGES_PER_DOC = int(os.getenv("MINERU_MAX_IMAGES_PER_DOC", "200"))
MAX_BYTES_PER_IMAGE = int(os.getenv("MINERU_MAX_BYTES_PER_IMAGE", str(10 * 1024 * 1024)))
MAX_PAGES_HARD = int(os.getenv("MINERU_MAX_PAGES_HARD", "200")) # 1-shot max_pages 안전장치
# self-timeout — 변환/워밍이 vLLM 행으로 _engine_lock 을 영구 점유해 서비스가 wedge 되는 것을 차단.
# (클라이언트 marker_worker 는 300s 로 포기하나 서버측 inflight 는 자동 취소 안 됨 → 서버 자체 상한 필요.)
PARSE_TIMEOUT_S = float(os.getenv("MINERU_PARSE_TIMEOUT_S", "600"))
WARMUP_TIMEOUT_S = float(os.getenv("MINERU_WARMUP_TIMEOUT_S", "1200")) # 최초 모델 다운로드(~2.4GB) 여유
_PRELOAD = os.getenv("MINERU_PRELOAD", "1") != "0"
# ---- 엔진 상태 ---------------------------------------------------------------
@@ -68,6 +73,15 @@ _warmup_error: str | None = None
_engine_lock = asyncio.Lock()
def _is_engine_fatal(exc: BaseException) -> bool:
"""OOM/CUDA 류 = 엔진 상태 오염 가능 → 재워밍 강제 대상(타임아웃은 호출측에서 별도 판정)."""
s = f"{type(exc).__name__} {exc}".lower()
return any(
k in s
for k in ("out of memory", "oom", "cuda", "cublas", "device-side", "illegal memory")
)
async def _run_mineru(pdf_bytes: bytes, lang: str) -> tuple[str, list[dict]]:
"""슬라이스된 PDF bytes → (markdown, 이미지 dict 리스트). **async 엔진 경로.**
@@ -148,7 +162,7 @@ async def _ensure_warmup() -> None:
page.insert_text((72, 72), "MinerU warmup.")
warmup_bytes = doc.tobytes()
doc.close()
await _run_mineru(warmup_bytes, MINERU_LANG)
await asyncio.wait_for(_run_mineru(warmup_bytes, MINERU_LANG), timeout=WARMUP_TIMEOUT_S)
_warmup_done = True
_warmup_error = None
logger.info(f"[mineru-service] warmup done engine_version={_engine_version}")
@@ -274,6 +288,7 @@ def _serialize_images(images: list[dict], src_path: str) -> tuple[list[ConvertIm
@app.post("/convert", response_model=ConvertResponse)
async def convert(req: ConvertRequest):
global _warmup_done
p = _resolve_path(req.file_path)
if p is None or not p.is_file():
raise HTTPException(404, detail={"code": "file_not_found", "message": req.file_path})
@@ -288,10 +303,18 @@ async def convert(req: ConvertRequest):
async with _engine_lock: # 실제 변환 직렬화(단일 GPU)
start = time.monotonic()
try:
md_text, raw_images = await _run_mineru(pdf_bytes, MINERU_LANG)
md_text, raw_images = await asyncio.wait_for(
_run_mineru(pdf_bytes, MINERU_LANG), timeout=PARSE_TIMEOUT_S
)
except HTTPException:
raise
except Exception as exc:
# 타임아웃(엔진 행) 또는 OOM/CUDA 류면 엔진 오염 가능 → 다음 요청이 재워밍하도록 리셋.
# 재워밍까지 실패하면 _ensure_warmup 이 _warmup_error 설정 → /ready 503 → healthcheck
# 재시작으로 escalate(영구 degradation 차단). 일시 OOM 이면 재워밍 성공 후 정상화.
if isinstance(exc, (asyncio.TimeoutError, TimeoutError)) or _is_engine_fatal(exc):
_warmup_done = False
logger.error("[mineru-service] engine reset (timeout/fatal) path=%s: %s", p, exc)
logger.exception(f"[mineru-service] conversion failed path={p}: {exc}")
raise HTTPException(422, detail={"code": "conversion_failed",
"message": f"{type(exc).__name__}: {exc}"}) from exc