feat(papers): B-3 PR4 - legacy arXiv DOI reconcile + unified arXiv DataCite DOI (keyless)

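plan safety-library-b3-1 PR4. Stamps paper rows that lack paper.doi with the arXiv DataCite DOI so they
enter the partial-unique index → blocks re-ingestion ('same-DOI re-ingestion blocking only').
- doi.py: parse_arxiv_id (body text → arXiv id) + arxiv_doi (10.48550/arxiv.{id}, verified empirically to match the OpenAlex canonical form).
- ★ arXiv DOI unification: arxiv_collector now also assigns arxiv_doi to preprints (no journal DOI) → PR2/PR3/PR4 derive
  the same paper.doi from the same function → cross-source dedup holds (previously preprints had no paper.doi, leaving a PR2↔PR3 dup gap).
- paper_doi_reconcile.py: dedicated worker (separate from dedup_reconcile = file_hash cache; adversarial review B·C major).
  keyless, deterministic (0 OpenAlex calls), in-DB, 0 enqueues (content unchanged). If a DOI holder already exists, the row
  is marked with parent_doi instead (avoids a unique violation). add_job daily 03:50 KST. __main__ CLI.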
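
For reference, a minimal sketch of how the two doi.py helpers could behave. This is not the actual implementation (doi.py itself is not part of this excerpt); the regex `_ARXIV_RE` is a hypothetical pattern chosen only to satisfy the cases exercised in the test diff, and lowercasing the id is an assumption taken from the "(lowercase)" note in those tests.

```python
import re
from typing import Optional

# Hypothetical pattern (assumption, not taken from doi.py):
# new-style ids (YYMM.NNNNN) or old-style ids (archive/NNNNNNN),
# followed by an optional version suffix such as "v2" that is dropped.
_ARXIV_RE = re.compile(
    r"arxiv:\s*((?:\d{4}\.\d{4,5})|(?:[a-z\-]+(?:\.[a-z]{2})?/\d{7}))(?:v\d+)?",
    re.IGNORECASE,
)


def parse_arxiv_id(text: Optional[str]) -> Optional[str]:
    """Return the version-less arXiv id found in text, or None."""
    if not text:
        return None
    match = _ARXIV_RE.search(text)
    return match.group(1) if match else None


def arxiv_doi(arxiv_id: Optional[str]) -> Optional[str]:
    """Build the arXiv DataCite DOI in the lowercase form 10.48550/arxiv.{id}."""
    if not arxiv_id:
        return None
    return f"10.48550/arxiv.{arxiv_id.lower()}"
```

Because the collector and the reconcile worker would both go through `arxiv_doi(parse_arxiv_id(...))`, the same preprint yields the same paper.doi regardless of which path ingested it.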

Unit: 28 passed (+parse_arxiv_id, arxiv_doi). Live PASS (prod, running fastapi untouched):
197 legacy rows stamped with arXiv DataCite DOI, 2 ASME rows skipped, 0 pre-existing duplicates / dedup invariant 206, distinct 206 (no index violation) /
paper summarize active 0 (signal-only). Idempotent.
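
The reconcile behaviour and the dedup invariant reported above can be modelled roughly as follows. This is an in-memory sketch under stated assumptions, not the worker in paper_doi_reconcile.py: the row shape (`{"text": ..., "paper": {...}}`), `derive_doi`, `reconcile_rows`, and `dedup_invariant_holds` are all hypothetical names, and the real worker operates on database rows.

```python
import re
from typing import Optional

# Simplified, hypothetical id pattern (new-style ids only) for this sketch.
_ID_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def derive_doi(text: Optional[str]) -> Optional[str]:
    """Deterministically derive an arXiv DataCite DOI from body text (no network calls)."""
    match = _ID_RE.search(text or "")
    return f"10.48550/arxiv.{match.group(1)}" if match else None


def reconcile_rows(rows: list) -> dict:
    """Stamp rows lacking paper.doi; mark parent_doi when the DOI already has a holder."""
    holders = {r["paper"]["doi"] for r in rows if r.get("paper", {}).get("doi")}
    stats = {"stamped": 0, "parent_marked": 0, "skipped": 0}
    for row in rows:
        paper = row.setdefault("paper", {})
        if paper.get("doi") or paper.get("parent_doi"):
            continue  # already reconciled, which keeps repeated runs idempotent
        doi = derive_doi(row.get("text"))
        if doi is None:
            stats["skipped"] += 1  # no arXiv id in the text (e.g. a non-arXiv source)
        elif doi in holders:
            paper["parent_doi"] = doi  # avoid violating the unique index
            stats["parent_marked"] += 1
        else:
            paper["doi"] = doi
            holders.add(doi)
            stats["stamped"] += 1
    return stats


def dedup_invariant_holds(rows: list) -> bool:
    """True when the number of paper.doi values equals the number of distinct values."""
    dois = [r["paper"]["doi"] for r in rows if r.get("paper", {}).get("doi")]
    return len(dois) == len(set(dois))
```

A second call to `reconcile_rows` on the same rows stamps and marks nothing further, which is the idempotence property the live run reports.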

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude Code
2026-06-13 22:54:24 +00:00
parent fdabca2a2f
commit 244d526ae2
5 changed files with 132 additions and 6 deletions
@@ -9,8 +9,10 @@ from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "app"))

from services.papers.doi import (  # noqa: E402
    arxiv_doi,
    normalize_doi,
    paper_doi_hash,
    parse_arxiv_id,
    read_paper_doi,
    with_paper_doi,
    with_parent_doi,
@@ -109,3 +111,21 @@ def test_read_paper_doi():
    assert read_paper_doi(None) is None
    assert read_paper_doi({"paper": {"parent_doi": "10.1/p"}}) is None  # a child has no doi
    assert read_paper_doi({"paper": {}}) is None


# ─── PR4: arXiv id parsing + arXiv DataCite DOI (unified cross-source dedup key) ───

def test_parse_arxiv_id():
    assert parse_arxiv_id("Title arXiv:2606.10236v1 Announce Type: new Abstract") == "2606.10236"
    assert parse_arxiv_id("see arXiv:2601.02852 for details") == "2601.02852"
    assert parse_arxiv_id("arXiv:cond-mat/0703470v2") == "cond-mat/0703470"
    assert parse_arxiv_id("no arxiv here") is None
    assert parse_arxiv_id(None) is None


def test_arxiv_doi_canonical():
    # Verified empirically to match the OpenAlex canonical form: 10.48550/arxiv.{id} (lowercase)
    assert arxiv_doi("2606.10236") == "10.48550/arxiv.2606.10236"
    assert arxiv_doi(None) is None
    # The collector and reconcile use the same function → same paper.doi (cross-source dedup holds)
    assert arxiv_doi(parse_arxiv_id("x arXiv:2606.10236v1 y")) == "10.48550/arxiv.2606.10236"