hyungi_document_server/tests/test_arxiv_collector_units.py
Claude Code ba943d703a feat(papers): B-3 PR2 - arXiv keyword-filter collector (signal-only, per-run cap)
Plan safety-library-b3-1 PR2 (keyless). First real collector built on the DOI core (PR1).
- Bespoke arXiv API (Atom) collector: cat:{category} AND (abs:keyword), not the whole RSS feed (firehose).
  7 new categories (eess.SY, physics.flu-dyn/comp-ph, math.OC/NA, stat.AP, cs.CE) x pressure-vessel / process-safety keywords.
- signal-only: only abstracts are indexed (embed+chunk); summarize is never enqueued (the Mac mini queue is not touched).
- Has DOI -> extract_meta.paper.doi (holder, partial-unique index). Otherwise dedup by arXiv id.
  Cross-source dedup = find_paper_holder (PR1) + arXiv id file_hash. paper.source_region=INT (jurisdiction stays NULL).
- Per-run insert cap (_RUN_CAP=80): keeps broad collection from flooding the GPU embed queue (adversarial review A, major); the remainder is logged.
- Etiquette: >=3s delay + backoff on 429 + incremental collection via a per-category submittedDate watermark. https required (http returns 301).
- news_sources row with enabled=False + main.py CronTrigger (daily 07:30 KST). __main__ CLI (--bulk/--limit).

Pure parser / query-builder fixture unit tests: 18 passed (frozen real arXiv response; 3 paths: DOI / journal_ref / neither).
Ingestion (run/_ingest_entry) mirrors the news_collector signal-only pattern; to be verified live after deployment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 22:10:25 +00:00

76 lines
2.4 KiB
Python

"""B-3 PR2: pure unit tests for the arXiv parser and query builder (plan safety-library-b3-1).

fixture = frozen real arXiv API response (abs:"pressure vessel", relevance order, 10 entries;
covers 3 paths: has DOI / journal_ref only / neither). run()/ingestion (DB) is verified live in PR2.
"""
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).parent.parent / "app"))

from workers.arxiv_collector import (  # noqa: E402
    build_search_query,
    parse_arxiv_feed,
)

FIX = Path(__file__).parent / "fixtures" / "arxiv_search_pressure_vessel.xml"


def _entries():
    """Parse the fixture; return (total, entries keyed by arxiv_id, entry list)."""
    total, entries = parse_arxiv_feed(FIX.read_text(encoding="utf-8"))
    return total, {e.arxiv_id: e for e in entries}, entries


# ─── Feed level ───
def test_feed_total_and_count():
    total, _, entries = _entries()
    assert total == 89  # fixture totalResults (input for paging)
    assert len(entries) == 10


def test_versionless_ids():
    _, by_id, entries = _entries()
    # arxiv_id is versionless (the version suffix is split out into .version)
    assert all("/" not in e.arxiv_id for e in entries)
    assert "1209.2405" in by_id and by_id["1209.2405"].version == "v1"


# ─── Entry with a DOI ───
def test_entry_with_doi():
    _, by_id, _ = _entries()
    e = by_id["1209.2405"]
    assert e.doi == "10.1063/1.4707088"  # normalize_doi applied (lowercased, normalized)
    assert e.journal_ref is None
    assert e.primary_category == "physics.acc-ph"
    assert e.title.startswith("A Survey of Pressure Vessel")
    assert len(e.summary) > 200  # abstract body
    assert e.published is not None
    assert e.abs_url and "/abs/" in e.abs_url
    assert e.pdf_url and "pdf" in e.pdf_url


# ─── journal_ref only (no DOI): published in a pressure-vessel journal ───
def test_entry_journal_ref_without_doi():
    _, by_id, _ = _entries()
    e = by_id["0804.0261"]
    assert e.doi is None
    assert e.journal_ref and "Pressure Vessel" in e.journal_ref


# ─── Neither (recent preprint): this path also exists ───
def test_entry_neither_doi_nor_journal_ref_exists():
    _, _, entries = _entries()
    assert any(e.doi is None and e.journal_ref is None for e in entries)


# ─── Query builder ───
def test_build_search_query():
    q = build_search_query("eess.SY", ["pressure vessel", "safety"])
    assert q == 'cat:eess.SY AND (abs:"pressure vessel" OR abs:safety)'
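
The tests above pin down two behaviours of workers.arxiv_collector without showing its code: the query string shape and the versionless id split. The following is a minimal sketch of helpers that would satisfy those assertions. The names build_search_query_sketch and split_arxiv_id are hypothetical and are not taken from the module; the quoting rule (quote only multi-word keywords) is inferred from the expected string in test_build_search_query, and the real implementation may differ.

```python
import re
from typing import Optional, Sequence, Tuple


def build_search_query_sketch(category: str, keywords: Sequence[str]) -> str:
    """Hypothetical stand-in for build_search_query (assumed behaviour).

    Restricts to one arXiv category and ORs the keywords over the abstract
    field. Multi-word keywords are quoted so they match as a phrase.
    """
    def term(keyword: str) -> str:
        keyword = keyword.strip()
        return f'abs:"{keyword}"' if " " in keyword else f"abs:{keyword}"

    clause = " OR ".join(term(k) for k in keywords)
    return f"cat:{category} AND ({clause})"


# Assumption: ids arrive either bare ("1209.2405v1") or as an /abs/ URL.
_ID_RE = re.compile(r"(?P<base>.+?)(?P<version>v\d+)?")


def split_arxiv_id(raw: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split an arXiv id into (versionless id, version)."""
    tail = raw.rsplit("/abs/", 1)[-1]
    match = _ID_RE.fullmatch(tail)
    if match is None:  # only reachable for an empty tail
        raise ValueError(f"not an arXiv id: {raw!r}")
    return match.group("base"), match.group("version")
```

For example, split_arxiv_id("http://arxiv.org/abs/1209.2405v1") yields ("1209.2405", "v1"), which matches the arxiv_id and .version values asserted in test_versionless_ids.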