23bb5ac9c9
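Only for large PDFs (>=60 pages) that have no ToC or fall short of the gate: an off-card Qwen (MacBook, call_deep_or_defer, StageDeferred-safe) proposes boundaries → the file is split only if the proposal passes the same validation gate (_is_clear_bundle), using _create_children, which is shared with the deterministic path. is_bundle=false or a parse/validation failure = single document (same as today) + logging.
- env PRESEGMENT_LLM_FALLBACK defaults to false → deployed behavior unchanged (LLM not called; verification = unit tests)
- child creation refactored into the shared helper _create_children (single path for deterministic + LLM, identical behavior)
- SegmentationOutput Pydantic + parse_json_response (house pattern) + per-page heading samples (body text is not sent)
- prompt app/prompts/presegment_boundaries.txt + tests/test_presegment_llm.py (14 tests, fitz/DB/LLM mocked)
No direct HTTP, no silent fallback. Activation = flag ON, after verification against a real router fixture.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>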
42 lines
1.8 KiB
Plaintext
You are a document-boundary detector. Output ONLY JSON {is_bundle, segments:[{start_page,end_page,title}]}.

You are given a single PDF that may be a "bundle": several independent logical documents
concatenated into one file (for example: multiple laws, multiple reports, or multiple papers
scanned together). Your job is to decide whether it is a bundle and, if so, where each logical
document starts and ends.

You receive only a compact sample per page: the page number and the first line / heading of that
page (text may be truncated). Use these heading/first-line signals to detect where a new logical
document begins (a new title page, a new cover, a clearly new document title, a restart of
numbering, etc.). You do NOT receive the full text.

Output rules:
- Respond with STRICT JSON only. No prose, no markdown, no code fence.
- Schema:
  {
    "is_bundle": true | false,
    "segments": [
      {"start_page": <int>, "end_page": <int>, "title": "<string or null>"}
    ]
  }
- Page numbers are 1-based and INCLUSIVE. start_page=1 is the first page; end_page equals the last
  page of that segment.
- Segments MUST fully cover every page with NO gaps and NO overlaps:
  - the first segment MUST start at page 1,
  - each next segment MUST start exactly one page after the previous segment's end_page,
  - the last segment MUST end at the final page (page_count).
- Order segments by start_page ascending.
- title = a short title for that logical document if you can infer one from its first page,
  otherwise null.

If the file is NOT a bundle (it is a single logical document), respond:
{"is_bundle": false, "segments": []}

Be conservative: only report is_bundle=true when the heading signals clearly indicate separate
logical documents. When unsure, return is_bundle=false.

page_count: {page_count}

Per-page samples (one per line, "p{n}: {first line}"):
{page_samples}
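
The output rules above (1-based inclusive pages, no gaps, no overlaps, full coverage up to page_count) can be checked mechanically on the model's reply. The following is a minimal stdlib-only sketch, not the repository's code: the commit mentions a Pydantic SegmentationOutput model and a parse_json_response helper whose details are not shown here, so render_prompt and check_segments are hypothetical names. Two assumptions are made and marked in the comments: the placeholders are filled with str.replace (the template contains literal JSON braces, which str.format would try to interpret), and a reply with fewer than two segments is not treated as a bundle.

```python
import json
from typing import Optional


def render_prompt(template: str, page_count: int, first_lines: list[str]) -> str:
    """Fill {page_count} and {page_samples} in the template (hypothetical helper).

    Assumption: str.replace is used instead of str.format, because the
    template also contains literal braces in its JSON schema text.
    """
    samples = "\n".join(
        f"p{n}: {line}" for n, line in enumerate(first_lines, start=1)
    )
    return (
        template
        .replace("{page_count}", str(page_count))
        .replace("{page_samples}", samples)
    )


def check_segments(raw: str, page_count: int) -> Optional[list[dict]]:
    """Return the segments if the reply is a valid bundle, else None.

    None means "treat the file as a single document", which mirrors the
    conservative behavior the prompt asks for.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("is_bundle") is not True:
        return None
    segments = data.get("segments")
    # Assumption: a bundle needs at least two logical documents.
    if not isinstance(segments, list) or len(segments) < 2:
        return None

    expected_start = 1  # the first segment must start at page 1
    for seg in segments:
        if not isinstance(seg, dict):
            return None
        start, end = seg.get("start_page"), seg.get("end_page")
        # Reject non-integers (bool is a subclass of int, so exclude it too).
        if type(start) is not int or type(end) is not int:
            return None
        # Each segment must begin right after the previous one ends.
        if start != expected_start or end < start:
            return None
        expected_start = end + 1

    # The last segment must end exactly at the final page.
    if expected_start != page_count + 1:
        return None
    return segments
```

A reply that fails any of these checks yields None, so a caller can fall back to the single-document path and log the reason rather than split on an unverified proposal.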