feat(documents): hier section char_start offset (Path B): md_content jumps via builder offset
Plan ds-outline-anchor-b5 (g1~g6 code). Core ASME/statute windowed sections currently have a 0% jump rate;
this switches them to 100% deterministic jumps using a server-computed char_start (builder offset).
- g1 migration 318: document_chunks.char_start INTEGER NULL (single statement, idempotent)
- g2 builder: emits char_start by mirroring the FE line/offset model (split('\n') + UTF-16 code units + code-fence skip).
  window-child=NULL, split-parent=heading offset, preamble=NULL, CR not stripped, NFC=telemetry.
  node.text is preserved (the line model is hash-neutral), so hash_stable docs are preserved. 7 unit tests.
- g3 persist + backfill hybrid:
  * persist: INSERT char_start
  * update-char-start (g3-tU): non-destructive for hash_stable docs: 100% jump-target VERIFY (NEW-1) +
    position-aligned PK UPDATE (NEW-2); docs that fall short are DEMOTEd and join re-decompose (NEW-4)
  * --reprocess (g3-t2): md_content as the source (g0-t1) + jump-target-set completion marker (B1) + B_jumptarget>=1 (B3);
    --doc is required, otherwise REFUSE. Self-heal sweep (g3-t3).
- g4 /sections: char_start in the inner + outer SELECT + split-parent exposure (is_leaf OR %_split)
- g5 FE: resolveAnchorMap (BE-first, NEW-5 jump-target-candidate-scoped fallback, C1 OR-exclude),
  per-render-site basis guard (C3), endsWith('_split') correction + collapseWindows split-parent absorption (C2).
  25 unit tests (including NEW-5/B4/C1/C2).
- g6 hier_outline_quality_gate.py: read-only g-measure (verdict/B_jumptarget/hash_stable/dup/fence)
Deployment (g7: --no-deps, snapshot, UPDATE-only 32 + re-decompose 230∪demote, accuracy gate) is a separate ops step.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
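
The g2 bullet describes the offset model only in shorthand (split('\n'), UTF-16 code units, code-fence skip). The following is a minimal sketch of that idea, not the builder's actual code: `utf16_len`, `heading_offsets`, and the two regexes are hypothetical names, and the fence handling is a simplified toggle that does not track fence length or character.

```python
import re

TICKS = "`" * 3  # three backticks, built indirectly to keep this example fence-safe
# Assumption: a fence line starts with up to 3 spaces followed by backticks or tildes.
FENCE = re.compile(r"^\s{0,3}(" + re.escape(TICKS) + r"|~~~)")
# Assumption: only ATX headings ("#" to "######") are considered.
HEADING = re.compile(r"^#{1,6}\s+\S")


def utf16_len(s: str) -> int:
    """Length of s in UTF-16 code units (what JavaScript's String.length reports)."""
    return len(s.encode("utf-16-le")) // 2


def heading_offsets(md: str) -> list[tuple[str, int]]:
    """Return (heading line, UTF-16 offset of its first character) for headings outside fences."""
    out = []
    offset = 0
    in_fence = False
    for line in md.split("\n"):  # CR is deliberately not stripped
        if FENCE.match(line):
            in_fence = not in_fence  # simplified toggle (assumption)
        elif not in_fence and HEADING.match(line):
            out.append((line, offset))
        offset += utf16_len(line) + 1  # +1 for the "\n" consumed by split
    return out


if __name__ == "__main__":
    sample = "# A\n\U0001F600 text\n" + TICKS + "\n# not a heading\n" + TICKS + "\n## B"
    print(heading_offsets(sample))
```

For the sample string, the second heading lands at UTF-16 offset 36, whereas counting Python code points would give 35, because the emoji occupies two code units. That one-unit drift is the kind of mismatch the server-side offset has to avoid if the front end splices by `String.length` semantics.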
@@ -0,0 +1,196 @@
+"""hier outline keep-better gate + g-measure engine (plan ds-outline-anchor-b5 g6-t1 / gm-t1).
+
+READ-ONLY dry-run. For each doc, compare:
+  (A) the currently stored hier section titles (source_type='hier_section', char_start IS NULL = derived from extracted_text)
+  (B) the section titles from build_hier_tree(md_content) (= the new g2 builder: split('\n')+UTF-16+fence skip)
+and compute:
+  - verdict {B_better, A_better, equivalent} (+ junk-heading detection → protects A_better)
+  - B_jumptarget_count (number of jump-target nodes after build): input to the B3 gate
+  - hash_stable decision: routing between UPDATE-only (g3-tU) and re-decompose (g3-t2):
+      * hash_stable_strict = build(md) reproduces the stored hier hashes position-by-position at 100%
+        (= exactly the set the runtime g3-tU handles as UPDATE-only; no demote)
+      * hash_stable_99 = >=99% reproduced (the original MEASURE2 classification criterion, for comparison)
+  - dup_title_count / has_fence (measure3 budget note: docs containing a fence may flip to hash_changed under the new builder)
+  - REFINED PASS = (verdict B>=A) AND (B_jumptarget>=1)
+
+★ gm-t1 re-check (the only measurement still outstanding for this build): run once after the g2 builder is coded → confirm that
+  the hash_changed (=re-decompose) count among REFINED PASS docs is ~230 (code-fence skip may flip ≤2 of the 32, so up to ~232 is accepted).
+
+Run (GPU server, container):
+  docker compose exec -T fastapi python /app/scripts/hier_outline_quality_gate.py run
+  docker compose exec -T fastapi python /app/scripts/hier_outline_quality_gate.py run --json /tmp/measure.json
+  docker compose exec -T fastapi python /app/scripts/hier_outline_quality_gate.py run --doc 5140,5209,5165 # core spot-check
+"""
+
+import argparse
+import asyncio
+import json
+import os
+import re
+import sys
+from collections import Counter
+
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+
+from sqlalchemy import text
+from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
+
+from services.hier_decomp.builder import build_hier_tree
+
+
+def _is_jump_target(node) -> bool:
+    """jump-target = non-window leaf OR %_split parent, and it has a title (matches resolveAnchorMap / _JUMP_TARGET_PRED)."""
+    structural = (node.is_leaf and node.node_type != "window") or bool(
+        node.node_type and node.node_type.endswith("_split"))
+    return structural and bool(node.section_title)
+
+
+# cover/TOC org-name junk detection (g6-t1 high-recall): company-name suffix + almost-all-uppercase.
+_JUNK_ORG = re.compile(r"\b(INC\.?|LLC|L\.L\.C|CORP\.?|CO\.,?\s*LTD|CONSULTING|COMPANY|LIMITED|LTD\.?)\b", re.I)
+_FENCE_ANY = re.compile(r"(?m)^\s{0,3}(```|~~~)")
+
+
+def _looks_junk(title: str | None) -> bool:
+    if not title:
+        return False
+    if _JUNK_ORG.search(title):
+        return True
+    letters = [c for c in title if c.isalpha()]
+    if len(letters) >= 6 and sum(1 for c in letters if c.isupper()) / len(letters) >= 0.85:
+        return True
+    return False
+
+
+def _make_engine():
+    return create_async_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)
+
+
+async def _measure_doc(session, doc_id):
+    md = await session.scalar(text("SELECT md_content FROM documents WHERE id=:d"), {"d": doc_id})
+    stored = (await session.execute(text("""
+        SELECT chunk_index, chunk_content_hash, node_type, is_leaf, section_title, char_start
+        FROM document_chunks WHERE doc_id=:d AND source_type='hier_section'
+        ORDER BY chunk_index"""), {"d": doc_id})).mappings().all()
+    if not stored:
+        return None
+    res = {"doc_id": doc_id, "n_stored": len(stored)}
+    if not md or not md.strip():
+        res.update({"md_null": True, "verdict": "A_better", "b_jumptarget": 0,
+                    "hash_stable_strict": False, "refined_pass": False})
+        return res
+
+    nodes = build_hier_tree(md)
+    jt = [n for n in nodes if _is_jump_target(n)]
+    titles = [n.section_title for n in jt]
+    res["n_build"] = len(nodes)
+    res["b_jumptarget"] = len(jt)
+    res["dup_title"] = len(titles) - len(set(titles))
+    res["has_fence"] = bool(_FENCE_ANY.search(md))
+    res["len_md"] = len(md)
+
+    # Hash comparison (position-aligned, per the runtime g3-tU criterion).
+    if len(nodes) == len(stored):
+        mism = sum(1 for n, s in zip(nodes, stored)
+                   if n.chunk_content_hash != s["chunk_content_hash"])
+        frac = (len(stored) - mism) / len(stored)
+        res["hash_match_frac"] = round(frac, 4)
+        res["hash_stable_strict"] = (mism == 0)
+        res["hash_stable_99"] = (frac >= 0.99)
+    else:
+        res["hash_match_frac"] = 0.0
+        res["hash_stable_strict"] = False
+        res["hash_stable_99"] = False
+
+    stored_titles = {s["section_title"] for s in stored if s["section_title"]}
+    res["junk_b"] = any(_looks_junk(n.section_title) and n.section_title not in stored_titles for n in nodes)
+
+    # verdict heuristic (high-recall junk protection + absent-structure → A_better).
+    # MEASURE2 already pinned the canonical distribution; this verdict is for reproduction/audit. Ambiguous cases (notes:ambiguous) do not block PASS.
+    n_a = sum(1 for s in stored if s["is_leaf"])
+    n_b = res["b_jumptarget"]
+    if n_b == 0:
+        res["verdict"] = "A_better" # B has no outline (empty jump-target set)
+    elif res["junk_b"]:
+        res["verdict"] = "A_better" # B introduces cover junk
+    elif n_b >= max(1, n_a * 0.7):
+        res["verdict"] = "B_better" if n_b > n_a else "equivalent"
+    else:
+        res["verdict"] = "A_better" # B lost structure (5209 absent-class)
+        res["notes"] = "absent_or_degraded"
+
+    res["refined_pass"] = res["verdict"] in ("B_better", "equivalent") and n_b >= 1
+    return res
+
+
+async def cmd_run(args):
+    doc_ids = [int(x) for x in args.doc.split(",") if x.strip()] if args.doc else None
+    engine = _make_engine()
+    sm = async_sessionmaker(engine, expire_on_commit=False)
+    try:
+        async with sm() as session:
+            if doc_ids is None:
+                doc_ids = [r[0] for r in (await session.execute(text(
+                    "SELECT DISTINCT doc_id FROM document_chunks WHERE source_type='hier_section' ORDER BY doc_id"))).all()]
+            results = []
+            for d in doc_ids:
+                r = await _measure_doc(session, d)
+                if r is not None:
+                    results.append(r)
+    finally:
+        await engine.dispose()
+
+    total = len(results)
+    md_null = [r for r in results if r.get("md_null")]
+    measured = [r for r in results if not r.get("md_null")]
+    passes = [r for r in measured if r.get("refined_pass")]
+    pass_jt0 = [r for r in measured if r["verdict"] in ("B_better", "equivalent") and r["b_jumptarget"] == 0]
+    hash_stable = [r for r in passes if r.get("hash_stable_strict")]
+    hash_stable_99 = [r for r in passes if r.get("hash_stable_99")]
+    hash_changed = [r for r in passes if not r.get("hash_stable_strict")]
+    verdict_dist = Counter(r["verdict"] for r in measured)
+    dup_among_stable = [r for r in hash_stable if r.get("dup_title", 0) > 0]
+    fence_among_stable = [r for r in hash_stable if r.get("has_fence")]
+
+    print("=" * 64)
+    print(f"hier docs measured: {total} (md_null {len(md_null)}, measured {len(measured)})")
+    print(f"verdict distribution: {dict(verdict_dist)}")
+    print(f"B_jumptarget==0 (PASS verdict but empty jump-target set, B3 HOLD): {len(pass_jt0)}")
+    print("-" * 64)
+    print(f"REFINED PASS = (verdict B>=A) AND (B_jumptarget>=1): {len(passes)}")
+    print(f" ├─ hash_stable (strict 100% positional reproduction = g3-tU UPDATE-only): {len(hash_stable)}")
+    print(f" │ dup_title>0: {len(dup_among_stable)} / has_fence: {len(fence_among_stable)}")
+    print(f" │ (for reference) hash_stable_99 (original MEASURE2 criterion): {len(hash_stable_99)}")
+    print(f" └─ hash_changed (re-decompose targets, g3-t2 --reprocess): {len(hash_changed)} ← ★ the '230' re-check figure")
+    print("-" * 64)
+    print(f" re-decompose --doc(B_jumptarget>=1) = {','.join(str(r['doc_id']) for r in hash_changed) or '(none)'}")
+    print(f" UPDATE-only --doc(hash_stable) = {','.join(str(r['doc_id']) for r in hash_stable) or '(none)'}")
+    if md_null:
+        print(f" md_null(suspect, V4): {[r['doc_id'] for r in md_null]}")
+    print("=" * 64)
+    print("NOTE: '230' is the hash_changed PASS count. With code-fence skip, the fence-bearing docs among the 32 hash_stable (measure3=2) "
+          "may flip to hash_changed → 230~232 is accepted (NEW-3 budget-only; correctness is guaranteed by the g3-tU runtime 100% VERIFY).")
+
+    if args.json:
+        with open(args.json, "w") as f:
+            json.dump({"summary": {
+                "total": total, "measured": len(measured), "refined_pass": len(passes),
+                "hash_stable": len(hash_stable), "hash_changed": len(hash_changed),
+                "b_jumptarget_0": len(pass_jt0), "md_null": [r["doc_id"] for r in md_null],
+                "hash_changed_doc_ids": [r["doc_id"] for r in hash_changed],
+                "hash_stable_doc_ids": [r["doc_id"] for r in hash_stable],
+            }, "docs": results}, f, ensure_ascii=False, indent=2)
+        print(f"[json] wrote {args.json} ({len(results)} docs)")
+
+
+def main():
+    ap = argparse.ArgumentParser(description="hier outline keep-better gate + g-measure (read-only)")
+    sub = ap.add_subparsers(dest="cmd", required=True)
+    p = sub.add_parser("run", help="measure all docs (or --doc) + print the distribution")
+    p.add_argument("--doc", default=None, help="comma-sep doc ids (unspecified = all hier docs)")
+    p.add_argument("--json", default=None, help="path for the per-doc result JSON dump")
+    args = ap.parse_args()
+    asyncio.run({"run": cmd_run}[args.cmd](args))
+
+
+if __name__ == "__main__":
+    main()
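
The verdict rule inside `_measure_doc` is easier to reason about with concrete numbers. The sketch below restates it as a pure function so the thresholds can be tried in isolation; `verdict` and `refined_pass` are hypothetical helper names introduced here for illustration, and the counts in the examples are made up.

```python
def verdict(n_a: int, n_b: int, junk_b: bool) -> str:
    """Restatement of the keep-better heuristic.

    n_a    = number of stored leaf rows (outline A)
    n_b    = number of jump-target nodes from the new build (outline B)
    junk_b = True if B introduces a junk-looking title that A did not have
    """
    if n_b == 0:
        return "A_better"          # B has no outline at all
    if junk_b:
        return "A_better"          # B introduces cover junk
    if n_b >= max(1, n_a * 0.7):   # B keeps at least ~70% of A's structure
        return "B_better" if n_b > n_a else "equivalent"
    return "A_better"              # B lost too much structure


def refined_pass(v: str, n_b: int) -> bool:
    """REFINED PASS = (verdict B>=A) AND (B_jumptarget>=1)."""
    return v in ("B_better", "equivalent") and n_b >= 1


if __name__ == "__main__":
    for n_a, n_b, junk in [(10, 12, False), (10, 8, False), (10, 6, False), (10, 12, True)]:
        v = verdict(n_a, n_b, junk)
        print(n_a, n_b, junk, v, refined_pass(v, n_b))
```

With 10 stored leaves, a build that yields 8 jump targets is still treated as equivalent, while 6 falls below the 70% threshold and is held back as `A_better`; a junk title overrides the count entirely.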
@@ -29,6 +29,7 @@ from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine
 
 from ai.client import AIClient, parse_json_response, strip_thinking
 from core.config import settings
+from services.hier_decomp.builder import build_hier_tree
 from services.hier_decomp.persist import persist_hier_tree
 from services.search.llm_gate import Priority, acquire_mlx_gate
 
@@ -42,27 +43,48 @@ DOC_MIN_CHARS = 4000 # lower bound on doc size for which hier decomposition is meaningful (STRUCTUR
 BUFFER_MIN = 10 # stop safely this many minutes before the deadline
 
 
-def _candidate_sql(allowlist, doc_ids=None):
-    """With an allowlist, only those domains; otherwise everything except EXCLUDE_DOMAINS (news).
-    With explicit doc_ids = only those docs (bypasses the DOC_MIN_CHARS size gate + domain filter;
-    used to correct eval coverage for small structured documents such as statutes. The NOT EXISTS hier idempotency guard is kept).
-    Smallest docs first = maximizes the number of completed docs + keeps a single mega-doc from monopolizing the budget."""
+# jump-target = non-window leaf OR %_split parent (B1/B3 completion marker + B_jumptarget denominator, plan g3-t2).
+# Only this set receives char_start (window-child/preamble are NULL by design).
+_JUMP_TARGET_PRED = r"((c.is_leaf AND c.node_type IS DISTINCT FROM 'window') OR c.node_type LIKE '%\_split' ESCAPE '\')"
+
+
+def _candidate_sql(allowlist, doc_ids=None, reprocess=False):
+    """body = d.md_content (g0-t1: md_content is permanently fixed as the hier source; extracted_text is retired. Because char_start is
+    an md_content offset and must match the FE splice basis, the decomposition source must also be md_content [F1]).
+
+    reprocess=False (additive): newly decompose only docs that have no hier yet (NOT EXISTS hier_section, idempotent).
+    reprocess=True (re-decompose): re-decompose docs that have hier but whose jump-target char_start is not yet filled.
+      [B1] completion marker = a jump-target row with char_start NOT NULL exists (= one re-decompose fills them all atomically);
+           window-child/preamble are NULL by design, which avoids the infinite trap of an 'all-leaf NOT NULL' marker.
+      [B3] for a doc with an empty jump-target set (B_jumptarget==0), NOT EXISTS is vacuously TRUE → permanent re-selection trap →
+           the caller blocks this by restricting --doc to REFINED PASS (B_jumptarget>=1) (--reprocess requires --doc, else REFUSE).
+    With explicit doc_ids the size gate is bypassed. Smallest docs first = maximizes the number of completed docs."""
     if doc_ids:
         cond, gate = "d.id = ANY(:doc_ids)", "" # explicit docs = size gate bypassed
     else:
         cond = ("lower(split_part(coalesce(d.ai_domain,''), '/', 1)) = ANY(:domains)"
                 if allowlist else
                 "lower(split_part(coalesce(d.ai_domain,''), '/', 1)) <> ALL(:exclude)")
-        gate = "AND length(d.extracted_text) > :minchars"
+        gate = "AND length(d.md_content) > :minchars"
+    if reprocess:
+        marker = f"""
+          AND EXISTS (SELECT 1 FROM document_chunks dc
+                      WHERE dc.doc_id = d.id AND dc.source_type = 'hier_section')
+          AND NOT EXISTS (SELECT 1 FROM document_chunks c
+                          WHERE c.doc_id = d.id AND c.source_type = 'hier_section'
+                            AND c.char_start IS NOT NULL AND {_JUMP_TARGET_PRED})"""
+    else:
+        marker = """
+          AND NOT EXISTS (SELECT 1 FROM document_chunks dc
+                          WHERE dc.doc_id = d.id AND dc.source_type = 'hier_section')"""
     return text(f"""
-        SELECT d.id AS doc_id, d.extracted_text AS body, d.ai_domain AS ai_domain
+        SELECT d.id AS doc_id, d.md_content AS body, d.ai_domain AS ai_domain
         FROM documents d
-        WHERE d.extracted_text IS NOT NULL
+        WHERE d.md_content IS NOT NULL AND length(d.md_content) > 0
           {gate}
           AND {cond}
-          AND NOT EXISTS (SELECT 1 FROM document_chunks dc
-                          WHERE dc.doc_id = d.id AND dc.source_type = 'hier_section')
-        ORDER BY length(d.extracted_text) ASC
+          {marker}
+        ORDER BY length(d.md_content) ASC
     """)
 
 
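
The B1 completion marker and the B3 trap described in the `_candidate_sql` docstring can be reproduced without a database. The sketch below is an in-memory analogue of the `reprocess=True` predicate, under the assumption that rows carry only the three fields shown; `Chunk`, `is_jump_target`, and `needs_reprocess` are hypothetical names for illustration, not part of the script.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    is_leaf: bool
    node_type: Optional[str]
    char_start: Optional[int]


def is_jump_target(c: Chunk) -> bool:
    """In-memory analogue of _JUMP_TARGET_PRED: non-window leaf OR *_split parent."""
    return (c.is_leaf and c.node_type != "window") or bool(
        c.node_type and c.node_type.endswith("_split"))


def needs_reprocess(hier_rows: list[Chunk]) -> bool:
    """Analogue of the reprocess marker: hier rows exist AND no jump-target row has char_start."""
    return bool(hier_rows) and not any(
        is_jump_target(c) and c.char_start is not None for c in hier_rows)


if __name__ == "__main__":
    pending = [Chunk(True, "section", None), Chunk(True, "window", None)]
    done = [Chunk(True, "section", 120), Chunk(True, "window", None)]
    window_only = [Chunk(True, "window", None), Chunk(True, "window", None)]
    print(needs_reprocess(pending), needs_reprocess(done), needs_reprocess(window_only))
```

The `window_only` case is the B3 trap: it has no jump target, so no run can ever make the marker true, and the doc would be selected again on every pass. That is why the script refuses `--reprocess` without an explicit `--doc` list restricted to REFINED PASS docs.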
@@ -77,10 +99,11 @@ def _candidate_params(allowlist, doc_ids=None):
     return p
 
 
-def _scope_label(allowlist, doc_ids=None):
+def _scope_label(allowlist, doc_ids=None, reprocess=False):
+    tag = "RE-DECOMPOSE" if reprocess else "additive"
     if doc_ids:
-        return f"doc-list={len(doc_ids)} docs (size gate bypassed)"
-    return f"allowlist={allowlist}" if allowlist else f"all-except={EXCLUDE_DOMAINS}"
+        return f"doc-list={len(doc_ids)} docs (size gate bypassed, {tag})"
+    return (f"allowlist={allowlist}" if allowlist else f"all-except={EXCLUDE_DOMAINS}") + f" ({tag})"
 
 # Idempotent leaf selection (on re-run, leaves that were already analyzed are excluded)
 LEAF_SQL = text("""
@@ -177,14 +200,19 @@ def _parse_doc_ids(args):
 async def cmd_dry_run(args):
     allowlist = args.domains.split(",") if args.domains else None
     doc_ids = _parse_doc_ids(args)
+    reprocess = getattr(args, "reprocess", False)
+    if reprocess and not doc_ids:
+        print("REFUSE: --reprocess requires --doc <list> (blocks the B3 empty jump-target trap; REFINED PASS list only)")
+        sys.exit(2)
     engine = _make_engine()
     sm = async_sessionmaker(engine, expire_on_commit=False)
     async with sm() as session:
-        rows = (await session.execute(_candidate_sql(allowlist, doc_ids),
+        rows = (await session.execute(_candidate_sql(allowlist, doc_ids, reprocess),
                                       _candidate_params(allowlist, doc_ids))).mappings().all()
     await engine.dispose()
     gate_lbl = "doc-list" if doc_ids else f">{DOC_MIN_CHARS} chars"
-    print(f"[dry-run] candidate docs {len(rows)} ({_scope_label(allowlist, doc_ids)}, {gate_lbl}, not yet decomposed)")
+    state_lbl = "re-decompose pending (jump-target char_start missing)" if reprocess else "not yet decomposed"
+    print(f"[dry-run] candidate docs {len(rows)} ({_scope_label(allowlist, doc_ids, reprocess)}, {gate_lbl}, {state_lbl})")
     if rows:
         lens = [len(r["body"]) for r in rows]
         print(f" body length: min={min(lens)} p50={int(statistics.median(lens))} max={max(lens)} total={sum(lens):,}")
@@ -196,11 +224,16 @@ async def cmd_dry_run(args):
 async def cmd_run(args):
     allowlist = args.domains.split(",") if args.domains else None
     doc_ids = _parse_doc_ids(args)
+    reprocess = getattr(args, "reprocess", False)
+    if reprocess and not doc_ids:
+        _log("REFUSE: --reprocess requires --doc <list> (blocks the B3 empty jump-target trap; REFINED PASS list only)")
+        sys.exit(2)
     skip_analysis = getattr(args, "skip_analysis", False)
     deadline = _compute_deadline(args.deadline)
     stop_at = (deadline - timedelta(minutes=BUFFER_MIN)).timestamp()
     _log(f"deadline={deadline:%m-%d %H:%M} (buffer {BUFFER_MIN}m → stop_at={datetime.fromtimestamp(stop_at):%H:%M}) "
-         f"{_scope_label(allowlist, doc_ids)}{' [SKIP-ANALYSIS: decompose+embed only]' if skip_analysis else ''}")
+         f"{_scope_label(allowlist, doc_ids, reprocess)}{' [SKIP-ANALYSIS: decompose+embed only]' if skip_analysis else ''}"
+         f"{' [RE-DECOMPOSE: existing hier DELETE→CASCADE chunk_section_analysis→re-INSERT; snapshot required first]' if reprocess else ''}")
 
     engine = _make_engine()
     sm = async_sessionmaker(engine, expire_on_commit=False)
@@ -219,7 +252,7 @@ async def cmd_run(args):
     run_start = time.time()
     try:
         async with sm() as session:
-            cands = (await session.execute(_candidate_sql(allowlist, doc_ids),
+            cands = (await session.execute(_candidate_sql(allowlist, doc_ids, reprocess),
                                            _candidate_params(allowlist, doc_ids))).mappings().all()
         _log(f"selected {len(cands)} candidate docs. starting.")
 
@@ -268,6 +301,101 @@ async def cmd_run(args):
         d = Counter(all_types)
         _log(f" section_type: {dict(d.most_common())} other={d.get('other',0)/len(all_types):.1%}")
 
+    # [g3-t3/g3-t4] post-run sweep: count the unanalyzed leaves remaining among the processed docs (detects half-done state/stalls).
+    # GOAL(jump=char_start)/rail-summary(re-analyze) DECOUPLE: the remainder is absorbed by the next run via the idempotent LEAF_SQL.
+    if doc_ids:
+        try:
+            async with sm() as session:
+                pending = (await session.execute(text(f"""
+                    SELECT dc.doc_id, count(*) AS unanalyzed
+                    FROM document_chunks dc
+                    WHERE dc.doc_id = ANY(:ids) AND dc.source_type='hier_section' AND dc.is_leaf=true
+                      AND NOT EXISTS (SELECT 1 FROM chunk_section_analysis a
+                                      WHERE a.chunk_id = dc.id AND a.prompt_version = :pv
+                                        AND a.source_content_hash = dc.chunk_content_hash)
+                    GROUP BY dc.doc_id ORDER BY unanalyzed DESC"""),
+                    {"ids": doc_ids, "pv": PROMPT_VERSION})).mappings().all()
+            if pending:
+                tot = sum(r["unanalyzed"] for r in pending)
+                _log(f" [sweep] unanalyzed leaves remaining: {tot} (docs {len(pending)}); the next run continues the analysis (idempotent). "
+                     f"top: {[(r['doc_id'], r['unanalyzed']) for r in pending[:5]]}")
+            else:
+                _log(" [sweep] unanalyzed leaves remaining 0; analysis converged.")
+        except Exception as exc:
+            _log(f" [sweep] remainder count failed (harmless): {type(exc).__name__}")
+
+
+def _is_jump_target(node) -> bool:
+    """jump-target = non-window leaf OR %_split parent (decided on the builder HierNode; matches _JUMP_TARGET_PRED)."""
+    return ((node.is_leaf and node.node_type != "window")
+            or bool(node.node_type and node.node_type.endswith("_split")))
+
+
+async def cmd_update_char_start(args):
+    """[g3-tU] Non-destructive char_start UPDATE, for hash_stable docs only.
+
+    Per doc: build(md_content) → align with the stored hier rows position-by-position (chunk_index order) →
+      [NEW-1] VERIFY a 100% hash match across every jump-target (ALL-OR-NOTHING). A single mismatched position → DEMOTE.
+      [NEW-2] do not filter by hash in WHERE (avoids collisions between sections with identical bodies); UPDATE by the PK (id) of the stored row at that position.
+    Passing doc: UPDATE document_chunks SET char_start (zero DELETE/CASCADE/embed/analyze, reversible).
+    Failing doc: emitted to the DEMOTE-LIST → UNIONed into the re-decompose batch (NEW-4). DEMOTE_DOC_IDS= is printed as the last line of stdout.
+    """
+    doc_ids = _parse_doc_ids(args)
+    if not doc_ids:
+        _log("REFUSE: update-char-start requires --doc <list> (hash_stable 32 = gm-t1 output)")
+        sys.exit(2)
+    engine = _make_engine()
+    sm = async_sessionmaker(engine, expire_on_commit=False)
+    updated, demoted, noop = [], [], []
+    try:
+        for doc_id in doc_ids:
+            async with sm() as session:
+                md = await session.scalar(text("SELECT md_content FROM documents WHERE id=:d"), {"d": doc_id})
+                if not md or not md.strip():
+                    noop.append(doc_id)
+                    _log(f" doc={doc_id} md_content missing → no-op (suspect, V4)")
+                    continue
+                nodes = build_hier_tree(md)
+                stored = (await session.execute(text("""
+                    SELECT id, chunk_index, chunk_content_hash, node_type, is_leaf
+                    FROM document_chunks
+                    WHERE doc_id=:d AND source_type='hier_section'
+                    ORDER BY chunk_index"""), {"d": doc_id})).mappings().all()
+                # [NEW-2] position alignment: build node[i] ↔ stored[i] (chunk_index = base + idx, so the order is identical).
+                # A different node count means the structure changed = hash_changed → DEMOTE.
+                if len(nodes) != len(stored):
+                    demoted.append(doc_id)
+                    _log(f" doc={doc_id} node count build {len(nodes)} ≠ stored {len(stored)} → DEMOTE (re-decompose)")
+                    continue
+                # [NEW-1] VERIFY the hash match at every position (position alignment also validates ordering).
+                # Any mismatched position → DEMOTE (even a 1% jump-target miss triggers a whole-doc fallback regression, hence 100%).
+                mismatch = next((i for i, (nd, sr) in enumerate(zip(nodes, stored))
+                                 if nd.chunk_content_hash != sr["chunk_content_hash"]), None)
+                if mismatch is not None:
+                    demoted.append(doc_id)
+                    _log(f" doc={doc_id} position {mismatch} hash mismatch → DEMOTE (re-decompose, NEW-1)")
+                    continue
+                # Passed → UPDATE char_start of each jump-target by the stored row PK.
+                n_upd = 0
+                for nd, sr in zip(nodes, stored):
+                    if _is_jump_target(nd) and nd.char_start is not None:
+                        await session.execute(
+                            text("UPDATE document_chunks SET char_start=:cs WHERE id=:id"),
+                            {"cs": nd.char_start, "id": sr["id"]})
+                        n_upd += 1
+                await session.commit()
+                updated.append(doc_id)
+                _log(f" ✓ doc={doc_id} char_start UPDATE {n_upd} jump-target (VERIFY 100%, non-destructive)")
+    finally:
+        await engine.dispose()
+    _log(f"=== update-char-start: updated={len(updated)} demoted={len(demoted)} noop={len(noop)} ===")
+    if demoted:
+        _log(f" DEMOTE (joins the re-decompose batch, NEW-4): {demoted}")
+    if noop:
+        _log(f" NO-OP(md_content NULL suspect, V4): {noop}")
+    # Machine-readable: re-decompose --doc = (gm-t1 hash_changed 230) UNION (this list)
+    print("DEMOTE_DOC_IDS=" + ",".join(str(x) for x in demoted), flush=True)
+
 
 def main():
     ap = argparse.ArgumentParser(description="overnight hier decomposition + section analysis backfill (additive)")
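
The reason NEW-2 updates by primary key rather than by hash shows up as soon as two sections share an identical body. The sketch below is an in-memory analogue of the verify-then-plan step, with made-up hashes, ids, and offsets; `plan_char_start_updates` is a hypothetical helper, and tuples stand in for the builder nodes and stored rows.

```python
from typing import Optional

# built:  (content_hash, is_jump_target, char_start) per node, in build order
# stored: (row_pk, content_hash) per stored row, in chunk_index order
Built = tuple[str, bool, Optional[int]]
Stored = tuple[int, str]


def plan_char_start_updates(built: list[Built], stored: list[Stored]) -> Optional[list[tuple[int, int]]]:
    """Return [(row_pk, char_start), ...] if every position verifies, else None (= DEMOTE)."""
    if len(built) != len(stored):
        return None                      # structure changed
    for (b_hash, _, _), (_, s_hash) in zip(built, stored):
        if b_hash != s_hash:
            return None                  # all-or-nothing: one mismatch demotes the doc
    return [(pk, cs)
            for (_, is_jt, cs), (pk, _) in zip(built, stored)
            if is_jt and cs is not None]


if __name__ == "__main__":
    # Two sections with the same body hash "h1" at different positions.
    built = [("h1", True, 0), ("w", False, None), ("h1", True, 480)]
    stored = [(901, "h1"), (902, "w"), (903, "h1")]
    print(plan_char_start_updates(built, stored))
```

Keying the update on `content_hash = 'h1'` would touch both rows 901 and 903 with whichever offset ran last; pairing by position and writing through the row id keeps the two offsets distinct.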
@@ -275,13 +403,20 @@ def main():
     p_dry = sub.add_parser("dry-run", help="count candidate docs (no work done)")
     p_dry.add_argument("--domains", default=None, help="comma-sep allowlist (unspecified = everything except news)")
     p_dry.add_argument("--doc", default=None, help="comma-sep doc ids (bypasses the size gate; coverage correction for small structured documents)")
+    p_dry.add_argument("--reprocess", action="store_true", help="re-decompose candidates (existing hier + jump-target char_start missing); --doc required")
     p_run = sub.add_parser("run", help="run decomposition + analysis (deadline time-box)")
     p_run.add_argument("--deadline", default="07:00", help="HH:MM (default 07:00; note the container runs in UTC, 07:00 KST=22:00 UTC)")
     p_run.add_argument("--domains", default=None, help="comma-sep allowlist (unspecified = everything except news)")
     p_run.add_argument("--doc", default=None, help="comma-sep doc ids (bypasses the size gate; coverage correction for small structured documents)")
     p_run.add_argument("--skip-analysis", action="store_true", help="skip section analysis (Mac mini), decompose+embed only (to prepare the retrieval go/no-go measurement)")
+    p_run.add_argument("--reprocess", action="store_true",
+                       help="[g3-t2] RE-DECOMPOSE: existing hier DELETE→CASCADE→re-INSERT (md_content source, char_start). "
+                            "--doc (REFINED PASS hash_changed∪demote) required / snapshot required first")
+    p_upd = sub.add_parser("update-char-start",
+                           help="[g3-tU] non-destructive char_start UPDATE for hash_stable docs (100% VERIFY, --doc required)")
+    p_upd.add_argument("--doc", default=None, help="comma-sep doc ids (gm-t1 hash_stable 32)")
     args = ap.parse_args()
-    fn = {"dry-run": cmd_dry_run, "run": cmd_run}[args.cmd]
+    fn = {"dry-run": cmd_dry_run, "run": cmd_run, "update-char-start": cmd_update_char_start}[args.cmd]
     asyncio.run(fn(args))
 
 
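
The commit leaves the g7 rollout to a separate ops step, but the hand-off between the two commands is fully determined by the `DEMOTE_DOC_IDS=` line. A minimal sketch of how an operator script might build the `--doc` argument for the re-decompose run follows; `redecompose_doc_arg` is a hypothetical helper, and the doc ids in the example are only illustrative.

```python
def redecompose_doc_arg(hash_changed_ids: list[int], update_stdout: str) -> str:
    """Union the gate's hash_changed list with the ids demoted by update-char-start.

    update_stdout is the captured stdout of the update-char-start run; the last
    DEMOTE_DOC_IDS= line wins if it appears more than once.
    """
    demoted: list[int] = []
    for line in update_stdout.splitlines():
        if line.startswith("DEMOTE_DOC_IDS="):
            payload = line.split("=", 1)[1]
            demoted = [int(x) for x in payload.split(",") if x.strip()]
    merged = sorted(set(hash_changed_ids) | set(demoted))
    return ",".join(str(d) for d in merged)


if __name__ == "__main__":
    stdout = "=== update-char-start: updated=1 demoted=2 noop=0 ===\nDEMOTE_DOC_IDS=5165,5140\n"
    print(redecompose_doc_arg([5209, 5140], stdout))
```

An empty `DEMOTE_DOC_IDS=` line simply yields the hash_changed list unchanged, so the same helper works whether or not any doc was demoted.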