a67df0a10b
Phase 2 output of phase-2a-embedding-diagnose.md v4 § 7.

Pair invariant (R2-2): documents_cand and document_chunks_cand are swapped
together; partial swaps are forbidden.

- snapshot frozen (R2-D): v0_2_phase2a_snapshot_2026-05-23.json
  - SNAPSHOT_DOC_ID_MAX=25180 / SNAPSHOT_CHUNK_ID_MAX=56526
  - documents_n=21365 (embedded, active) / chunks_n=30605
  - production ingest is never paused; every candidate reindex and baseline
    rebaseline measurement is restricted to id <= snapshot
- reindex_candidate.py added (R2-5):
  - reindex_documents(): imports and reuses production _build_embed_input()
  - reindex_chunks(): uses document_chunks.text as-is (no re-chunking)
  - TEI batch=8 (avoids the 1.7 internal queue overflow) + truncate=true
    (mE5 512 context)
  - retry-8 exponential backoff (10/20/40/80/90s); recovers automatically
    from TEI SIGSEGV
  - idempotent ON CONFLICT DO NOTHING (safe to cancel and resume)
- docker-compose.override.cand.yml: restart=unless-stopped (automatic
  recovery from TEI 1.7 panics)

DB artifacts (4 tables):
- documents_cand_me5_large_inst       : 21365 rows (dim 1024) + ivfflat lists=100
- document_chunks_cand_me5_large_inst : 30605 rows (dim 1024) + ivfflat lists=100
- documents_cand_snowflake_l_v2       : 21365 rows (dim 1024) + ivfflat lists=100
- document_chunks_cand_snowflake_l_v2 : 30605 rows (dim 1024) + ivfflat lists=100
- ivfflat.probes=20 preserved (same as production)
- smoke retrieval (nearest-neighbor SQL): PASS for both candidates

Production impact:
- documents / document_chunks: no column or row changes
- config.yaml: no changes (ollama bge-m3 unchanged)
- production fastapi/postgres/reranker: no changes (isolated under the
  embed-cand profile)

Next step: Phase 3 (add a slug-based dispatcher to the DS API and
retrieval_service; rebaseline the baseline and measure both candidates on
51 cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
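
Note on the TEI batching and retry settings listed for reindex_candidate.py:
the sketch below shows one way a batch size of 8 and a retry-8 backoff of
10/20/40/80/90s could be expressed. It is not the code in this commit. The
helper names (backoff_delays, batched, call_with_retry) are hypothetical, and
it assumes the schedule doubles from 10s and is capped at 90s for the
remaining retries, which the message does not spell out.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def backoff_delays(retries: int = 8, base: float = 10.0, cap: float = 90.0) -> List[float]:
    """Delay before each retry: doubles from `base`, capped at `cap` (assumed schedule)."""
    return [min(base * 2 ** i, cap) for i in range(retries)]


def batched(items: Sequence[T], size: int = 8) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def call_with_retry(
    fn: Callable[[], R],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call `fn`; on any exception wait for the next delay and try again.

    After the delays are exhausted, one final attempt is made and its
    exception, if any, propagates to the caller.
    """
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()
```

Passing `sleep` as a parameter keeps the retry loop testable without real
waiting; a caller would wrap each embedding request for one batch in
`call_with_retry`.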