hyungi_document_server

Author	SHA1	Message	Date
Hyungi Ahn	76e723cdb1	feat(search): Phase 1.3 TEI reranker 통합 (코드 골격) 데이터 흐름 원칙: fusion=doc 기준 / reranker=chunk 기준 — 절대 섞지 말 것. 신규/수정: - ai/client.py: rerank() 메서드 추가 (TEI POST /rerank API) - services/search/rerank_service.py: - rerank_chunks() — asyncio.Semaphore(2) + 5s soft timeout + RRF fallback - _make_snippet/_extract_window — title + query 중심 200~400 토큰 (keyword 매치 없으면 첫 800자 fallback) - apply_diversity() — max_per_doc=2, top score>=0.90 unlimited - warmup_reranker() — 10회 retry + 3초 간격 (TEI 모델 로딩 대기) - MAX_RERANK_INPUT=200, MAX_CHUNKS_PER_DOC=2 hard cap - services/search_telemetry.py: compute_confidence_reranked() — sigmoid score 임계값 - api/search.py: - ?rerank=true\|false 파라미터 (기본 true, hybrid 모드만) - 흐름: fused_docs(limit*5) → chunks_by_doc 회수 → rerank_chunks → apply_diversity - text-only 매치 doc은 doc 자체를 chunk처럼 wrap (fallback) - rerank 활성 시 confidence는 reranker score 기반 - tests/search_eval/run_eval.py: --rerank true\|false 플래그 GPU 적용 보류: - TEI 컨테이너 추가 (docker-compose.yml) — 별도 작업 - config.yaml rerank.endpoint 갱신 — GPU 직접 (commit 없음) - 재인덱싱 완료 후 build + warmup + 평가셋 측정	2026-04-08 12:41:47 +09:00
Hyungi Ahn	161ff18a31	feat(search): Phase 0.5 RRF fusion + 강한 신호 boost 기존 weighted-sum merge를 Reciprocal Rank Fusion으로 교체. 정확 키워드 매치에서 RRF가 평탄화되는 문제는 boost로 보완. 신규 모듈 app/services/search_fusion.py: - FusionStrategy ABC - LegacyWeightedSum : 기존 _merge_results 동작 (A/B 비교용) - RRFOnly : 순수 RRF, k=60 - RRFWithBoost : RRF + title/tags/법령조문/high-text-score boost (default) - normalize_display_scores: SearchResult.score를 [0..1] 랭크 기반 정규화 (프론트엔드가 score*100을 % 표시하므로 RRF 원본 점수 노출 시 표시 깨짐) search.py: - ?fusion=legacy\|rrf\|rrf_boost 파라미터 (default rrf_boost) - _merge_results 제거 (LegacyWeightedSum에 흡수) - pre-fusion confidence: hybrid는 raw text/vector 신호로 계산 (fused score는 fusion 전략마다 스케일이 달라 일관 비교 불가) - timing에 fusion_ms 추가 - debug notes에 fusion 전략 표시 telemetry: - compute_confidence_hybrid(text_results, vector_results) 헬퍼 - record_search_event에 confidence override 파라미터 run_eval.py: - --fusion CLI 옵션, call_search 쿼리 파라미터에 전달 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:58:33 +09:00
Hyungi Ahn	70b27d4a51	fix(search): confidence 임계값 완화 + hybrid +vector boost 가산 baseline 평가셋 실행 시 'summary+vector' top_score 2.39가 임계값 2.5에 미달해 정답 쿼리(산업안전보건법 제6장)가 low_confidence로 잘못 잡힘. - 텍스트 매치 임계값 0.5씩 완화 (실측 분포 반영) - '+vector' 접미사가 있으면 hybrid 합성 매치이므로 confidence +0.10 가산 - 정답률 5/5 → 4/5 false-positive 1건 제거 기대 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:37:13 +09:00
Hyungi Ahn	50e6b5ad90	fix(search): confidence 휴리스틱 vector-only amplify 버그 수정 vector-only 매치(match_reason == 'vector')에서 raw 코사인 0.43이 0.6으로 잘못 amplify되어 low_confidence threshold(0.5)를 못 넘기던 문제. - vector-only 분기: amplify 제거, _cosine_to_confidence로 일관 환산 - _cosine_to_confidence: bge-m3 코사인 분포 (무관 텍스트 ~0.4) 반영 - 코사인 0.55 = threshold 경계(0.50), 0.45 미만은 명확히 low smoke test 결과 zzzqxywvkpqxnj1234 같은 무의미 쿼리(top cosine 0.43)가 low_confidence로 잡히지 않던 문제 해결. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:33:25 +09:00
Hyungi Ahn	f005922483	feat(search): Phase 0.3 검색 실패 자동 로깅 검색 실패 케이스를 자동 수집해 gold dataset 시드로 활용. wiggly-weaving-puppy 플랜 Phase 0.3 산출물. 자동 수집 트리거 (3가지): - result_count == 0 → no_result - confidence < 0.5 → low_confidence - 60초 내 동일 사용자 재쿼리 → user_reformulated (이전 쿼리 기록) confidence는 Phase 0.3 휴리스틱 (top score + match_reason). Phase 2 QueryAnalyzer 도입 후 LLM 기반으로 교체 예정. 구현: - migrations/015_search_failure_logs.sql: 테이블 + 3개 인덱스 - app/models/search_failure.py: ORM - app/services/search_telemetry.py: confidence 계산 + recent 트래커 + INSERT - app/api/search.py: BackgroundTasks로 dispatch (응답 latency 영향 X) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:29:12 +09:00

5 Commits