feat(search): Phase 2.3 soft_filter boost (domain/doctype)

## Changes

### fusion_service.py
 - SOFT_FILTER_MAX_BOOST = 0.05 (permanent rule from the plan; prevents distortion of RRF scores)
 - SOFT_FILTER_DOMAIN_BOOST = 0.03, SOFT_FILTER_DOCTYPE_BOOST = 0.02
 - apply_soft_filter_boost(results, soft_filters) → int
   - ai_domain substring matching (paths included, e.g. "Industrial_Safety/Legislation")
   - document_type token matching (haystack: ai_domain + match_reason)
   - enforces the 0.05 cap
   - re-sorts by score after boosting
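
As a rough illustration of the steps listed above, the boost could be sketched as follows. This is not the code from fusion_service.py: the `SearchResult` dataclass, the `domain` / `document_type` keys of `soft_filters`, and the plain substring check used for token matching are all assumptions made for the example. Only the three constants and the function signature come from this commit message.

```python
from dataclasses import dataclass

SOFT_FILTER_MAX_BOOST = 0.05
SOFT_FILTER_DOMAIN_BOOST = 0.03
SOFT_FILTER_DOCTYPE_BOOST = 0.02


@dataclass
class SearchResult:
    """Hypothetical stand-in for the real fused result type."""
    doc_id: int
    score: float
    ai_domain: str = ""
    match_reason: str = ""


def apply_soft_filter_boost(results, soft_filters):
    """Boost matching results in place, re-sort by score, return how many were boosted."""
    # Assumed shape: {"domain": str, "document_type": [str, ...]}
    domain = (soft_filters.get("domain") or "").lower()
    doc_types = [t.lower() for t in soft_filters.get("document_type") or []]

    boosted = 0
    for r in results:
        boost = 0.0
        ai_domain = (r.ai_domain or "").lower()

        # Substring match so that path-like values such as
        # "Industrial_Safety/Legislation" match on "Industrial_Safety".
        if domain and domain in ai_domain:
            boost += SOFT_FILTER_DOMAIN_BOOST

        # Assumption: a doctype token "matches" if it occurs in the haystack.
        haystack = f"{ai_domain} {(r.match_reason or '').lower()}"
        if any(t in haystack for t in doc_types):
            boost += SOFT_FILTER_DOCTYPE_BOOST

        # Hard cap so the boost cannot dominate the fused score.
        boost = min(boost, SOFT_FILTER_MAX_BOOST)
        if boost > 0:
            r.score += boost
            boosted += 1

    if boosted:
        results.sort(key=lambda r: r.score, reverse=True)
    return boosted
```

With the current constants the domain and doctype boosts sum to exactly the cap, so the `min(...)` only comes into play if further boost terms are added later.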

### api/search.py
 - Called right after fusion when all of the following hold:
   - analyzer_cache_hit == True
   - analyzer_tier != "ignore" (confidence >= 0.5)
   - query_analysis.soft_filters is present
 - Records "soft_filter_boost applied=N" in notes

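## Phase 2.3 scope
 - hard_filter SQL WHERE: the current evaluation set has no queries with explicit filters, so its effect cannot be measured → deferred until after the Phase 2.4 v0.2 expansion
 - Matching document_type directly against file_format is a semantic mismatch → excluded
 - hard_filter is left for an iteration after Phase 2.4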

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyungi Ahn
2026-04-08 15:30:23 +09:00
parent e595283e27
commit e91c199537
2 changed files with 83 additions and 1 deletions

@@ -16,7 +16,12 @@ from core.database import get_session
from core.utils import setup_logger
from models.user import User
from services.search import query_analyzer
-from services.search.fusion_service import DEFAULT_FUSION, get_strategy, normalize_display_scores
+from services.search.fusion_service import (
+    DEFAULT_FUSION,
+    apply_soft_filter_boost,
+    get_strategy,
+    normalize_display_scores,
+)
from services.search.rerank_service import (
MAX_CHUNKS_PER_DOC,
MAX_RERANK_INPUT,
@@ -258,6 +263,19 @@ async def search(
f"unique_docs={len(chunks_by_doc)}"
)
+    # Phase 2.3: soft_filter boost (only on an analyzer cache hit with tier != "ignore")
+    # Disabled when analyzer_confidence < 0.5 (tier=ignore).
+    if (
+        analyzer_cache_hit
+        and analyzer_tier != "ignore"
+        and query_analysis
+    ):
+        soft_filters = query_analysis.get("soft_filters") or {}
+        if soft_filters:
+            boosted = apply_soft_filter_boost(fused_docs, soft_filters)
+            if boosted > 0:
+                notes.append(f"soft_filter_boost applied={boosted}")
     if rerank:
         # Phase 1.3: reranker, chunk-level input
         # Retrieve the raw chunks from chunks_by_doc using the doc_ids in the fusion result