fix(search): restore diversity via doc-level aggregation (Phase 1.2-C)
Phase 1.2-C eval set: recall 0.788 → 0.531, natural_language 0.73 → 0.07.
Diagnosis:
Fetching raw chunks with a plain chunk top-N (limit*5=25) let several
chunks from the same doc crowd the top ranks → unique-doc diversity collapsed.
warm test debug: 'chunks raw=16 compressed=5 unique_docs=10'
Fix (user-recommended option C):
Window function ROW_NUMBER() with PARTITION BY doc_id returns only the top 2 chunks per doc.
SQL flow:
1. inner CTE topk: ivfflat index quickly pulls the top inner_k chunks
   (inner_k = max(limit*10, 200))
2. ranked CTE: ROW_NUMBER over PARTITION BY doc_id ORDER BY dist
3. outer: rn <= 2 (max 2 chunks per doc) + JOIN documents
4. limit = limit * 4 (in chunk units, ~limit*2 unique docs)
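The per-doc cap in the flow above can be emulated in plain Python — a minimal sketch of ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY dist), not the production SQL path; the dict shape and helper name are illustrative:

```python
from collections import defaultdict

def top_chunks_per_doc(chunks, limit, per_doc=2):
    """Emulate ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY dist):
    keep at most per_doc chunks per doc, then cut at limit * 4 chunks."""
    rn = defaultdict(int)  # doc_id -> row number within its partition
    kept = []
    for chunk in sorted(chunks, key=lambda c: c["dist"]):  # global order by dist
        rn[chunk["doc_id"]] += 1
        if rn[chunk["doc_id"]] <= per_doc:  # the outer SELECT's rn <= 2
            kept.append(chunk)
    return kept[: limit * 4]  # SQL LIMIT is limit * 4, in chunk units

# Five chunks of doc "a" dominate the top ranks; doc "b" still survives:
chunks = [{"chunk_id": i, "doc_id": "a", "dist": 0.01 * i} for i in range(5)]
chunks.append({"chunk_id": 10, "doc_id": "b", "dist": 0.5})
result = top_chunks_per_doc(chunks, limit=5)
# → chunk_ids [0, 1, 10]: doc "a" capped at 2 chunks, doc "b" included
```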
Reranker compatibility:
Up to 2 chunks per doc are returned as-is → chunks_by_doc is preserved.
compress_chunks_to_docs keeps working unchanged (best chunk per doc).
The Phase 1.3 reranker can recover raw chunks from chunks_by_doc.
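compress_chunks_to_docs itself is not part of this diff; the contract described above can be sketched as follows (field names and the return shape are assumptions for illustration, not the actual helper):

```python
from collections import defaultdict

def compress_to_docs(chunk_results):
    """Sketch of the compress_chunks_to_docs contract: one result per doc
    (its best-scoring chunk), with all raw chunks kept in chunks_by_doc."""
    chunks_by_doc = defaultdict(list)
    for r in chunk_results:  # r uses SearchResult-like keys: id = doc_id
        chunks_by_doc[r["id"]].append(r)
    docs = [max(rs, key=lambda r: r["score"]) for rs in chunks_by_doc.values()]
    docs.sort(key=lambda r: r["score"], reverse=True)
    return docs, dict(chunks_by_doc)  # reranker can recover raw chunks here

rows = [
    {"id": "doc1", "chunk_id": 1, "score": 0.90},
    {"id": "doc1", "chunk_id": 2, "score": 0.80},  # second chunk, same doc
    {"id": "doc2", "chunk_id": 3, "score": 0.85},
]
docs, by_doc = compress_to_docs(rows)
# → docs is [doc1 (best chunk 0.90), doc2 (0.85)]; by_doc["doc1"] keeps both chunks
```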
Core principle: vector retrieval must find by chunk and select by doc.
@@ -121,16 +121,24 @@ async def search_text(
 async def search_vector(
     session: AsyncSession, query: str, limit: int
 ) -> list["SearchResult"]:
-    """Vector similarity search — chunk-level (Phase 1.2-C).
+    """Vector similarity search — chunk-level with doc diversity guarantee (Phase 1.2-C).
 
-    Returns raw chunks from the document_chunks table by cosine similarity.
-    Several chunks from the same doc may be returned (no compression here).
-    Compress per doc with the compress_chunks_to_docs() helper right before fusion.
-    The Phase 1.3 reranker uses the raw chunks as-is.
+    Phase 1.2-C diagnosis:
+    A plain chunk top-N lets several chunks from the same doc crowd the top
+    ranks, collapsing unique-doc diversity → recall 0.788 → 0.531 (catastrophic).
 
-    SearchResult.id = doc_id (fusion-compatible)
-    SearchResult.chunk_id / chunk_index / section_title = chunk metadata
-    snippet = first 200 chars of the chunk's text
+    Fix (user-recommended option C):
+    Window function PARTITION by doc_id → return only each doc's top 2 chunks.
+    Satisfies both raw_chunks (chunks_by_doc preserved) and doc-level compression.
+
+    SQL flow:
+    1. inner CTE: extract top-K chunks quickly via the ivfflat index
+    2. ranked CTE: ROW_NUMBER over doc_id PARTITION, score descending
+    3. outer: rn <= 2 (max 2 chunks per doc) + JOIN documents
+
+    Returns:
+        list[SearchResult] — chunk-level, at most 2 per doc. compress_chunks_to_docs
+        handles doc-level compression + chunks_by_doc preservation.
     """
     from api.search import SearchResult  # avoid circular import
 
@@ -141,29 +149,49 @@ async def search_vector(
     except Exception:
         return []
 
-    # Join raw chunks with doc metadata. Fetch ~limit*5 wide → doc diversity after compression.
-    fetch_limit = limit * 5
+    # Extract top-K chunks via the ivfflat index, then partition per doc.
+    # inner_k ≈ limit * 10 secures enough unique docs (~30-50 docs).
+    inner_k = max(limit * 10, 200)
     result = await session.execute(
         text("""
+            WITH topk AS (
+                SELECT
+                    c.id AS chunk_id,
+                    c.doc_id,
+                    c.chunk_index,
+                    c.section_title,
+                    c.text,
+                    c.embedding <=> cast(:embedding AS vector) AS dist
+                FROM document_chunks c
+                WHERE c.embedding IS NOT NULL
+                ORDER BY c.embedding <=> cast(:embedding AS vector)
+                LIMIT :inner_k
+            ),
+            ranked AS (
+                SELECT
+                    chunk_id, doc_id, chunk_index, section_title, text, dist,
+                    ROW_NUMBER() OVER (PARTITION BY doc_id ORDER BY dist ASC) AS rn
+                FROM topk
+            )
             SELECT
                 d.id AS id,
                 d.title AS title,
                 d.ai_domain AS ai_domain,
                 d.ai_summary AS ai_summary,
                 d.file_format AS file_format,
-                (1 - (c.embedding <=> cast(:embedding AS vector))) AS score,
-                left(c.text, 200) AS snippet,
+                (1 - r.dist) AS score,
+                left(r.text, 200) AS snippet,
                 'vector' AS match_reason,
-                c.id AS chunk_id,
-                c.chunk_index AS chunk_index,
-                c.section_title AS section_title
-            FROM document_chunks c
-            JOIN documents d ON d.id = c.doc_id
-            WHERE c.embedding IS NOT NULL AND d.deleted_at IS NULL
-            ORDER BY c.embedding <=> cast(:embedding AS vector)
+                r.chunk_id AS chunk_id,
+                r.chunk_index AS chunk_index,
+                r.section_title AS section_title
+            FROM ranked r
+            JOIN documents d ON d.id = r.doc_id
+            WHERE r.rn <= 2 AND d.deleted_at IS NULL
+            ORDER BY r.dist
             LIMIT :limit
         """),
-        {"embedding": str(query_embedding), "limit": fetch_limit},
+        {"embedding": str(query_embedding), "inner_k": inner_k, "limit": limit * 4},
     )
     return [SearchResult(**row._mapping) for row in result]
 