431d4fe010
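One-page card with a topic×country comparative analysis of news collected overnight (KST 00:00~05:00).
Code, logic, and tables are kept separate from the Phase 4 Global Digest; only the algorithms in services/clustering_common are shared.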
Backend (new):
- migrations/255_morning_briefings.sql: morning_briefings + briefing_topics
  (briefing_date UNIQUE, UNIQUE(briefing_id,topic_rank), FK CASCADE,
  three nullable historical_* columns, cluster_members JSONB,
  country_perspectives JSONB, 4-state status success|partial|failed|empty)
- app/models/briefing.py: SQLAlchemy ORM
- app/services/briefing/loader.py: KST 5h window + news_sources prefix
  fallback (mirrors the Phase 4 pattern) + historical candidate pool loader
- app/services/briefing/clustering.py: cluster_global, topic-first
  (LAMBDA=ln(2)/2h, MIN_COUNTRIES_PER_TOPIC=2, MAX_TOPICS=7)
- app/services/briefing/comparator.py: call_primary 26B + JSON envelope
  sanitize (cap perspectives 10 / divergences 3 / convergences 2 /
  quotes 5) + fixed-shape fallback row + retrieve_historical cosine top-K
- app/services/briefing/pipeline.py: load→cluster→select(K=7,λ=0.6)
  →historical→compare→4-state status→delete+insert transaction
- app/workers/briefing_worker.py: shared entry point for APScheduler and
  manual invocation, 600s hard cap
- app/prompts/briefing_comparative.txt: Korean comparative-analysis JSON
  prompt, two sections {articles_block} + {historical_block}, no-citation label
- app/api/briefing.py: GET /latest, GET ?date=, POST /regenerate?date=
  (admin, sync delete+insert tx, regenerated:true)
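The LAMBDA=ln(2)/2h constant listed for clustering.py reads like an exponential time-decay rate with a 2-hour half-life. A minimal sketch under that assumption follows; the helper name recency_weight and its arguments are hypothetical and not taken from this commit.

```python
import math
from datetime import datetime, timedelta, timezone

HALF_LIFE_HOURS = 2.0
LAMBDA = math.log(2) / HALF_LIFE_HOURS  # per-hour decay rate, i.e. ln(2)/2h


def recency_weight(published_at: datetime, now: datetime) -> float:
    """Exponential decay: the weight halves every HALF_LIFE_HOURS (hypothetical helper)."""
    # Clamp negative ages (timestamps in the future) to zero so the weight never exceeds 1.
    age_hours = max((now - published_at).total_seconds() / 3600.0, 0.0)
    return math.exp(-LAMBDA * age_hours)


now = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)
print(recency_weight(now, now))                       # age 0h
print(recency_weight(now - timedelta(hours=2), now))  # age 2h, one half-life
```

If the per-member "weight" used by select_for_llm is computed this way, an article from the start of the 5h window would count for roughly a sixth of one published at the end of it.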
Backend (modified):
- app/main.py: register briefing_router (/api/briefing prefix). Scheduler
  registration follows in PR-3.
- app/services/digest/selection.py: parameterize select_for_llm (K and λ
  injected by the caller). Phase 4 behavior is preserved via default values.
Historical policy:
- BRIEFING_HISTORICAL_ENABLED env flag, default off.
- flag off → all historical_* columns NULL, prompt {historical_block} is an
  empty label, retrieval is not called.
- flag on (enabled in PR-1b) → cosine top-K 5 (sim≥0.70) between the cluster
  centroid and doc embeddings from the past 30 days, injected into the prompt.
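The flag-on path above (cosine top-K 5 with sim≥0.70 against the cluster centroid) can be sketched roughly as follows. This is an illustration only: retrieve_historical_sketch, the pool layout, and the flag parsing are assumptions, and the actual retrieve_historical in comparator.py may differ.

```python
import os

import numpy as np

TOP_K = 5
MIN_SIM = 0.70


def _unit(v) -> np.ndarray:
    """L2-normalize a vector; a zero vector is returned unchanged."""
    v = np.asarray(v, dtype=float)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v


def retrieve_historical_sketch(centroid, pool, *, enabled, top_k=TOP_K, min_sim=MIN_SIM):
    """Return up to top_k (doc, sim) pairs with cosine sim >= min_sim, best first.

    pool: candidate docs from the past 30 days, each with an "embedding" field.
    Returns [] without scoring anything when the flag is off.
    """
    if not enabled:
        return []
    c = _unit(centroid)
    scored = [(doc, float(np.dot(c, _unit(doc["embedding"])))) for doc in pool]
    hits = [pair for pair in scored if pair[1] >= min_sim]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits[:top_k]


# Hypothetical flag parsing; the commit only states that the flag defaults to off.
enabled = os.environ.get("BRIEFING_HISTORICAL_ENABLED", "").lower() in {"1", "true", "on"}

pool = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.8, 0.6]},  # cosine 0.8 against [1, 0]
    {"id": "c", "embedding": [0.0, 1.0]},  # cosine 0.0, below the threshold
]
hits = retrieve_historical_sketch([1.0, 0.0], pool, enabled=True)
print([doc["id"] for doc, _ in hits])
```

An empty result on the flag-on path corresponds to the "on zero match" case covered by the regression tests.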
Country canonical (after checking real data):
- documents.country column confirmed absent
- document_chunks.country match rate 0% (chunks are not created for news at all)
- the only country signal = news_sources prefix mapping (same as Phase 4)
Tests:
- tests/test_briefing_historical.py: regression over 3 paths (flag off / on
  with fixture / on with zero match) + sanitize cap + fallback row shape.
Verification: GPU container pytest + manual regenerate in PR-1.8.
64 lines
2.1 KiB
Python
"""Select LLM inputs within a cluster: top-k + MMR diversity + ai_summary truncation.

Pure top-relevance is biased toward duplicate summaries of the same event, so MMR is used to add diversity.
ai_summary length is capped at SUMMARY_TRUNCATE to protect the LLM token budget.
"""

import numpy as np

from services.clustering_common import normalize_vector as _normalize

K_PER_CLUSTER = 5
LAMBDA_MMR = 0.7  # relevance 70% / diversity 30%
SUMMARY_TRUNCATE = 300  # guard against long-tail ai_summary


def select_for_llm(
    cluster: dict,
    k: int = K_PER_CLUSTER,
    *,
    lambda_mmr: float = LAMBDA_MMR,
    summary_truncate: int = SUMMARY_TRUNCATE,
) -> list[dict]:
    """Select representative articles within a cluster for the LLM call.

    Args:
        cluster: a single cluster from clustering.cluster_country / briefing.cluster_global
        k: number of articles to select (Phase 4=5, briefing=7)
        lambda_mmr: relevance vs diversity (Phase 4=0.7, briefing=0.6)
        summary_truncate: truncation length for ai_summary (LLM token protection)

    Returns:
        List of selected doc dicts. Each item gains an ai_summary_truncated field.
    """
    members = cluster["members"]
    if len(members) <= k:
        selected = list(members)
    else:
        centroid = cluster["centroid"]
        for m in members:
            v = _normalize(m["embedding"])
            m["_rel"] = float(np.dot(centroid, v)) * m["weight"]

        first = max(members, key=lambda x: x["_rel"])
        selected = [first]
        candidates = [m for m in members if m is not first]

        while len(selected) < k and candidates:
            def mmr_score(c: dict) -> float:
                v = _normalize(c["embedding"])
                max_sim = max(
                    float(np.dot(v, _normalize(s["embedding"])))
                    for s in selected
                )
                return lambda_mmr * c["_rel"] - (1.0 - lambda_mmr) * max_sim

            pick = max(candidates, key=mmr_score)
            selected.append(pick)
            candidates = [c for c in candidates if c is not pick]  # identity; list.remove's dict == can raise on array values

    for m in selected:
        m["ai_summary_truncated"] = (m.get("ai_summary") or "")[:summary_truncate]

    return selected
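To see what lowering lambda_mmr from 1.0 to the briefing value 0.6 does, here is a standalone toy version of the greedy MMR loop on 2-D unit vectors. It does not import the module above; unit and mmr_pick_order are hypothetical names used only for this illustration, and the embeddings are made up.

```python
import math

import numpy as np


def unit(angle_deg: float) -> np.ndarray:
    """2-D unit vector at the given angle (toy stand-in for a real embedding)."""
    rad = math.radians(angle_deg)
    return np.array([math.cos(rad), math.sin(rad)])


def mmr_pick_order(members, centroid, k, lambda_mmr):
    """Greedy MMR over unit-norm embeddings; returns the ids in pick order."""
    rel = {m["id"]: float(np.dot(centroid, m["embedding"])) * m["weight"] for m in members}
    selected = [max(members, key=lambda m: rel[m["id"]])]
    candidates = [m for m in members if m is not selected[0]]
    while len(selected) < k and candidates:
        def score(c):
            # Penalize similarity to the closest already-selected member.
            max_sim = max(float(np.dot(c["embedding"], s["embedding"])) for s in selected)
            return lambda_mmr * rel[c["id"]] - (1.0 - lambda_mmr) * max_sim
        pick = max(candidates, key=score)
        selected.append(pick)
        candidates = [c for c in candidates if c is not pick]
    return [m["id"] for m in selected]


centroid = unit(0)
members = [
    {"id": "a", "embedding": unit(10), "weight": 1.0},
    {"id": "b", "embedding": unit(12), "weight": 1.0},   # near-duplicate of "a"
    {"id": "c", "embedding": unit(-25), "weight": 1.0},  # different angle on the topic
]

print(mmr_pick_order(members, centroid, k=2, lambda_mmr=1.0))  # pure relevance: ['a', 'b']
print(mmr_pick_order(members, centroid, k=2, lambda_mmr=0.6))  # with diversity: ['a', 'c']
```

With pure relevance the near-duplicate "b" takes the second slot; at 0.6 the diversity penalty on "b" (cosine about 0.999 to "a") outweighs its small relevance lead, so "c" is chosen instead.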