News from a 7-day rolling window is grouped at two levels (country × topic) and generated as a daily batch at 04:00 KST.
The search pipeline is not used: documents → clustering → cluster-level LLM summarization → digest.
Key decisions:
- adaptive threshold (0.75/0.78/0.80) + EMA centroid (α=0.7) + time-decay (λ=ln(2)/3); see the weighting sketch after this list
- min_articles=3, max_topics=10/country, top-5 MMR diversity, ai_summary[:300] truncate
- cluster-level LLM only; never-drop fallback (topic_label="주요 뉴스 묶음" + top member ai_summary[:200])
- importance_score normalized to 0-1 per country, raw_weight_sum kept separately, max(score, 0.01) floor
- per-call timeout 25s + pipeline hard cap 600s
- idempotent DELETE+INSERT (UNIQUE digest_date); calls AIClient._call_chat directly (no changes to client.py)
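A minimal sketch of the time-decay weighting, EMA centroid update, and per-country importance normalization above. The helper names (decay_weight, update_centroid, normalize_importance) are illustrative, not the actual module API, and dividing by the per-country maximum is just one way to realize the 0-1 normalization:

import math

import numpy as np

DECAY_LAMBDA = math.log(2) / 3   # λ = ln(2)/3: a document's weight halves every 3 days
EMA_ALPHA = 0.7                  # EMA centroid keeps 70% of the previous centroid
SCORE_FLOOR = 0.01               # max(score, 0.01) floor after normalization


def decay_weight(age_days: float) -> float:
    """Time-decay weight for a document that is age_days old."""
    return math.exp(-DECAY_LAMBDA * age_days)


def update_centroid(centroid: np.ndarray, new_vec: np.ndarray) -> np.ndarray:
    """EMA update of a cluster centroid when a new member joins (α = 0.7)."""
    merged = EMA_ALPHA * centroid + (1.0 - EMA_ALPHA) * new_vec
    return merged / np.linalg.norm(merged)


def normalize_importance(raw_weight_sums: dict[str, float]) -> dict[str, float]:
    """Scale raw per-topic weights to 0-1 within one country, with the 0.01 floor.

    The raw_weight_sum values themselves are stored separately, not overwritten.
    """
    top = max(raw_weight_sums.values(), default=0.0)
    return {
        topic: max(w / top, SCORE_FLOOR) if top > 0 else SCORE_FLOOR
        for topic, w in raw_weight_sums.items()
    }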
New files:
- migrations/101_global_digests.sql (normalized into 2 tables)
- app/models/digest.py (GlobalDigest + DigestTopic ORM)
- app/services/digest/{loader,clustering,selection,summarizer,pipeline}.py
- app/workers/digest_worker.py (PIPELINE_HARD_CAP + CLI entry point)
- app/api/digest.py (/latest, ?date|country, /regenerate, inline Pydantic; response shape sketched below)
- app/prompts/digest_topic.txt (JSON-only output + absolute-prohibition block)
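A rough illustration of the inline-Pydantic /latest route. The router prefix, field names, and the get_latest_digest stub are assumptions, not the actual contents of app/api/digest.py:

from datetime import date

from fastapi import APIRouter, Query
from pydantic import BaseModel

router = APIRouter(prefix="/digest", tags=["digest"])   # prefix is an assumption


class TopicOut(BaseModel):          # inline Pydantic models, no separate schema module
    topic_label: str
    summary: str
    importance_score: float


class DigestOut(BaseModel):
    digest_date: date
    country: str
    topics: list[TopicOut]


async def get_latest_digest(digest_date=None, country=None) -> list[DigestOut]:
    return []                       # placeholder for the real DB query


@router.get("/latest", response_model=list[DigestOut])
async def latest(
    digest_date: date | None = Query(None, alias="date"),    # ?date=YYYY-MM-DD
    country: str | None = Query(None),                       # ?country=KR
):
    return await get_latest_digest(digest_date=digest_date, country=country)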
main.py changes: 4 lines (2 imports + 1 scheduler add_job + 1 include_router).
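Roughly, those four lines could look like the following; scheduler and app are the objects already defined in main.py, and the imported symbol names are assumptions:

from app.api.digest import router as digest_router          # import 1 (name assumed)
from app.workers.digest_worker import run_digest_batch      # import 2 (name assumed)

# 04:00 in the scheduler's timezone, assumed to be Asia/Seoul (KST)
scheduler.add_job(run_digest_batch, "cron", hour=4, minute=0)
app.include_router(digest_router)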
plan: ~/.claude/plans/quiet-herding-tome.md
app/services/digest/selection.py (63 lines, Python):
"""Cluster 내 LLM 입력 선정 — top-k + MMR diversity + ai_summary truncate.
|
||
|
||
순수 top-relevance 는 동일 사건 중복 요약문에 편향되므로 MMR 로 다양성 확보.
|
||
ai_summary 길이는 LLM 토큰 보호를 위해 SUMMARY_TRUNCATE 로 제한.
|
||
"""
|
||
|
||
import numpy as np
|
||
|
||
K_PER_CLUSTER = 5
|
||
LAMBDA_MMR = 0.7 # relevance 70% / diversity 30%
|
||
SUMMARY_TRUNCATE = 300 # long tail ai_summary 방어
|
||
|
||
|
||
def _normalize(v: np.ndarray) -> np.ndarray:
|
||
norm = float(np.linalg.norm(v))
|
||
if norm == 0.0:
|
||
return v
|
||
return v / norm
|
||
|
||
|
||
def select_for_llm(cluster: dict, k: int = K_PER_CLUSTER) -> list[dict]:
|
||
"""cluster 내 LLM 호출용 대표 article 들 선정.
|
||
|
||
Args:
|
||
cluster: clustering.cluster_country 결과 단일 cluster
|
||
k: 선정 개수 (기본 5)
|
||
|
||
Returns:
|
||
선정된 doc dict 리스트. 각 항목에 ai_summary_truncated 필드가 추가됨.
|
||
"""
|
||
members = cluster["members"]
|
||
if len(members) <= k:
|
||
selected = list(members)
|
||
else:
|
||
centroid = cluster["centroid"]
|
||
# relevance = centroid 유사도 × decay weight
|
||
for m in members:
|
||
v = _normalize(m["embedding"])
|
||
m["_rel"] = float(np.dot(centroid, v)) * m["weight"]
|
||
|
||
first = max(members, key=lambda x: x["_rel"])
|
||
selected = [first]
|
||
candidates = [m for m in members if m is not first]
|
||
|
||
while len(selected) < k and candidates:
|
||
def mmr_score(c: dict) -> float:
|
||
v = _normalize(c["embedding"])
|
||
max_sim = max(
|
||
float(np.dot(v, _normalize(s["embedding"])))
|
||
for s in selected
|
||
)
|
||
return LAMBDA_MMR * c["_rel"] - (1.0 - LAMBDA_MMR) * max_sim
|
||
|
||
pick = max(candidates, key=mmr_score)
|
||
selected.append(pick)
|
||
candidates.remove(pick)
|
||
|
||
# LLM 입력 토큰 보호
|
||
for m in selected:
|
||
m["ai_summary_truncated"] = (m.get("ai_summary") or "")[:SUMMARY_TRUNCATE]
|
||
|
||
return selected
|
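A toy usage example against the cluster shape select_for_llm expects (members carrying embedding, weight, and ai_summary plus a precomputed centroid); the vectors and the id field are made-up test data, and it is meant to run in the same module as the code above:

rng = np.random.default_rng(0)
docs = [
    {
        "id": i,
        "embedding": rng.normal(size=768),
        "weight": 0.9 - 0.1 * i,                 # stands in for the time-decay weight
        "ai_summary": f"summary of article {i} " * 40,
    }
    for i in range(8)
]
centroid = _normalize(np.mean([_normalize(d["embedding"]) for d in docs], axis=0))
cluster = {"centroid": centroid, "members": docs}

picked = select_for_llm(cluster)                 # k = 5 by default
print([d["id"] for d in picked])                 # 5 diverse member ids
print(len(picked[0]["ai_summary_truncated"]))    # capped at 300 chars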