hyungi_document_server/app/services/digest/selection.py
Hyungi Ahn 431d4fe010 feat(briefing): add morning briefing schema + services + api (historical off)
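One-page card with a topic×country comparative analysis of news collected overnight (KST 00:00~05:00).
Code, logic, and tables are kept separate from the Phase 4 Global Digest; only the algorithms in services/clustering_common are shared.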

Backend (new):
- migrations/255_morning_briefings.sql: morning_briefings + briefing_topics
  (briefing_date UNIQUE, UNIQUE(briefing_id,topic_rank), FK CASCADE,
  three nullable historical_* columns, cluster_members JSONB,
  country_perspectives JSONB, 4-state status success|partial|failed|empty)
- app/models/briefing.py: SQLAlchemy ORM
- app/services/briefing/loader.py: KST 5h window + news_sources prefix
  fallback (mirrors the Phase 4 pattern) + historical candidate pool loader
- app/services/briefing/clustering.py: cluster_global, topic-first
  (LAMBDA=ln(2)/2h, MIN_COUNTRIES_PER_TOPIC=2, MAX_TOPICS=7)
- app/services/briefing/comparator.py: call_primary 26B + JSON envelope
  sanitize (cap perspectives 10 / divergences 3 / convergences 2 /
  quotes 5) + fixed-shape fallback row + retrieve_historical cosine top-K
- app/services/briefing/pipeline.py: load→cluster→select(K=7,λ=0.6)
  →historical→compare→4-state status→delete+insert transaction
- app/workers/briefing_worker.py: shared entry point for APScheduler and
  manual invocation, 600s hard cap
- app/prompts/briefing_comparative.txt: Korean comparative-analysis JSON
  prompt, two sections {articles_block} + {historical_block}, with a
  do-not-quote label
- app/api/briefing.py: GET /latest, GET ?date=, POST /regenerate?date=
  (admin, sync delete+insert tx, regenerated:true)
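
The constant LAMBDA=ln(2)/2h in clustering.py has the form of an exponential decay rate with a two-hour half-life. A minimal sketch of that reading follows; the helper name `decay_weight` and the exact weighting formula are assumptions for illustration, not taken from clustering.py.

```python
import math

# Assumption: LAMBDA is a per-hour decay rate whose half-life is 2 hours.
HALF_LIFE_HOURS = 2.0
LAMBDA = math.log(2) / HALF_LIFE_HOURS


def decay_weight(age_hours: float) -> float:
    """Hypothetical helper: exponential time-decay weight for an article.

    Negative ages are clamped to 0 so the weight never exceeds 1.
    """
    return math.exp(-LAMBDA * max(age_hours, 0.0))
```

Under this reading, an article collected two hours before the reference time counts half as much as a fresh one, and one collected four hours before counts a quarter as much.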

Backend (modified):
- app/main.py: register briefing_router (/api/briefing prefix). Scheduler
  registration is deferred to PR-3.
- app/services/digest/selection.py: parameterize select_for_llm (K and λ
  are injected by the caller). Phase 4 behavior is preserved through the
  default values.

Historical policy:
- BRIEFING_HISTORICAL_ENABLED env flag, default off.
- flag off → all historical_* columns NULL, prompt {historical_block} gets
  an empty label, retrieval is not called.
- flag on (to be enabled in PR-1b) → cosine top-K 5 (sim≥0.70) between the
  cluster centroid and doc embeddings from the past 30 days, injected into
  the prompt.
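
The flag-on path can be sketched as a thresholded cosine top-K lookup. This is only an illustration under stated assumptions: the function name `retrieve_historical_sketch`, the candidate dict shape (`id`, `embedding`), and the added `similarity` field are hypothetical and do not describe the actual retrieve_historical in comparator.py.

```python
import math

TOP_K = 5        # from the policy above
MIN_SIM = 0.70   # from the policy above


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def retrieve_historical_sketch(
    centroid: list[float],
    candidates: list[dict],
    top_k: int = TOP_K,
    min_sim: float = MIN_SIM,
) -> list[dict]:
    """Hypothetical: keep candidates with sim >= min_sim, best first, at most top_k."""
    scored = [(cosine(centroid, c["embedding"]), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_sim]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    # Return copies so the candidate pool itself is not mutated.
    return [dict(c, similarity=s) for s, c in kept[:top_k]]
```

An empty result corresponds to the "on, zero match" path: nothing clears the threshold, so there is nothing to inject into {historical_block}.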

Country canonical (after checking against real data):
- documents.country column confirmed absent
- document_chunks.country match rate 0% (chunks are not created for news at all)
- the only country signal = news_sources prefix mapping (same as Phase 4)

Tests:
- tests/test_briefing_historical.py: regression for 3 paths (flag off / on
  with fixture / on with zero matches) + sanitize cap + fallback row shape.
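
The sanitize caps exercised by these tests can be sketched as simple list truncation. The envelope key names below are assumptions (only country_perspectives appears in the schema above; the others are guessed from the cap labels), and `sanitize_envelope` is a hypothetical name rather than the function in comparator.py.

```python
# Caps from the commit message: perspectives 10 / divergences 3 /
# convergences 2 / quotes 5. Key names are assumed for illustration.
CAPS = {
    "country_perspectives": 10,
    "divergences": 3,
    "convergences": 2,
    "quotes": 5,
}


def sanitize_envelope(raw: dict) -> dict:
    """Hypothetical: truncate each capped list; replace non-list values with []."""
    out = dict(raw)  # shallow copy so the caller's dict is left untouched
    for key, cap in CAPS.items():
        value = out.get(key)
        out[key] = list(value)[:cap] if isinstance(value, list) else []
    return out
```

Coercing missing or malformed fields to an empty list is one possible design choice: it keeps the stored row shape predictable even when the LLM returns a partial envelope.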

Verification: GPU-container pytest + manual regenerate in PR-1.8.
2026-05-12 12:58:50 +09:00

64 lines
2.1 KiB
Python

"""Select LLM inputs within a cluster: top-k + MMR diversity + ai_summary truncation.

Pure top-relevance is biased toward duplicate summaries of the same event,
so MMR is used to ensure diversity. ai_summary length is capped at
SUMMARY_TRUNCATE to protect the LLM token budget.
"""
import numpy as np

from services.clustering_common import normalize_vector as _normalize

K_PER_CLUSTER = 5
LAMBDA_MMR = 0.7  # relevance 70% / diversity 30%
SUMMARY_TRUNCATE = 300  # guards against long-tail ai_summary values


def select_for_llm(
    cluster: dict,
    k: int = K_PER_CLUSTER,
    *,
    lambda_mmr: float = LAMBDA_MMR,
    summary_truncate: int = SUMMARY_TRUNCATE,
) -> list[dict]:
    """Select representative articles in a cluster for the LLM call.

    Args:
        cluster: a single cluster from clustering.cluster_country /
            briefing.cluster_global
        k: number of articles to select (Phase 4=5, briefing=7)
        lambda_mmr: relevance vs diversity (Phase 4=0.7, briefing=0.6)
        summary_truncate: truncation length for ai_summary (LLM token protection)

    Returns:
        List of selected doc dicts. Each item gains an ai_summary_truncated
        field. Member dicts are mutated in place (_rel, ai_summary_truncated).
    """
    members = cluster["members"]
    if len(members) <= k:
        # Small cluster: nothing to rank, keep every member.
        selected = list(members)
    else:
        centroid = cluster["centroid"]
        # Relevance = similarity to the cluster centroid, scaled by member weight.
        for m in members:
            v = _normalize(m["embedding"])
            m["_rel"] = float(np.dot(centroid, v)) * m["weight"]

        # Seed with the most relevant member, then add greedily by MMR score.
        first = max(members, key=lambda x: x["_rel"])
        selected = [first]
        candidates = [m for m in members if m is not first]

        while len(selected) < k and candidates:

            def mmr_score(c: dict) -> float:
                # Penalize a candidate by its closest already-selected neighbor.
                v = _normalize(c["embedding"])
                max_sim = max(
                    float(np.dot(v, _normalize(s["embedding"])))
                    for s in selected
                )
                return lambda_mmr * c["_rel"] - (1.0 - lambda_mmr) * max_sim

            pick = max(candidates, key=mmr_score)
            selected.append(pick)
            candidates.remove(pick)

    for m in selected:
        m["ai_summary_truncated"] = (m.get("ai_summary") or "")[:summary_truncate]
    return selected
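
To see how lambda_mmr shifts the selection, the MMR loop above can be replayed on a toy cluster. The sketch below is a standalone, dependency-free reimplementation for illustration only: `mmr_pick_ids`, the `id` field, and the 2-D embeddings are hypothetical, and it uses plain Python math instead of numpy and services.clustering_common.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """L2-normalize a vector; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mmr_pick_ids(members: list[dict], centroid: list[float],
                 k: int, lambda_mmr: float) -> list[str]:
    """Greedy MMR over member ids, mirroring the scoring in select_for_llm."""
    unit = {m["id"]: normalize(m["embedding"]) for m in members}
    rel = {m["id"]: dot(centroid, unit[m["id"]]) * m["weight"] for m in members}
    remaining = [m["id"] for m in members]
    picked = [max(remaining, key=rel.get)]  # seed: most relevant member
    remaining.remove(picked[0])
    while len(picked) < k and remaining:
        def score(i: str) -> float:
            max_sim = max(dot(unit[i], unit[j]) for j in picked)
            return lambda_mmr * rel[i] - (1.0 - lambda_mmr) * max_sim
        best = max(remaining, key=score)
        picked.append(best)
        remaining.remove(best)
    return picked


# Toy data: "a" and "b" are near-duplicates, "c" covers a different angle.
members = [
    {"id": "a", "embedding": [1.0, 0.0], "weight": 1.0},
    {"id": "b", "embedding": [0.98, 0.199], "weight": 1.0},
    {"id": "c", "embedding": [0.0, 1.0], "weight": 1.0},
]
centroid = normalize([0.8, 0.6])

print(mmr_pick_ids(members, centroid, k=2, lambda_mmr=0.7))  # ['b', 'c']
print(mmr_pick_ids(members, centroid, k=2, lambda_mmr=0.9))  # ['b', 'a']
```

With lambda_mmr=0.7 the second slot goes to "c" even though "a" is more relevant, because "a" nearly duplicates the already-selected "b". Raising lambda_mmr to 0.9 weakens the diversity penalty enough that "a" wins instead.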