431d4fe010
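One-page card for topic×country comparative analysis of news collected overnight (KST 00:00~05:00).
Code, logic, and tables are kept separate from the Phase 4 Global Digest; only the algorithm is shared via services/clustering_common.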
Backend (new):
- migrations/255_morning_briefings.sql: morning_briefings + briefing_topics
  (briefing_date UNIQUE, UNIQUE(briefing_id,topic_rank), FK CASCADE,
  three nullable historical_* columns, cluster_members JSONB, country_perspectives
  JSONB, 4-state status success|partial|failed|empty)
- app/models/briefing.py: SQLAlchemy ORM
- app/services/briefing/loader.py: KST 5h window + news_sources prefix
  fallback (mirrors the Phase 4 pattern) + historical candidate pool loader
- app/services/briefing/clustering.py: cluster_global topic-first
  (LAMBDA=ln(2)/2h, MIN_COUNTRIES_PER_TOPIC=2, MAX_TOPICS=7)
- app/services/briefing/comparator.py: call_primary 26B + JSON envelope
  sanitize (cap perspectives 10 / divergences 3 / convergences 2 /
  quotes 5) + fixed-shape fallback row + retrieve_historical cosine top-K
- app/services/briefing/pipeline.py: load→cluster→select(K=7,λ=0.6)
  →historical→compare→4-state status→delete+insert transaction
- app/workers/briefing_worker.py: shared entry point for APScheduler and
  manual invocation, 600s hard cap
- app/prompts/briefing_comparative.txt: Korean comparative-analysis JSON prompt,
  two sections {articles_block} + {historical_block}, do-not-quote label
- app/api/briefing.py: GET /latest, GET ?date=, POST /regenerate?date=
  (admin, sync delete+insert tx, regenerated:true)
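The `LAMBDA=ln(2)/2h` constant listed for clustering.py corresponds to an exponential decay with a 2-hour half-life. The following is a minimal sketch of such a weight, not the actual clustering code: the function name `decay_weight` and the way the weight is applied to article age are assumptions made for illustration.

```python
import math
from datetime import datetime, timedelta, timezone

# Hypothetical names; only the value ln(2)/2h comes from the commit message.
HALF_LIFE_HOURS = 2.0
LAMBDA = math.log(2) / HALF_LIFE_HOURS  # decay rate per hour


def decay_weight(created_at: datetime, reference: datetime) -> float:
    """Return exp(-LAMBDA * age_in_hours); future timestamps are clamped to age 0."""
    age_hours = max((reference - created_at).total_seconds() / 3600.0, 0.0)
    return math.exp(-LAMBDA * age_hours)


if __name__ == "__main__":
    ref = datetime(2025, 1, 15, 5, 0, tzinfo=timezone.utc)
    for hours in (0, 2, 4):
        print(hours, decay_weight(ref - timedelta(hours=hours), ref))
```

With this rate an article loses half of its weight every two hours, so an item collected at the start of the 5h window counts for roughly a sixth of one collected at the end.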
Backend (modified):
- app/main.py: register briefing_router (/api/briefing prefix). Scheduler
  registration is deferred to PR-3.
- app/services/digest/selection.py: parameterize select_for_llm (K and λ
  are injected by the caller). Phase 4 behavior is preserved through the default values.
Historical policy:
- BRIEFING_HISTORICAL_ENABLED env flag, default off.
- flag off → all historical_* columns NULL, prompt {historical_block} gets an
  empty label, no retrieval call is made.
- flag on (enabled in PR-1b) → cosine top-K 5 (sim≥0.70) between the cluster
  centroid and doc embeddings from the past 30 days, injected into the prompt.
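The flag-on retrieval step (cluster centroid vs. past doc embeddings, top-K 5 with sim≥0.70) can be sketched roughly as below. This is an illustration under assumptions, not the `retrieve_historical` implementation: the function name `top_k_by_cosine`, its signature, and the candidate dict shape (`id`, `embedding`) are hypothetical.

```python
import numpy as np


def top_k_by_cosine(centroid, candidates, k=5, min_sim=0.70):
    """Return up to k (candidate, similarity) pairs with cosine similarity >= min_sim.

    Hypothetical helper: `candidates` is a list of dicts carrying an "embedding" vector.
    """
    if not candidates:
        return []
    centroid = np.asarray(centroid, dtype=np.float64)
    matrix = np.stack([np.asarray(c["embedding"], dtype=np.float64) for c in candidates])
    # Cosine similarity = dot product divided by the product of the norms.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(centroid)
    sims = np.divide(matrix @ centroid, denom, out=np.zeros_like(denom), where=denom > 0)
    results = []
    for idx in np.argsort(-sims)[:k]:
        if sims[idx] < min_sim:
            break  # sorted descending, so everything after this is below the threshold
        results.append((candidates[idx], float(sims[idx])))
    return results


if __name__ == "__main__":
    pool = [
        {"id": 1, "embedding": np.array([1.0, 0.0])},
        {"id": 2, "embedding": np.array([0.8, 0.6])},
        {"id": 3, "embedding": np.array([0.0, 1.0])},
    ]
    for doc, sim in top_k_by_cosine(np.array([1.0, 0.0]), pool):
        print(doc["id"], round(sim, 3))
```

Zero-norm vectors receive a similarity of 0 instead of raising a division warning, so they fall below the threshold and are never returned.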
Country canonical (after checking actual data):
- documents.country column confirmed absent
- document_chunks.country match rate 0% (chunks are not created for news at all)
- the only country signal = news_sources prefix mapping (same as Phase 4)
Tests:
- tests/test_briefing_historical.py: 3-path regression (flag off / on with
  fixture / on with zero match) + sanitize cap + fallback row shape.
Verification: GPU-container pytest + manual regenerate in PR-1.8.
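The sanitize caps covered by the tests (perspectives 10 / divergences 3 / convergences 2 / quotes 5) could look roughly like the sketch below. The key names, the flat top-level layout, and the function name `sanitize_envelope` are assumptions for illustration; the real envelope in comparator.py may nest these fields differently.

```python
from typing import Any

# Hypothetical key names; only the cap values come from the commit message.
CAPS = {
    "country_perspectives": 10,
    "divergences": 3,
    "convergences": 2,
    "quotes": 5,
}


def sanitize_envelope(payload: Any) -> dict[str, list]:
    """Keep only the known list fields and truncate each to its cap.

    Anything that is not a dict, or any field that is not a list, becomes an
    empty list so downstream code can rely on a fixed shape.
    """
    if not isinstance(payload, dict):
        payload = {}
    clean: dict[str, list] = {}
    for key, cap in CAPS.items():
        value = payload.get(key)
        clean[key] = list(value[:cap]) if isinstance(value, list) else []
    return clean


if __name__ == "__main__":
    raw = {"country_perspectives": list(range(12)), "divergences": "oops", "extra": 1}
    print(sanitize_envelope(raw))
```

Returning the same keys for malformed input keeps the fallback row and the normal row structurally identical, which is the property the fallback-row regression test checks for.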
200 lines
6.3 KiB
Python
"""Load the 5-hour overnight news window, normalize country, and (optionally) load candidates from the past N days.

- documents collected between KST midnight and 05:00 (source_channel='news' OR ai_domain='News').
- country canonical = first non-null document_chunks.country → news_sources prefix fallback (same as Phase 4).
- Rows with NULL ai_summary/embedding are excluded (zero re-summarization/re-embedding principle).
- Returns: a list of doc dicts (input for topic-first clustering; country is a field of each dict).
- Historical doc candidates for past retrieval come from a separate function (when BRIEFING_HISTORICAL_ENABLED is on).
"""

from datetime import datetime
from typing import Any

import numpy as np
from sqlalchemy import text

from core.database import async_session
from core.utils import setup_logger

logger = setup_logger("briefing_loader")


_NEWS_WINDOW_SQL = text("""
    SELECT
        d.id,
        d.title,
        d.ai_summary,
        d.embedding,
        d.created_at,
        d.edit_url,
        d.ai_sub_group,
        (
            SELECT c.country
            FROM document_chunks c
            WHERE c.doc_id = d.id AND c.country IS NOT NULL
            LIMIT 1
        ) AS chunk_country
    FROM documents d
    WHERE (d.source_channel = 'news' OR d.ai_domain = 'News')
      AND d.deleted_at IS NULL
      AND d.created_at >= :window_start
      AND d.created_at < :window_end
      AND d.embedding IS NOT NULL
      AND d.ai_summary IS NOT NULL
""")


_SOURCE_COUNTRY_SQL = text("""
    SELECT name, country FROM news_sources WHERE country IS NOT NULL
""")


_HISTORICAL_CANDIDATES_SQL = text("""
    SELECT
        d.id,
        d.title,
        d.ai_summary,
        d.embedding,
        d.created_at
    FROM documents d
    WHERE (d.source_channel = 'news' OR d.ai_domain = 'News')
      AND d.deleted_at IS NULL
      AND d.created_at >= :hist_start
      AND d.created_at < :hist_end
      AND d.embedding IS NOT NULL
      AND d.ai_summary IS NOT NULL
""")


def _to_numpy_embedding(raw: Any) -> np.ndarray | None:
    # Accept either a sequence of floats or its JSON string form;
    # return None for anything that cannot be parsed or is empty.
    if raw is None:
        return None
    if isinstance(raw, str):
        import json
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return None
    try:
        arr = np.asarray(raw, dtype=np.float32)
    except (TypeError, ValueError):
        return None
    if arr.size == 0:
        return None
    return arr


async def _load_source_country_map(session) -> dict[str, str]:
    """Map news_sources name prefix → country (mirrors the Phase 4 pattern)."""
    rows = await session.execute(_SOURCE_COUNTRY_SQL)
    mapping: dict[str, str] = {}
    for name, country in rows:
        if not name or not country:
            continue
        # Short prefix: the first space-separated token of the source name.
        prefix = name.split(" ")[0].strip()
        if prefix and prefix not in mapping:
            mapping[prefix] = country
        # Long prefix: every token except the last, for names with 3+ tokens.
        tokens = name.split(" ")
        if len(tokens) >= 3:
            source_prefix = " ".join(tokens[:-1]).strip()
            if source_prefix and source_prefix not in mapping:
                mapping[source_prefix] = country
    return mapping


async def load_night_window(
    window_start: datetime,
    window_end: datetime,
) -> list[dict]:
    """Return overnight-window news docs as a list with country filled in.

    Returns:
        [{id, title, ai_summary, embedding, created_at, edit_url, ai_sub_group, country}, ...]
        Docs whose country mapping fails are dropped (cross-country comparison is the core purpose).
    """
    docs: list[dict] = []
    null_country = 0

    async with async_session() as session:
        source_country = await _load_source_country_map(session)

        result = await session.execute(
            _NEWS_WINDOW_SQL,
            {"window_start": window_start, "window_end": window_end},
        )
        for row in result.mappings():
            embedding = _to_numpy_embedding(row["embedding"])
            if embedding is None:
                continue

            # Prefer the chunk-level country; fall back to the news_sources prefix map.
            country = row["chunk_country"]
            if not country:
                ai_sub_group = (row["ai_sub_group"] or "").strip()
                if ai_sub_group:
                    country = source_country.get(ai_sub_group)
            if not country:
                null_country += 1
                continue

            docs.append({
                "id": int(row["id"]),
                "title": row["title"] or "",
                "ai_summary": row["ai_summary"] or "",
                "embedding": embedding,
                "created_at": row["created_at"],
                "edit_url": row["edit_url"] or "",
                "ai_sub_group": row["ai_sub_group"] or "",
                "country": country.upper(),
            })

    if null_country:
        logger.warning(
            f"[loader] dropped {null_country} docs with no country mapping "
            "(both chunk_country and news_sources prefix failed)"
        )
    logger.info(
        f"[loader] night window {window_start} ~ {window_end} → "
        f"{len(docs)} docs ({len({d['country'] for d in docs})} countries)"
    )
    return docs


async def load_historical_candidates(
    hist_start: datetime,
    hist_end: datetime,
    exclude_ids: set[int],
) -> list[dict]:
    """Candidate docs from the past N days (called only when BRIEFING_HISTORICAL_ENABLED=true).

    Raw candidate pool for cosine comparison against cluster centroids. No country mapping
    (used only as LLM analysis input, never displayed).

    Args:
        exclude_ids: article ids in today's window (avoids retrieving duplicates).

    Returns:
        [{id, title, ai_summary, embedding, created_at}, ...]
    """
    out: list[dict] = []
    async with async_session() as session:
        result = await session.execute(
            _HISTORICAL_CANDIDATES_SQL,
            {"hist_start": hist_start, "hist_end": hist_end},
        )
        for row in result.mappings():
            doc_id = int(row["id"])
            if doc_id in exclude_ids:
                continue
            embedding = _to_numpy_embedding(row["embedding"])
            if embedding is None:
                continue
            out.append({
                "id": doc_id,
                "title": row["title"] or "",
                "ai_summary": row["ai_summary"] or "",
                "embedding": embedding,
                "created_at": row["created_at"],
            })
    logger.info(f"[loader] historical candidates: {len(out)} docs (window {hist_start.date()} ~ {hist_end.date()})")
    return out
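
The loader takes `window_start` and `window_end` from its caller, so the KST 00:00~05:00 bounds have to be computed elsewhere. A minimal sketch of that computation follows; the helper name `night_window` is hypothetical, and whether the pipeline passes timezone-aware or naive datetimes to the query is not shown in this file, so the aware values here are an assumption.

```python
from datetime import date, datetime, time, timedelta, timezone

# KST is UTC+9 with no daylight saving time, so a fixed offset is sufficient.
KST = timezone(timedelta(hours=9), name="KST")


def night_window(briefing_date: date) -> tuple[datetime, datetime]:
    """Return (start, end) for the half-open window [00:00, 05:00) KST on briefing_date."""
    start = datetime.combine(briefing_date, time(0, 0), tzinfo=KST)
    return start, start + timedelta(hours=5)


if __name__ == "__main__":
    start, end = night_window(date(2025, 1, 15))
    print(start.isoformat(), end.isoformat())
    # The pair would then be passed on, for example:
    # docs = await load_night_window(start, end)
```

The half-open interval matches the `>= :window_start` and `< :window_end` comparisons in the window query, so a document stamped exactly 05:00 is excluded.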