hyungi_document_server/migrations/101_global_digests.sql
Hyungi Ahn 75a1919342 feat(digest): Phase 4 Global News Digest (cluster-level batch summarization)
Groups news from a 7-day rolling window into a two-level country × topic hierarchy, generated as a daily batch at 04:00 KST.
Does not use the search pipeline. Flow: documents → clustering → cluster-level LLM summarization → digest.

Key decisions:
- adaptive threshold (0.75/0.78/0.80) + EMA centroid (α=0.7) + time-decay (λ=ln(2)/3 ≈ 0.231)
- min_articles=3, max_topics=10 per country, top-5 MMR diversity selection, ai_summary truncated to [:300]
- cluster-level LLM only; the fallback never drops a cluster (topic_label="주요 뉴스 묶음", i.e. "major news bundle", + top member's ai_summary[:200])
- importance_score normalized to 0 to 1 per country, raw_weight_sum preserved separately, max(score, 0.01) floor
- per-call timeout 25s + pipeline hard cap 600s
- idempotent DELETE+INSERT (UNIQUE digest_date); calls AIClient._call_chat directly (no changes to client.py)
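
A minimal sketch of how the time-decay weight and the per-country importance_score could fit together. This is not the code in app/services/digest; it assumes article age is measured in days (so λ=ln(2)/3 gives a 3-day half-life) and that normalization divides by the largest raw_weight_sum within each country. The function names and the dict layout are hypothetical.

```python
import math
from collections import defaultdict

# Assumption: age is in days, so lambda = ln(2)/3 halves a weight every 3 days.
DECAY_LAMBDA = math.log(2) / 3  # ~0.231
SCORE_FLOOR = 0.01              # max(score, 0.01) floor from the commit message


def decay_weight(age_days: float, lam: float = DECAY_LAMBDA) -> float:
    """Exponential time-decay weight for one article."""
    return math.exp(-lam * age_days)


def score_clusters(clusters: list[dict]) -> list[dict]:
    """Fill raw_weight_sum and importance_score on each cluster dict.

    Each cluster is a dict with 'country' and 'article_ages' (ages in days).
    Normalizing by the per-country maximum is an assumption of this sketch.
    """
    max_by_country: dict[str, float] = defaultdict(float)
    for c in clusters:
        c["raw_weight_sum"] = sum(decay_weight(a) for a in c["article_ages"])
        max_by_country[c["country"]] = max(
            max_by_country[c["country"]], c["raw_weight_sum"]
        )
    for c in clusters:
        top = max_by_country[c["country"]]
        score = c["raw_weight_sum"] / top if top > 0 else 0.0
        # The floor keeps very old clusters from collapsing to zero.
        c["importance_score"] = max(score, SCORE_FLOOR)
    return clusters
```

Keeping raw_weight_sum next to the normalized score means the un-normalized value stays available for debugging and day-over-day comparison, since the 0 to 1 score only ranks topics inside one batch.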

New:
- migrations/101_global_digests.sql (two normalized tables)
- app/models/digest.py (GlobalDigest + DigestTopic ORM)
- app/services/digest/{loader,clustering,selection,summarizer,pipeline}.py
- app/workers/digest_worker.py (PIPELINE_HARD_CAP + CLI entry point)
- app/api/digest.py (/latest, ?date|country, /regenerate, inline Pydantic)
- app/prompts/digest_topic.txt (JSON-only output + a strict prohibitions block)

main.py: 4 lines (2 imports + 1 scheduler add_job + 1 include_router).
plan: ~/.claude/plans/quiet-herding-tome.md
2026-04-09 07:45:11 +09:00

58 lines
3.2 KiB
SQL
-- Phase 4 Global News Digest
-- Groups news from a 7-day rolling window into a two-level country × topic hierarchy; generated as a daily batch at 04:00 KST
-- Does not use the search pipeline. documents → clustering → cluster-level LLM summarization → digest
-- User decisions: country→topic 2-level, cluster-level LLM only, fallback that never drops a cluster,
-- adaptive threshold, EMA centroid, time-decay (λ=ln(2)/3 ≈ 0.231)

-- Parent table: metadata for each daily digest run
CREATE TABLE global_digests (
    id BIGSERIAL PRIMARY KEY,
    digest_date DATE NOT NULL,  -- generation date (KST)
    window_start TIMESTAMPTZ NOT NULL,  -- rolling window start (UTC)
    window_end TIMESTAMPTZ NOT NULL,  -- generation time (UTC)
    decay_lambda DOUBLE PRECISION NOT NULL,  -- time-decay λ actually used
    total_articles INTEGER NOT NULL DEFAULT 0,
    total_countries INTEGER NOT NULL DEFAULT 0,
    total_topics INTEGER NOT NULL DEFAULT 0,
    generation_ms INTEGER,  -- worker run time in ms (for detecting performance regressions)
    llm_calls INTEGER NOT NULL DEFAULT 0,
    llm_failures INTEGER NOT NULL DEFAULT 0,  -- = number of fallbacks used
    status VARCHAR(20) NOT NULL DEFAULT 'success',  -- success | partial | failed
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (digest_date)  -- idempotency: a rerun for the same date does DELETE+INSERT
);
CREATE INDEX idx_global_digests_date ON global_digests (digest_date DESC);

-- Child table: one row per country × topic
CREATE TABLE digest_topics (
    id BIGSERIAL PRIMARY KEY,
    digest_id BIGINT NOT NULL REFERENCES global_digests(id) ON DELETE CASCADE,
    country VARCHAR(10) NOT NULL,  -- KR | US | JP | CN | FR | DE | ...
    topic_rank INTEGER NOT NULL,  -- 1..N within a country (descending importance_score)
    topic_label TEXT NOT NULL,  -- LLM-generated Korean label, 5-10 words (or "주요 뉴스 묶음" on fallback)
    summary TEXT NOT NULL,  -- LLM-generated, 1-2 factual sentences (or top member's ai_summary[:200] on fallback)
    article_ids JSONB NOT NULL,  -- [doc_id, ...] injected by code (the LLM must not generate these)
    article_count INTEGER NOT NULL,  -- = jsonb_array_length(article_ids)
    importance_score DOUBLE PRECISION NOT NULL,  -- normalized to 0-1 per country within the batch (for cross-country comparison)
    raw_weight_sum DOUBLE PRECISION NOT NULL,  -- decay-weighted sum before normalization (debugging + day-over-day trends)
    centroid_sample JSONB,  -- debug: list of doc ids fed to the LLM + summary hash
    llm_model VARCHAR(100),  -- model used (tracks primary vs. fallback)
    llm_fallback_used BOOLEAN NOT NULL DEFAULT FALSE,  -- whether the minimal fallback was applied after an LLM failure
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_digest_topics_digest ON digest_topics (digest_id);
CREATE INDEX idx_digest_topics_country ON digest_topics (country);
CREATE INDEX idx_digest_topics_rank ON digest_topics (digest_id, country, topic_rank);
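
The UNIQUE (digest_date) constraint together with ON DELETE CASCADE is what makes a same-day rerun a plain DELETE+INSERT. The sketch below illustrates that pattern only; it is not the worker's code. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, trims both tables to a few columns, and the names open_db and regenerate are hypothetical.

```python
import sqlite3

# Reduced stand-in schema (assumption: only the columns needed for the demo).
SCHEMA = """
CREATE TABLE global_digests (
    id INTEGER PRIMARY KEY,
    digest_date TEXT NOT NULL UNIQUE
);
CREATE TABLE digest_topics (
    id INTEGER PRIMARY KEY,
    digest_id INTEGER NOT NULL REFERENCES global_digests(id) ON DELETE CASCADE,
    country TEXT NOT NULL,
    topic_rank INTEGER NOT NULL,
    topic_label TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys (and therefore cascades) only when enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def regenerate(conn: sqlite3.Connection, digest_date: str, topics: list[tuple]) -> None:
    """Replace the digest for digest_date; topics are (country, rank, label)."""
    with conn:  # one transaction: commits on success, rolls back on error
        # Deleting the parent row cascades to its digest_topics rows.
        conn.execute(
            "DELETE FROM global_digests WHERE digest_date = ?", (digest_date,)
        )
        cur = conn.execute(
            "INSERT INTO global_digests (digest_date) VALUES (?)", (digest_date,)
        )
        conn.executemany(
            "INSERT INTO digest_topics (digest_id, country, topic_rank, topic_label) "
            "VALUES (?, ?, ?, ?)",
            [(cur.lastrowid, country, rank, label) for country, rank, label in topics],
        )
```

Calling regenerate twice with the same digest_date leaves exactly one parent row and only the topics from the second call, which is the behavior a /regenerate-style endpoint would rely on.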