hyungi_document_server/scripts/phase4_health.sql
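Hyungi Ahn 8074be6b6d feat(study): Phase 4-D operational monitoring + confidence calibration
The first validation of Phase 4-B v1 found cases where the model assigns confidence='high'
even for topics with insufficient material. That is more confident than the definition
(high = the pattern is clear from the material plus other ai_explanation entries) allows,
which puts UX trust at risk. Add an automatic cap and operational monitoring SQL.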

confidence calibration (services/study/session_summary_guard):
- New calibrate_confidence(c, ctx_docs_count, ready_explanation_count)
  · ctx_docs_count == 0 AND ready_explanation_count == 0 → cap at 'low'
  · ctx_docs_count == 0 (only ready explanations)        → cap at 'medium'
  · ctx_docs_count >= 1                                   → model value as-is
- If the model's value is already more conservative than the cap (model 'low' + cap 'medium'),
  it is preserved: a more conservative value is never raised
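
The capping rules above can be sketched roughly as follows. This is a minimal illustration, not the code in services/study/session_summary_guard: the `_ORDER` table and the choice to normalize invalid values to 'low' are assumptions (the commit only says invalid values are normalized first).

```python
# Sketch only; the real calibrate_confidence may differ in details.
# Assumption: confidence levels are ordered low < medium < high.
_ORDER = {"low": 0, "medium": 1, "high": 2}


def calibrate_confidence(c, ctx_docs_count, ready_explanation_count):
    """Cap the model's confidence according to how much evidence was available."""
    # Assumption: anything that is not a known level is normalized to 'low'.
    value = c.strip().lower() if isinstance(c, str) else ""
    if value not in _ORDER:
        value = "low"

    # With at least one context document, the model's value passes through.
    if ctx_docs_count >= 1:
        return value

    # No documents: cap at 'medium' if ready explanations exist, else 'low'.
    cap = "medium" if ready_explanation_count > 0 else "low"

    # Only ever lower the value; a more conservative model value is kept.
    return value if _ORDER[value] <= _ORDER[cap] else cap
```

For example, `calibrate_confidence("high", 0, 2)` yields `"medium"`, while `calibrate_confidence("low", 0, 2)` stays `"low"`.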

worker integration (study_session_analysis_worker):
- ctx_docs_count = len(ctx_docs)
- ready_explanation_count = sum(1 for a in prompt_attempts if a.get('ai_explanation'))
- Call calibrate_confidence → the result is written to study_quiz_session_analysis.confidence
- Preserve operational analysis metadata in job.payload:
  · ctx_docs_count / ready_explanation_count
  · model_confidence_raw (model response) vs calibrated_confidence (after the cap)
  · prompt_attempts / valid_attempts_total / summary_len
  → SQL query 4 measures how often the cap fires
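
As a rough sketch of the worker step above, the metadata could be assembled like this. The function name `build_job_payload`, its parameters, and storing `prompt_attempts` as a count are assumptions made for illustration; only the payload keys and the two count expressions come from the commit message.

```python
def build_job_payload(ctx_docs, prompt_attempts, model_confidence_raw,
                      valid_attempts_total, summary, calibrate):
    """Hypothetical helper: compute evidence counts and the payload metadata.

    `calibrate` is any callable with the calibrate_confidence signature
    (c, ctx_docs_count, ready_explanation_count).
    """
    ctx_docs_count = len(ctx_docs)
    # Count only attempts that already carry a non-empty ai_explanation.
    ready_explanation_count = sum(
        1 for a in prompt_attempts if a.get("ai_explanation")
    )
    calibrated = calibrate(
        model_confidence_raw, ctx_docs_count, ready_explanation_count
    )
    return {
        "ctx_docs_count": ctx_docs_count,
        "ready_explanation_count": ready_explanation_count,
        "model_confidence_raw": model_confidence_raw,
        "calibrated_confidence": calibrated,
        # Assumption: the payload stores the number of attempts, not the list.
        "prompt_attempts": len(prompt_attempts),
        "valid_attempts_total": valid_attempts_total,
        "summary_len": len(summary),
    }
```

Keeping both the raw and the calibrated value in the payload is what lets a later query compare them without re-running the model.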

scripts/phase4_health.sql (new operational health-check SQL, 7 sections):
1. 4-A study_question_jobs status × error_code distribution
2. 4-B study_quiz_session_jobs status × error_code distribution
3. 4-B confidence distribution (calibrated)
4. 4-B model_confidence_raw vs calibrated difference (how often the cap fires)
5. 4-A/4-B processing latency over the last 7 days: p50/p95/max/avg
6. 4-A/4-B skipped-reason distribution
7. 4-B guard_fail / parse_fail / llm_timeout ratio
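
To illustrate how the output of section 4 might be read, the sketch below computes the share of jobs whose confidence was changed by calibration. The function `cap_fire_rate` and the dict row shape are hypothetical; the keys mirror the column aliases in the query (`model_raw`, `calibrated`, `cnt`). Note that a raw/calibrated mismatch can also come from normalization of an invalid value, not only from the cap, and that the query's LIMIT 20 may truncate the rows.

```python
def cap_fire_rate(rows):
    """Return the fraction of jobs where calibrated differs from the raw value.

    `rows` is an iterable of dicts with keys 'model_raw', 'calibrated', 'cnt',
    as a client might build them from the section 4 result set.
    """
    rows = list(rows)
    total = sum(r["cnt"] for r in rows)
    if total == 0:
        return 0.0  # no completed jobs with calibration metadata yet
    changed = sum(r["cnt"] for r in rows if r["model_raw"] != r["calibrated"])
    return changed / total


if __name__ == "__main__":
    sample = [
        {"model_raw": "high", "calibrated": "low", "cnt": 3},
        {"model_raw": "high", "calibrated": "high", "cnt": 5},
        {"model_raw": "medium", "calibrated": "medium", "cnt": 2},
    ]
    print(cap_fire_rate(sample))
```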

ship gate (unit tests):
- test_calibrate_confidence_no_evidence_caps_to_low (3 cases)
- test_calibrate_confidence_only_explanations_caps_to_medium (3 cases)
- test_calibrate_confidence_with_documents_passthrough (3 cases)
- test_calibrate_confidence_normalizes_invalid_first (2 cases)

Plan: ~/.claude/plans/nifty-sparking-spindle.md (Phase 4-B v1 follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 07:33:57 +09:00

120 lines
4.0 KiB
SQL
-- Phase 4 operational health-check SQL: 4-A (study_question_jobs) + 4-B (study_quiz_session_jobs)
-- Usage:
--   ssh gpu 'docker exec -i hyungi_document_server-postgres-1 psql -U pkm pkm' < scripts/phase4_health.sql
-- Or run individual sections only. Every section is read-only.
\echo '── 1. 4-A study_question_jobs status × error_code distribution ──'
SELECT
    status,
    COALESCE(error_code, '(none)') AS error_code,
    COUNT(*) AS cnt
FROM study_question_jobs
GROUP BY status, error_code
ORDER BY status, error_code;
\echo ''
\echo '── 2. 4-B study_quiz_session_jobs status × error_code distribution ──'
SELECT
    status,
    COALESCE(error_code, '(none)') AS error_code,
    COUNT(*) AS cnt
FROM study_quiz_session_jobs
GROUP BY status, error_code
ORDER BY status, error_code;
\echo ''
\echo '── 3. 4-B study_quiz_session_analysis confidence distribution (calibrated) ──'
SELECT
    COALESCE(confidence, '(null)') AS confidence,
    COUNT(*) AS cnt,
    COUNT(*) FILTER (WHERE is_stale) AS stale_count
FROM study_quiz_session_analysis
GROUP BY confidence
ORDER BY
    CASE COALESCE(confidence, '(null)')
        WHEN 'high' THEN 0
        WHEN 'medium' THEN 1
        WHEN 'low' THEN 2
        ELSE 3
    END;
\echo ''
\echo '── 4. 4-B confidence calibration difference (from job.payload) ──'
\echo '   model_confidence_raw vs calibrated_confidence: how often the insufficient-material cap fires'
SELECT
    payload->>'model_confidence_raw' AS model_raw,
    payload->>'calibrated_confidence' AS calibrated,
    (payload->>'ctx_docs_count')::int AS docs_n,
    (payload->>'ready_explanation_count')::int AS ready_n,
    COUNT(*) AS cnt
FROM study_quiz_session_jobs
WHERE status = 'completed'
  AND payload IS NOT NULL
  AND payload ? 'model_confidence_raw'
GROUP BY model_raw, calibrated, docs_n, ready_n
ORDER BY cnt DESC
LIMIT 20;
\echo ''
\echo '── 5. 4-A/4-B processing latency over the last 7 days (created_at → completed_at) ──'
\echo '   p50/p95/max via plain ROUND(EXTRACT). 4-A and 4-B are reported separately.'
SELECT
    'study_question_jobs' AS source,
    COUNT(*) AS terminal_n,
    ROUND(AVG(EXTRACT(EPOCH FROM (completed_at - created_at)))::numeric, 1) AS avg_sec,
    ROUND(MAX(EXTRACT(EPOCH FROM (completed_at - created_at)))::numeric, 1) AS max_sec,
    ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at))
    ))::numeric, 1) AS p50_sec,
    ROUND((PERCENTILE_CONT(0.95) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at))
    ))::numeric, 1) AS p95_sec
FROM study_question_jobs
WHERE created_at >= NOW() - INTERVAL '7 days'
  AND completed_at IS NOT NULL
UNION ALL
SELECT
    'study_quiz_session_jobs',
    COUNT(*),
    ROUND(AVG(EXTRACT(EPOCH FROM (completed_at - created_at)))::numeric, 1),
    ROUND(MAX(EXTRACT(EPOCH FROM (completed_at - created_at)))::numeric, 1),
    ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at))
    ))::numeric, 1),
    ROUND((PERCENTILE_CONT(0.95) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (completed_at - created_at))
    ))::numeric, 1)
FROM study_quiz_session_jobs
WHERE created_at >= NOW() - INTERVAL '7 days'
  AND completed_at IS NOT NULL;
\echo ''
\echo '── 6. 4-A/4-B skipped-reason distribution (which data shortage blocks the most jobs) ──'
SELECT
    'study_question_jobs' AS source,
    error_code,
    COUNT(*) AS cnt
FROM study_question_jobs
WHERE status = 'skipped'
GROUP BY error_code
UNION ALL
SELECT
    'study_quiz_session_jobs',
    error_code,
    COUNT(*)
FROM study_quiz_session_jobs
WHERE status = 'skipped'
GROUP BY error_code
ORDER BY source, cnt DESC;
\echo ''
\echo '── 7. 4-B guard_fail / parse_fail / llm_timeout ratio (relative to all jobs) ──'
SELECT
    error_code,
    COUNT(*) AS cnt,
    ROUND(100.0 * COUNT(*) / NULLIF((SELECT COUNT(*) FROM study_quiz_session_jobs), 0), 1) AS pct
FROM study_quiz_session_jobs
WHERE error_code IN ('guard_fail', 'parse_fail', 'llm_timeout', 'unknown')
GROUP BY error_code
ORDER BY cnt DESC;