feat: overhaul the classification scheme (taxonomy + document_type + confidence)

- config.yaml: define a taxonomy of 6 domains × 3 levels and 13 document_types
- classify.txt: English prompt; classification by taxonomy path, with classification rules injected
- classify_worker: taxonomy validation, confidence-based classification, document_type storage
- migration 008: add the document_type, importance, and ai_confidence columns
- API: add document_type, importance, and ai_confidence to DocumentResponse

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hyungi Ahn
2026-04-03 13:32:20 +09:00
parent 770d38b72c
commit 6d73e7ee12
6 changed files with 227 additions and 67 deletions


@@ -1,51 +1,93 @@
-You are a document classification AI. Analyze the document below and respond strictly in JSON format. Do not output any other text.
+You are a document classification AI. Analyze the document below and respond ONLY in JSON format. No other text.
-## Response format
+## Response Format
 {
-"tags": ["tag1", "tag2", "tag3"],
-"domain": "domain path",
-"sub_group": "sub-group",
-"sourceChannel": "source channel",
-"dataOrigin": "work or external"
+"domain": "Level1/Level2/Level3",
+"document_type": "one of document_types",
+"confidence": 0.85,
+"tags": ["tag1", "tag2"],
+"importance": "medium",
+"sourceChannel": "inbox_route",
+"dataOrigin": "work or external"
 }
-## Domain options (NAS folder paths)
-- Knowledge/Philosophy: philosophy, thought, humanities
-- Knowledge/Language: language study, translation, linguistics
-- Knowledge/Engineering: general engineering and technical documents
-- Knowledge/Industrial_Safety: industrial safety, regulations, certification
-- Knowledge/Programming: development, code, IT
-- Knowledge/General: general books, reading notes, memos
-- Reference: drawings, reference material, specification tables
+## Domain Taxonomy (select the most specific leaf node)
-## Sub-group examples (by domain)
-- Knowledge/Industrial_Safety: Legislation, Standards, Cases
-- Knowledge/Programming: Language, Framework, DevOps, AI_ML
-- Knowledge/Engineering: Mechanical, Electrical, Network
-- If unsure: (leave blank)
+Philosophy/
+Ethics, Metaphysics, Epistemology, Logic, Aesthetics, Eastern_Philosophy, Western_Philosophy
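-## Tag scheme
-Up to 5 tags, written in Korean. Choose from the hierarchy below:
-- @상태/ (status): 처리중 (in progress), 검토필요 (needs review), 완료 (done), 아카이브 (archived)
-- #주제/기술/ (topic/technology): 서버관리 (server administration), 네트워크 (networking), AI-ML
-- #주제/산업안전/ (topic/industrial safety): 법령 (legislation), 위험성평가 (risk assessment), 순회점검 (patrol inspection), 안전교육 (safety education), 사고사례 (accident cases), 신고보고 (reporting), 안전관리자 (safety manager), 보건관리자 (health manager)
-- #주제/업무/ (topic/work): 프로젝트 (project), 회의 (meeting), 보고서 (report)
-- $유형/ (type): 논문 (paper), 법령 (legislation), 기사 (article), 메모 (memo), 이메일 (email), 채팅로그 (chat log), 도면 (drawing), 체크리스트 (checklist)
-- !우선순위/ (priority): 긴급 (urgent), 중요 (important), 참고 (reference)
+Language/
+Korean, English, Japanese, Translation, Linguistics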
-## sourceChannel values
-- tksafety: work records from the TKSafety API
-- devonagent: automatically collected news
-- law_monitor: legislation changes from the law API
-- inbox_route: Inbox AI classification (classification performed by this prompt)
-- email: MailPlus email
-- web_clip: Web Clipper scraps
-- manual: added manually
-- drive_sync: Synology Drive sync
+Engineering/
+Mechanical/ Piping, HVAC, Equipment
+Electrical/ Power, Instrumentation
+Chemical/ Process, Material
+Civil
+Network/ Server, Security, Infrastructure
-## dataOrigin values
-- work: related to in-house business (TK, Technicalkorea, factory, production, internal)
-- external: external reference material (news, papers, legislation, general information)
+Industrial_Safety/
+Legislation/ Act, Decree, Foreign_Law, Korea_Law_Archive, Enforcement_Rule, Public_Notice, SAPA
+Theory/ Industrial_Safety_General, Safety_Health_Fundamentals
+Academic_Papers/ Safety_General, Risk_Assessment_Research
+Cases/ Domestic, International
+Practice/ Checklist, Contractor_Management, Safety_Education, Emergency_Plan, Patrol_Inspection, Permit_to_Work, PPE, Safety_Plan
+Risk_Assessment/ KRAS, JSA, Checklist_Method
+Safety_Manager/ Appointment, Duty_Record, Improvement, Inspection, Meeting
+Health_Manager/ Appointment, Duty_Record, Ergonomics, Health_Checkup, Mental_Health, MSDS, Work_Environment
-## Document to classify
+Programming/
+Programming_Language/ Python, JavaScript, Go, Rust
+Framework/ FastAPI, SvelteKit, React
+DevOps/ Docker, CI_CD, Linux_Administration
+AI_ML/ Large_Language_Model, Computer_Vision, Data_Science
+Database
+Software_Architecture
+General/
+Reading_Notes, Self_Development, Business, Science, History
+## Classification Rules
+- domain MUST be the most specific leaf node (e.g., Industrial_Safety/Practice/Patrol_Inspection, NOT Industrial_Safety/Practice)
+- domain MUST be exactly ONE path
+- If content spans multiple domains, choose by PRIMARY purpose
+- If safety content is >30%, prefer Industrial_Safety
+- If code is included, prefer Programming
+- 2-level paths allowed ONLY when no leaf exists (e.g., Engineering/Civil)
+## Document Types (select exactly ONE)
+Reference, Standard, Manual, Drawing, Template, Note, Academic_Paper, Law_Document, Report, Memo, Checklist, Meeting_Minutes, Specification
+### Document Type Detection Rules
+- Step-by-step instructions → Manual
+- Legal clauses/regulations → Law_Document
+- Technical requirements → Specification
+- Meeting discussion → Meeting_Minutes
+- Checklist format → Checklist
+- Academic/research format → Academic_Paper
+- Technical drawings → Drawing
+- If unclear → Note
+## Confidence (0.0 ~ 1.0)
+- How confident are you in the domain classification?
+- 0.85+ = high confidence, 0.6~0.85 = moderate, <0.6 = uncertain
+## Tags
+- Free-form tags (Korean or English)
+- Include: person names, technology names, concepts, project names
+- Maximum 5 tags
+## Importance
+- high: urgent or critical documents
+- medium: normal working documents
+- low: reference or archive material
+## sourceChannel
+- inbox_route (this classification)
+## dataOrigin
+- work: company-related (TK, Technicalkorea, factory, production)
+- external: external reference (news, papers, laws, general info)
+## Document to classify
 {document_text}
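
The rules in this prompt (leaf-node domain paths, a closed set of document types, and the confidence bands) can also be checked on the consumer side once the model replies. The following is a minimal sketch of such a check, not the actual classify_worker code from this commit: the names `TAXONOMY`, `is_leaf_path`, `confidence_band`, and `validate_classification` are hypothetical, and the taxonomy shown is only a small excerpt of what config.yaml would define.

```python
import json

# Hypothetical excerpt of the taxonomy; the full tree lives in config.yaml.
# An empty list means the 2-level path has no children and is itself a leaf.
TAXONOMY = {
    "Engineering": {
        "Mechanical": ["Piping", "HVAC", "Equipment"],
        "Civil": [],
    },
    "Industrial_Safety": {
        "Practice": ["Checklist", "Patrol_Inspection", "PPE"],
    },
}

# The 13 document types listed in the prompt.
DOCUMENT_TYPES = {
    "Reference", "Standard", "Manual", "Drawing", "Template", "Note",
    "Academic_Paper", "Law_Document", "Report", "Memo", "Checklist",
    "Meeting_Minutes", "Specification",
}


def is_leaf_path(domain: str) -> bool:
    """Return True if `domain` is the most specific node in TAXONOMY."""
    parts = domain.split("/")
    if len(parts) < 2 or parts[0] not in TAXONOMY:
        return False
    level2 = TAXONOMY[parts[0]]
    if parts[1] not in level2:
        return False
    leaves = level2[parts[1]]
    if len(parts) == 2:
        # 2-level paths are allowed only when no leaf exists below them.
        return not leaves
    return len(parts) == 3 and parts[2] in leaves


def confidence_band(confidence: float) -> str:
    """Map a score to the bands named in the prompt."""
    if confidence >= 0.85:
        return "high"
    if confidence >= 0.6:
        return "moderate"
    return "uncertain"


def validate_classification(raw: str) -> dict:
    """Parse the model's JSON reply and apply the prompt's rules."""
    data = json.loads(raw)
    domain = data.get("domain", "")
    doc_type = data.get("document_type")
    if not isinstance(doc_type, str) or doc_type not in DOCUMENT_TYPES:
        doc_type = "Note"  # the prompt's fallback: "If unclear → Note"
    # Clamp to the 0.0 ~ 1.0 range the prompt specifies.
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return {
        "domain": domain,
        "domain_valid": isinstance(domain, str) and is_leaf_path(domain),
        "document_type": doc_type,
        "confidence": confidence,
        "confidence_band": confidence_band(confidence),
    }
```

A caller might route any result with `domain_valid` set to False, or with an "uncertain" band, to manual review instead of filing it automatically. That routing is a design assumption for illustration, not something this commit states.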