Commit Graph

4 Commits

Author SHA1 Message Date
hyungi 446ba82c91 feat(eval): Phase 2Q Diagnose Phase 1A — fixture (4 카테고리 × 2 LLM) + prompt v1
phase-2q-query-rewrite-diagnose.md v6 plan 의 Phase 1 fixture 박제 (G0-1 + G0-2).

산출물:
- app/prompts/query_rewrite.txt — multi-query rewrite prompt v1 (3 variants: 원본 + 한국어 rephrase + 영어 번역)
- tests/fixtures/macmini_gemma4_query_rewrite_response.json — 4 카테고리 (korean_only/mixed/english_only/exam)
- tests/fixtures/macbook_qwen_query_rewrite_response.json — 4 카테고리 동일

inspect 9 결과 (2026-05-24):
- Mac mini gemma-4-26B-A4B :8801 = response_format json_object 지원
- MacBook qwen3.6-27B-8bit :8810 = response_format json_object 미지원 (120s hang) — prompt rule only
- prompt rule \"no markdown, no code fence\" 강제 시 둘 다 strict JSON (gemma 도 fence wrap 없음)
- parser fallback (markdown fence regex) 유지 — 첫 호출 prompt 없을 때 wrap 관찰 사례

8 호출 측정:
- gemma 1.16~1.36s / qwen 1.93~2.24s (warm)
- variants 의미 일관 + 도메인 용어 (ASME/Section VIII/압력용기/가스기사) verbatim preserve
- 한국어→영어 cross-lingual translation 자연

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 22:09:29 +00:00
hyungi 076c0e1802 feat(eval): Phase 2B Reranker Diagnose — dispatcher + gte 측정 + decision (H3 bge-reranker-v2-m3 유지)
round-2-review-mighty-starfish.md v2.1 (Phase 2B Reranker Diagnose) plan 실행.
Phase 2A 의 CANDIDATE_BACKEND_MAP 패턴 재사용 + RERANKER_BACKEND_MAP 신규.

코드 변경 (4 파일):
- app/services/search/rerank_service.py:
  - RERANKER_BACKEND_MAP allowlist (baseline / cand_gte_ml_base, slug-based resolve)
  - _resolve_reranker(slug) → endpoint URL or None
  - _rerank_via_candidate_endpoint() — 후보 TEI POST /rerank
  - rerank_chunks() 시그니처에 reranker_backend + snapshot_*_id_max 추가 + dispatch log
- app/services/search/search_pipeline.py: run_search() threading
- app/api/search.py: reranker_backend Query parameter + 400 unknown_reranker_backend 에러 매핑
- tests/search_eval/run_eval.py: --reranker-backend flag + call_search/evaluate threading

infra:
- docker-compose.override.rerank-cand.yml: 3 후보 service (gte_ml_base / mxbai_large / bge_v2_gemma_2b),
  profile 'rerank-cand' 격리, restart=unless-stopped

측정 산출물 (51 case, scored=46, failure=5):
- reports/v0_2_phase2b_baseline_snapshot_2026-05-23.csv (NDCG 0.659, Phase 2A 와 일치 = 재현성 PASS)
- reports/v0_2_phase2b_gte_ml_base_2026-05-23.csv
- tests/search_eval/baselines/v0_2_phase2b_{baseline_snapshot,gte_ml_base}_2026-05-23.json
- reports/phase_2b_reranker_decision_2026-05-23.md
- tests/fixtures/tei_rerank_response.json (G0-1 한국어+영어 mixed sample sanity PASS)

후보 TEI 1.7 호환성 (Phase 1 smoke gate):
- cand_gte_ml_base       :  PASS (xlm-roberta-based, TEI 호환)
- cand_mxbai_large       :  deberta-v2 미지원 → Phase 2B-Extended (sentence-transformers wrapper)
- cand_bge_v2_gemma_2b   :  LLM-based reranker, 1_Pooling/config.json 부재 → Phase 2B-Extended (FlagEmbedding wrapper)

결과 (1 후보 측정 + baseline rebaseline):
| Candidate                          | NDCG  | Δ baseline | mixed | korean | exam  | p50 ms |
|------------------------------------|------:|-----------:|------:|-------:|------:|-------:|
| bge-reranker-v2-m3 (baseline)      | 0.659 | —          | 0.39  | 0.51   | 0.74  | 454    |
| cand_gte_ml_base                   | 0.604 | -0.055     | 0.38  | 0.41   | 0.62  | 345    |

Decision (H3): bge-reranker-v2-m3 유지. gte 의 reranker quality 가 production 보다 약함 (korean_only -0.10, exam -0.12, overall -0.055).

후속 PR 백로그 (6건):
- PR-Search-Query-Rewrite-1 (Phase 2Q, korean_only/mixed 보완 권고)
- PR-2B-Extended-Mxbai-Large (sentence-transformers wrapper)
- PR-2B-Extended-Bge-V2-Gemma (FlagEmbedding LayerwiseReranker wrapper)
- PR-2B-Extended-Jina-V2-ML (license 결정 후, 개인 비영리 가정)
- PR-2B-Cloud-Reranker-Scaffold-1 (Cohere scaffold-only, 선택)
- PR-2B-Rerank-Cand-Cleanup-1 (1주 후 cand 컨테이너 정리)

production 영향:
- production reranker (bge-reranker-v2-m3) 변경 0
- config.yaml ai.models.rerank.endpoint 변경 0
- embedding (bge-m3 ollama) 변경 0 (Phase 2A 결정 보존)
- documents / document_chunks 변경 0 (21365 docs / 30605 chunks 그대로)
- 4 smoke PASS (baseline / baseline+snapshot / cand_gte_ml_base / cand_invalid → 400)
- dispatch log 박제 verify (endpoint + snapshot id)

closure gate: 16 항목 PASS (flex closure 조항 적용 — 1 후보 측정, 2 후보 TEI 호환 탈락 사유 명시).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 08:37:42 +00:00
hyungi 943ac5f59c feat(eval): Phase 2A Diagnose Phase 1 — TEI candidate compose override + fixture G0
Phase 2A Embedding Diagnose 본 PR 의 Phase 1 산출물.

- docker-compose.override.cand.yml: 4 후보 service, profile 'embed-cand' 격리
  - active: me5_large_inst (intfloat/multilingual-e5-large-instruct, smoke PASS)
  - active: snowflake_l_v2 (Snowflake/snowflake-arctic-embed-l-v2.0, smoke PASS)
  - 비활성 (extended profile): bge_mgemma2 (9B FP16 OOM risk → 별 PR 이관)
  - 비활성 (disabled profile): me5_ko (HF 401 → 폐기)

- tests/fixtures/: G0 fixture 3건 박제
  - ollama_bge_m3_embedding_response.json (G0-2: dim 1024, flat dict shape)
  - tei_embedding_response.json (G0-1: me5_large_inst, dim 1024, nested array)
  - tei_embedding_snowflake_l_v2_response.json (G0-1: snowflake, dim 1024, nested array)

운영 변경 0 (profile 격리, default up 시 미기동). production 9 컨테이너 영향 없음.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 05:04:21 +00:00
hyungi 51c3f6df10 feat(search): /ask/react endpoint with Qwen native tool calling ReAct loop
PR-DocSrv-Ask-ToolCalling-ReAct-1 — Qwen3.6-27B-8bit 의 native tool calling
으로 ReAct loop 도입. 기존 /api/search/ask 무수정. 트랙 B (frontend /ask SSE)
와 파일 단위 충돌 0 (search.py 의 ask() 함수 line diff = 0, 순수 추가).

핵심 invariant:
- 별 endpoint /api/search/ask/react (qwen-macbook only, implicit opt-in)
- MacBook unavailable 시 HTTP 503 + error_reason=macbook_unavailable.
  Gemma 자동 fallback X (정정 4 의 연장)

G0 (구현 전 hard gate, plan b-velvety-hare.md):
- G0-1 fixture (tests/fixtures/qwen_tool_call_response.json): 실제 mlx-vlm
  응답 박제. shape = OpenAI 표준 호환 (choices[0].message.tool_calls +
  function.arguments JSON string). generate_with_tools() 가 본 shape 기준 구현.
- G0-2 counter semantics: max_tool_rounds=2 + max_llm_calls=3 + search_exec_max=2.
  마지막 LLM 호출은 tool_choice="none" + system instruction 으로 final 강제.
- G0-3 trace exposure: default response 의 debug_trace=null. debug=true 시만
  채움. server log 에는 항상 round 기록.

backends.py (193 → 261줄):
- QwenMacBookBackend.generate_with_tools(messages, tools, tool_choice)
  신규 method. 기존 generate() 무수정. BackendUnavailable 처리 동일.

react_loop.py 신규 (275줄):
- agentic_ask_loop(session, query, *, backend, max_tool_rounds, debug)
- tool round 안에서 run_search 호출, results dedup by id, final round 강제,
  partial=True 조건 (final content 빈 경우)

search.py (+82줄):
- POST /api/search/ask/react + AskReactRequest/Response schema
- BackendUnavailable → JSONResponse(503, error_reason=macbook_unavailable)

config.yaml + config.py:
- search.ask.react: { enabled, max_tool_rounds=2, search_tool_limit=5,
  search_tool_mode=hybrid }

tests (566줄, 18 신규 + 23 회귀 모두 PASS):
- test_react_loop.py 13건: G0-1 fixture shape / G0-2 counter cap / G0-3 trace
  exposure / BackendUnavailable propagation / sources dedup
- test_search_ask_react_endpoint.py 5건: 503 + run_search 호출 0 / 정상 200 /
  debug=true trace 노출 / max rounds partial
- 회귀 (test_ask_eval_auth 9 + test_search_ask_macbook_503 5 +
  test_backend_dispatcher 9) 모두 PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 13:43:47 +00:00