Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry

finetuning/workflows/diagnose-poor-results.md

# Diagnosing Poor Results

When your fine-tuned model performs worse than expected, work through this checklist top-down (most common causes first).

## Diagnostic Table

| # | Symptom | Likely Cause | Fix |
|---|---------|-------------|-----|
| 1 | Training loss → 0, validation loss rises | Overfitting | 1) Deploy earlier checkpoint. 2) Reduce epochs. 3) Lower LR. 4) Add more diverse data. Overfitting ratio > 1.5 is concerning. |
| 2 | High correctness, low conciseness (or reverse) | Dataset style mismatch | **Verbose**: Add concise examples, use "Be concise" system prompt, filter to shortest correct examples. **Terse**: Add detailed examples, increase dataset with quality-filtered data. |
| 3 | Model seems good on spot-check but auto-eval is low | Evaluation rubric issue | Manually grade 10 examples vs. LLM judge. Check: Is judge model strong enough? Is rubric clear? Do reference answers match desired output? |
| 4 | Garbage, empty outputs, or errors | Deployment/client bug | Check: wrong model format (→ HTTP 500), `AzureOpenAI` on project endpoint (→ "api-version not allowed"), low capacity (→ timeouts), wrong deployment name. Test with curl. |
| 5 | RFT model scores below base model | RFT-specific issue | See RFT section below. |
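
The overfitting check in row 1 can be sketched in a few lines. This is only an illustration: the document does not define "overfitting ratio", so the code assumes it means validation loss divided by training loss, and the `Checkpoint` type, the helper names, and the loss values are all hypothetical.

```python
from typing import NamedTuple, Sequence

OVERFIT_RATIO_THRESHOLD = 1.5  # "concerning" threshold quoted in the table


class Checkpoint(NamedTuple):
    """Hypothetical record of the losses logged at one checkpoint."""
    step: int
    train_loss: float
    val_loss: float


def overfitting_ratio(ckpt: Checkpoint) -> float:
    """ASSUMED definition: validation loss / training loss."""
    # Guard against division by zero when training loss collapses to 0.
    return ckpt.val_loss / max(ckpt.train_loss, 1e-8)


def is_overfit(ckpt: Checkpoint) -> bool:
    return overfitting_ratio(ckpt) > OVERFIT_RATIO_THRESHOLD


def best_checkpoint(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Pick the checkpoint with the lowest validation loss (fix 1: deploy earlier)."""
    return min(checkpoints, key=lambda c: c.val_loss)


# Made-up loss history: training loss keeps falling while validation loss turns up.
history = [
    Checkpoint(step=100, train_loss=0.90, val_loss=0.95),
    Checkpoint(step=200, train_loss=0.50, val_loss=0.70),
    Checkpoint(step=300, train_loss=0.20, val_loss=0.85),
]

final = history[-1]
if is_overfit(final):
    print(f"Final checkpoint looks overfit (ratio {overfitting_ratio(final):.2f}); "
          f"consider deploying step {best_checkpoint(history).step} instead.")
```

If your training job reports the ratio differently, swap in that definition; the threshold comparison and the "fall back to the lowest-validation-loss checkpoint" step stay the same.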

## RFT-Specific Diagnosis

| Signal | Meaning | Fix |
|--------|---------|-----|
| Train-val grader gap > 0.2 | Model gaming the grader | Use stricter/more deterministic grader (Python execution > LLM judge) |
| Grader too easy | High grader scores but bad outputs | Add multi-criteria grading (syntax + semantic) |
| Grader too noisy | Random signal, no learning | Use deterministic grader or increase val set size |
| All of the above fail | RFT may not suit this task | Switch back to SFT |
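
As a rough sketch of two ideas from this table, the code below computes the train-val grader gap and implements a small deterministic, multi-criteria grader (syntax plus semantic) for generated Python functions. Everything here is an assumption made for illustration: the function names, the 0.2/0.8 weighting, and the test-case format are hypothetical, not part of any Foundry grader API.

```python
import ast
from statistics import mean
from typing import Any, Sequence, Tuple

GAP_THRESHOLD = 0.2  # train-val grader gap quoted in the table

Case = Tuple[tuple, Any]  # (positional args, expected return value)


def grader_gap(train_scores: Sequence[float], val_scores: Sequence[float]) -> float:
    """Mean grader score on training samples minus mean score on validation samples."""
    return mean(train_scores) - mean(val_scores)


def syntax_score(code: str) -> float:
    """1.0 if the candidate parses as Python, else 0.0."""
    try:
        ast.parse(code)
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def semantic_score(code: str, func_name: str, cases: Sequence[Case]) -> float:
    """Fraction of test cases the candidate function passes when executed."""
    namespace: dict = {}
    try:
        # WARNING: exec runs arbitrary code; use a sandbox for untrusted model output.
        exec(code, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed / len(cases)


def grade(code: str, func_name: str, cases: Sequence[Case]) -> float:
    """Combine both criteria; the 0.2/0.8 weights are an arbitrary illustrative choice."""
    syn = syntax_score(code)
    if syn == 0.0:
        return 0.0
    return 0.2 * syn + 0.8 * semantic_score(code, func_name, cases)


cases = [((1, 2), 3), ((0, 0), 0)]
print(grade("def add(a, b):\n    return a + b", "add", cases))  # correct candidate
print(grade("def add(a, b):\n    return a - b", "add", cases))  # parses, but wrong on one case

gap = grader_gap([0.9, 0.8], [0.5, 0.6])
if gap > GAP_THRESHOLD:
    print(f"Train-val grader gap {gap:.2f} exceeds {GAP_THRESHOLD}: possible grader gaming.")
```

Because the score comes from parsing and executing the output rather than from an LLM judge, the same candidate always receives the same score, which is the property the table recommends when the grader is noisy or easy to game.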

## Escalation Path

If nothing above helps:

1. **Try a different base model**: some fine-tune better for certain tasks
2. **Increase dataset 2x-5x** with synthetic data
3. **Simplify the task**: fine-tune for a narrower sub-task first
4. **Try prompt engineering instead**: sometimes a well-crafted system prompt beats fine-tuning
5. **Combine approaches**: prompt engineering + fine-tuning together

## Red Flags: Don't Fine-Tune

- Base model already scores > 9.0 (minimal headroom)
- Task changes frequently (constant retraining needed)
- < 50 examples and can't generate synthetic data
- "Correct" output is highly subjective
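
The red flags above can be turned into a quick pre-flight check before starting a fine-tuning job. The sketch below is illustrative only: `TaskProfile` and `red_flags` are hypothetical names, and the score scale is whatever your evaluation uses (the 9.0 threshold suggests, but the document does not state, a 0-10 scale).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskProfile:
    """Hypothetical summary of a candidate fine-tuning task."""
    base_model_score: float        # eval score of the base model (scale ASSUMED 0-10)
    task_changes_frequently: bool
    num_examples: int
    can_generate_synthetic: bool
    output_is_subjective: bool


def red_flags(p: TaskProfile) -> List[str]:
    """Return the red flags from the list above that apply to this task."""
    flags = []
    if p.base_model_score > 9.0:
        flags.append("base model already scores > 9.0 (minimal headroom)")
    if p.task_changes_frequently:
        flags.append("task changes frequently (constant retraining needed)")
    if p.num_examples < 50 and not p.can_generate_synthetic:
        flags.append("< 50 examples and can't generate synthetic data")
    if p.output_is_subjective:
        flags.append('"correct" output is highly subjective')
    return flags


profile = TaskProfile(
    base_model_score=9.4,
    task_changes_frequently=False,
    num_examples=30,
    can_generate_synthetic=False,
    output_is_subjective=False,
)
for flag in red_flags(profile):
    print("Red flag:", flag)
```

An empty result does not guarantee fine-tuning will help; it only means none of the listed red flags applies.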