Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

Publisher: microsoft (official)

- Files: 155
- Skill: n/a
- Size: 976.3 KB
- Entrypoint: SKILL.md
- Format: git-repo

finetuning/workflows/diagnose-poor-results.md

# Diagnosing Poor Results

When your fine-tuned model performs worse than expected, work through this checklist from the top down; the most common causes come first.

## Diagnostic Table

| # | Symptom | Likely Cause | Fix |
|---|---------|--------------|-----|
| 1 | Training loss → 0 while validation loss rises | Overfitting | 1) Deploy an earlier checkpoint. 2) Reduce epochs. 3) Lower the learning rate. 4) Add more diverse data. An overfitting ratio > 1.5 is concerning. |
| 2 | High correctness but low conciseness (or the reverse) | Dataset style mismatch | **Verbose**: Add concise examples, use a "Be concise" system prompt, and filter to the shortest correct examples. **Terse**: Add detailed examples and grow the dataset with quality-filtered data. |
| 3 | Model looks good on spot checks but automated eval scores are low | Evaluation rubric issue | Manually grade 10 examples and compare against the LLM judge. Check: Is the judge model strong enough? Is the rubric clear? Do the reference answers match the desired output? |
| 4 | Garbage, empty outputs, or errors | Deployment/client bug | Check for: a wrong model format (→ HTTP 500), the `AzureOpenAI` client pointed at a project endpoint (→ "api-version not allowed"), low capacity (→ timeouts), or a wrong deployment name. Test the deployment with curl. |
| 5 | RFT model scores below the base model | RFT-specific issue | See the RFT-Specific Diagnosis section below. |
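
The overfitting check in row 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the table does not define the overfitting ratio, so the code assumes it is validation loss divided by training loss at the same checkpoint, and it assumes per-checkpoint losses are already available as plain numbers. `Checkpoint`, `overfitting_ratio`, and `pick_checkpoint` are hypothetical names for illustration, not part of any Azure SDK.

```python
from dataclasses import dataclass

# Threshold from the diagnostic table: a ratio above 1.5 is concerning.
OVERFIT_THRESHOLD = 1.5


@dataclass
class Checkpoint:
    """Hypothetical record of the losses logged at one checkpoint."""
    step: int
    train_loss: float
    val_loss: float


def overfitting_ratio(cp: Checkpoint) -> float:
    """Assumed definition: validation loss divided by training loss."""
    if cp.train_loss <= 0:
        return float("inf")  # training loss collapsed to zero
    return cp.val_loss / cp.train_loss


def pick_checkpoint(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Pick the lowest-validation-loss checkpoint among those at or under
    the threshold; fall back to the lowest validation loss overall."""
    healthy = [cp for cp in checkpoints if overfitting_ratio(cp) <= OVERFIT_THRESHOLD]
    candidates = healthy or checkpoints
    return min(candidates, key=lambda cp: cp.val_loss)


if __name__ == "__main__":
    history = [
        Checkpoint(step=100, train_loss=0.90, val_loss=0.95),
        Checkpoint(step=200, train_loss=0.50, val_loss=0.60),
        Checkpoint(step=300, train_loss=0.10, val_loss=0.80),  # ratio 8.0
    ]
    best = pick_checkpoint(history)
    print(f"deploy checkpoint at step {best.step}")
```

With this illustrative history, the final checkpoint has the lowest training loss but is excluded by the threshold, so the step-200 checkpoint is the one to deploy.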

## RFT-Specific Diagnosis

| Signal | Meaning | Fix |
|--------|---------|-----|
| Train-val grader gap > 0.2 | Model is gaming the grader | Use a stricter, more deterministic grader (prefer Python execution over an LLM judge) |
| High grader scores but bad outputs | Grader is too easy | Add multi-criteria grading (syntax + semantic) |
| Random signal, no learning | Grader is too noisy | Use a deterministic grader or increase the validation set size |
| All of the fixes above fail | RFT may not suit this task | Switch back to SFT |
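
The first and third signals can be checked numerically once grader scores are exported. The sketch below assumes the train-val grader gap is the mean training grader score minus the mean validation grader score, and it measures noise as the spread of scores when the same output is graded several times. Both definitions, and the names `grader_gap`, `is_gaming_grader`, and `grader_noise`, are assumptions made for illustration; only the 0.2 threshold comes from the table.

```python
from statistics import mean, pstdev

# Threshold from the table: a train-val grader gap above 0.2.
GAP_THRESHOLD = 0.2


def grader_gap(train_scores: list[float], val_scores: list[float]) -> float:
    """Assumed definition: mean train grader score minus mean val score."""
    return mean(train_scores) - mean(val_scores)


def is_gaming_grader(train_scores: list[float], val_scores: list[float]) -> bool:
    """True when the gap exceeds the threshold from the table."""
    return grader_gap(train_scores, val_scores) > GAP_THRESHOLD


def grader_noise(repeat_scores: list[list[float]]) -> float:
    """Average standard deviation across outputs that were each graded
    several times. A deterministic grader yields 0.0."""
    return mean([pstdev(scores) for scores in repeat_scores])


if __name__ == "__main__":
    train = [0.90, 0.80, 0.85]  # illustrative scores, not real results
    val = [0.55, 0.60, 0.50]
    print("gaming the grader:", is_gaming_grader(train, val))

    # Each inner list holds repeated grades for one fixed model output.
    repeats = [[0.2, 0.9, 0.5], [0.7, 0.1, 0.8]]
    print("grader noise:", round(grader_noise(repeats), 3))
```

The table gives no cutoff for "too noisy", so the noise value is best read comparatively, for example before and after switching to a deterministic grader.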

## Escalation Path

If nothing above helps:

1. **Try a different base model**: some fine-tune better for certain tasks
2. **Increase the dataset 2x-5x** with synthetic data
3. **Simplify the task**: fine-tune for a narrower sub-task first
4. **Try prompt engineering instead**: sometimes a well-crafted system prompt beats fine-tuning
5. **Combine approaches**: use prompt engineering and fine-tuning together

## Red Flags: Don't Fine-Tune

- Base model already scores > 9.0 (minimal headroom)
- Task changes frequently (constant retraining needed)
- Fewer than 50 examples and no way to generate synthetic data
- "Correct" output is highly subjective
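
The red flags above can be encoded as a quick pre-flight check before starting a fine-tuning job. This is a sketch only: `red_flags` and its parameters are hypothetical names, the 9.0 and 50 thresholds are copied from the list, and the score is assumed to be on whatever scale your evaluation already uses, since the list does not state one.

```python
# Thresholds taken from the red-flag list.
MAX_BASE_SCORE = 9.0
MIN_EXAMPLES = 50


def red_flags(
    base_score: float,
    task_changes_often: bool,
    n_examples: int,
    can_generate_synthetic: bool,
    output_is_subjective: bool,
) -> list[str]:
    """Return the red flags that apply; an empty list means none were hit."""
    flags = []
    if base_score > MAX_BASE_SCORE:
        flags.append("base model already scores above 9.0")
    if task_changes_often:
        flags.append("task changes frequently")
    # A small dataset is only a red flag when it cannot be augmented.
    if n_examples < MIN_EXAMPLES and not can_generate_synthetic:
        flags.append("fewer than 50 examples and no synthetic data")
    if output_is_subjective:
        flags.append("correct output is highly subjective")
    return flags


if __name__ == "__main__":
    for flag in red_flags(9.4, False, 30, False, False):
        print("red flag:", flag)
```

An empty result does not guarantee that fine-tuning will help; it only means none of the listed red flags applies.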