Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
finetuning/workflows/diagnose-poor-results.md
# Diagnosing Poor Results

When your fine-tuned model performs worse than expected, work through this checklist top-down (most common causes first).

## Diagnostic Table

| # | Symptom | Likely Cause | Fix |
|---|---------|-------------|-----|
| 1 | Training loss → 0, validation loss rises | Overfitting | 1) Deploy earlier checkpoint. 2) Reduce epochs. 3) Lower LR. 4) Add more diverse data. Overfitting ratio > 1.5 is concerning. |
| 2 | High correctness, low conciseness (or reverse) | Dataset style mismatch | **Verbose**: Add concise examples, use "Be concise" system prompt, filter to shortest correct examples. **Terse**: Add detailed examples, increase dataset with quality-filtered data. |
| 3 | Model seems good on spot-check but auto-eval is low | Evaluation rubric issue | Manually grade 10 examples vs. LLM judge. Check: Is judge model strong enough? Is rubric clear? Do reference answers match desired output? |
| 4 | Garbage, empty outputs, or errors | Deployment/client bug | Check: wrong model format (→ HTTP 500), `AzureOpenAI` on project endpoint (→ "api-version not allowed"), low capacity (→ timeouts), wrong deployment name. Test with curl. |
| 5 | RFT model scores below base model | RFT-specific issue | See RFT section below. |

## RFT-Specific Diagnosis

| Signal | Meaning | Fix |
|--------|---------|-----|
| Train-val grader gap > 0.2 | Model gaming the grader | Use stricter/more deterministic grader (Python execution > LLM judge) |
| Grader too easy | High grader scores but bad outputs | Add multi-criteria grading (syntax + semantic) |
| Grader too noisy | Random signal, no learning | Use deterministic grader or increase val set size |
| All of the above fail | RFT may not suit this task | Switch back to SFT |

## Escalation Path

If nothing above helps:

1. **Try a different base model**: some fine-tune better for certain tasks
2. **Increase dataset 2x-5x** with synthetic data
3. **Simplify the task**: fine-tune for a narrower sub-task first
4. **Try prompt engineering instead**: sometimes a well-crafted system prompt beats fine-tuning
5. **Combine approaches**: prompt engineering + fine-tuning together

## Red Flags: Don't Fine-Tune

- Base model already scores > 9.0 (minimal headroom)
- Task changes frequently (constant retraining needed)
- < 50 examples and can't generate synthetic data
- "Correct" output is highly subjective
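
The numeric thresholds quoted in this checklist (overfitting ratio > 1.5, train-val grader gap > 0.2, base score > 9.0, fewer than 50 examples) can be turned into a quick triage script. The following is a minimal sketch, not part of the skill package. It assumes that "overfitting ratio" means validation loss divided by training loss and that "grader gap" means the training grader score minus the validation grader score; the checklist does not define either. `RunMetrics`, `triage`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the checklist above.
OVERFIT_RATIO_LIMIT = 1.5   # "Overfitting ratio > 1.5 is concerning"
GRADER_GAP_LIMIT = 0.2      # "Train-val grader gap > 0.2"
BASE_SCORE_CEILING = 9.0    # "Base model already scores > 9.0"
MIN_EXAMPLES = 50           # "< 50 examples"


@dataclass
class RunMetrics:
    """Hypothetical container for numbers copied from a fine-tuning run."""
    train_loss: float
    val_loss: float
    train_grader_score: Optional[float] = None  # RFT runs only
    val_grader_score: Optional[float] = None    # RFT runs only
    base_model_score: Optional[float] = None
    num_examples: Optional[int] = None


def triage(m: RunMetrics) -> List[str]:
    """Return the checklist items whose numeric threshold is exceeded."""
    findings: List[str] = []

    # ASSUMPTION: overfitting ratio = val_loss / train_loss.
    # A training loss of exactly 0 is skipped and left for manual review.
    if m.train_loss > 0 and m.val_loss / m.train_loss > OVERFIT_RATIO_LIMIT:
        findings.append("overfitting")

    # ASSUMPTION: grader gap = train grader score - val grader score.
    if m.train_grader_score is not None and m.val_grader_score is not None:
        if m.train_grader_score - m.val_grader_score > GRADER_GAP_LIMIT:
            findings.append("grader-gaming")

    if m.base_model_score is not None and m.base_model_score > BASE_SCORE_CEILING:
        findings.append("minimal-headroom")

    if m.num_examples is not None and m.num_examples < MIN_EXAMPLES:
        findings.append("too-few-examples")

    return findings


if __name__ == "__main__":
    run = RunMetrics(train_loss=0.2, val_loss=0.5, num_examples=40)
    print(triage(run))
```

A script like this only covers the rows that carry a number. Symptoms such as style mismatch, rubric problems, and deployment errors still need the manual checks described in the tables.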