finetuning/workflows/diagnose-poor-results.md
# Diagnosing Poor Results

When your fine-tuned model performs worse than expected, work through this checklist top-down (most common causes first).

## Diagnostic Table

| # | Symptom | Likely Cause | Fix |
|---|---------|-------------|-----|
| 1 | Training loss → 0, validation loss rises | Overfitting | 1) Deploy earlier checkpoint. 2) Reduce epochs. 3) Lower LR. 4) Add more diverse data. Overfitting ratio > 1.5 is concerning. |
| 2 | High correctness, low conciseness (or reverse) | Dataset style mismatch | **Verbose**: Add concise examples, use "Be concise" system prompt, filter to shortest correct examples. **Terse**: Add detailed examples, increase dataset with quality-filtered data. |
| 3 | Model seems good on spot-check but auto-eval is low | Evaluation rubric issue | Manually grade 10 examples vs. LLM judge. Check: Is judge model strong enough? Is rubric clear? Do reference answers match desired output? |
| 4 | Garbage, empty outputs, or errors | Deployment/client bug | Check: wrong model format (→ HTTP 500), `AzureOpenAI` on project endpoint (→ "api-version not allowed"), low capacity (→ timeouts), wrong deployment name. Test with curl. |
| 5 | RFT model scores below base model | RFT-specific issue | See RFT section below. |

## RFT-Specific Diagnosis

| Signal | Meaning | Fix |
|--------|---------|-----|
| Train-val grader gap > 0.2 | Model gaming the grader | Use stricter/more deterministic grader (Python execution > LLM judge) |
| Grader too easy | High grader scores but bad outputs | Add multi-criteria grading (syntax + semantic) |
| Grader too noisy | Random signal, no learning | Use deterministic grader or increase val set size |
| All of the above fail | RFT may not suit this task | Switch back to SFT |

## Escalation Path

If nothing above helps:

1. **Try a different base model**: some fine-tune better for certain tasks
2. **Increase dataset 2x-5x** with synthetic data
3. **Simplify the task**: fine-tune for a narrower sub-task first
4. **Try prompt engineering instead**: sometimes a well-crafted system prompt beats fine-tuning
5. **Combine approaches**: prompt engineering + fine-tuning together

## Red Flags: Don't Fine-Tune

- Base model already scores > 9.0 (minimal headroom)
- Task changes frequently (constant retraining needed)
- < 50 examples and can't generate synthetic data
- "Correct" output is highly subjective
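
## Example: Checking the Numeric Thresholds

The two numeric thresholds in the tables above (overfitting ratio > 1.5, train-val grader gap > 0.2) can be checked mechanically from training metrics. The following is a minimal sketch, not an official tool. It assumes that "overfitting ratio" means validation loss divided by training loss, and that "grader gap" means the training grader score minus the validation grader score; this document does not define either term. `Checkpoint`, `overfitting_ratio`, `best_checkpoint`, and `diagnose` are hypothetical names and are not part of any Azure SDK.

```python
from __future__ import annotations

from dataclasses import dataclass

OVERFIT_RATIO_LIMIT = 1.5  # threshold quoted in the Diagnostic Table
GRADER_GAP_LIMIT = 0.2     # threshold quoted in the RFT-Specific Diagnosis table


@dataclass
class Checkpoint:
    """One logged training step (hypothetical record type)."""
    step: int
    train_loss: float
    val_loss: float


def overfitting_ratio(ckpt: Checkpoint) -> float:
    """ASSUMPTION: ratio = validation loss / training loss."""
    if ckpt.train_loss <= 0:
        return float("inf")  # training loss collapsed to zero
    return ckpt.val_loss / ckpt.train_loss


def best_checkpoint(checkpoints: list[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest validation loss."""
    return min(checkpoints, key=lambda c: c.val_loss)


def diagnose(
    checkpoints: list[Checkpoint],
    train_grader: float | None = None,
    val_grader: float | None = None,
) -> list[str]:
    """Return human-readable findings for the final checkpoint."""
    findings: list[str] = []
    last = checkpoints[-1]
    ratio = overfitting_ratio(last)
    if ratio > OVERFIT_RATIO_LIMIT:
        best = best_checkpoint(checkpoints)
        findings.append(
            f"Possible overfitting (ratio {ratio:.2f}); "
            f"consider the checkpoint at step {best.step}."
        )
    # ASSUMPTION: gap = training grader score - validation grader score.
    if train_grader is not None and val_grader is not None:
        gap = train_grader - val_grader
        if gap > GRADER_GAP_LIMIT:
            findings.append(
                f"Train-val grader gap {gap:.2f}; the model may be gaming the grader."
            )
    return findings


if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    history = [
        Checkpoint(step=100, train_loss=1.2, val_loss=1.3),
        Checkpoint(step=200, train_loss=0.6, val_loss=0.9),
        Checkpoint(step=300, train_loss=0.2, val_loss=1.1),
    ]
    for finding in diagnose(history, train_grader=0.9, val_grader=0.6):
        print(finding)
```

Choosing the checkpoint with the lowest validation loss is one simple way to act on the "deploy earlier checkpoint" fix; other selection rules, such as the best score on a held-out evaluation set, may suit a given task better.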