Diagnosing Poor Results

When your fine-tuned model performs worse than expected, work through this checklist top-down (most common causes first).

Diagnostic Table

#	Symptom	Likely Cause	Fix
1	Training loss → 0, validation loss rises	Overfitting	1) Deploy an earlier checkpoint. 2) Reduce the number of epochs. 3) Lower the learning rate. 4) Add more diverse data. An overfitting ratio > 1.5 is concerning.
2	High correctness, low conciseness (or reverse)	Dataset style mismatch	Verbose: Add concise examples, use "Be concise" system prompt, filter to shortest correct examples. Terse: Add detailed examples, increase dataset with quality-filtered data.
3	Model looks good on spot checks but scores low on automated evaluation	Evaluation rubric issue	Manually grade 10 examples and compare your grades with the LLM judge's. Check: Is the judge model strong enough? Is the rubric clear? Do the reference answers match the desired output?
4	Garbage, empty outputs, or errors	Deployment/client bug	Check: wrong model format (→ HTTP 500), `AzureOpenAI` on project endpoint (→ "api-version not allowed"), low capacity (→ timeouts), wrong deployment name. Test with curl.
5	RFT model scores below base model	RFT-specific issue	See RFT section below.
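
Row 1 can be made concrete with a small checkpoint picker. This is a minimal sketch, not a definitive implementation: it assumes the overfitting ratio means validation loss divided by training loss (the table does not define it), and the `Checkpoint` record, `pick_checkpoint` helper, and loss values are hypothetical illustrations rather than part of any fine-tuning API.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical record of one saved checkpoint and its losses."""
    step: int
    train_loss: float
    val_loss: float


def overfitting_ratio(train_loss: float, val_loss: float, eps: float = 1e-8) -> float:
    # Assumed definition: validation loss divided by training loss.
    # eps guards against division by zero when training loss collapses to 0.
    return val_loss / max(train_loss, eps)


def pick_checkpoint(checkpoints: list[Checkpoint], max_ratio: float = 1.5) -> Checkpoint:
    """Pick the lowest-validation-loss checkpoint whose ratio is within max_ratio.

    Falls back to the lowest validation loss overall if every checkpoint
    exceeds the threshold.
    """
    healthy = [
        c for c in checkpoints
        if overfitting_ratio(c.train_loss, c.val_loss) <= max_ratio
    ]
    pool = healthy or checkpoints
    return min(pool, key=lambda c: c.val_loss)


# Made-up loss history: the last checkpoint shows the row 1 symptom
# (training loss keeps falling while validation loss rises).
history = [
    Checkpoint(step=100, train_loss=1.20, val_loss=1.25),
    Checkpoint(step=200, train_loss=0.80, val_loss=0.95),
    Checkpoint(step=300, train_loss=0.30, val_loss=1.10),
]

best = pick_checkpoint(history)
print(f"Deploy checkpoint at step {best.step}")
```

With these numbers the step-300 checkpoint has a ratio well above 1.5, so the earlier step-200 checkpoint is the one selected.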

RFT Diagnostics

Signal	Meaning	Fix
Train-val grader gap > 0.2	Model gaming the grader	Use a stricter, more deterministic grader (prefer Python execution over an LLM judge)
High grader scores but bad outputs	Grader too easy	Add multi-criteria grading (syntax + semantic)
Random signal, no learning	Grader too noisy	Use a deterministic grader or increase the validation set size
All of the above fail	RFT may not suit this task	Switch back to SFT
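
The grader checks in this table can be sketched in a few lines. The example below is illustrative only: it assumes the task is generating a Python function, that the train-val grader gap is the difference between mean grader scores on the two sets, and that syntax and semantic scores are combined with a hypothetical 0.2/0.8 weighting. All function names are made up for this sketch, and `exec` must only be run on model output inside a sandbox.

```python
import ast


def grader_gap(train_scores: list[float], val_scores: list[float]) -> float:
    """Mean training grader score minus mean validation grader score."""
    train_mean = sum(train_scores) / len(train_scores)
    val_mean = sum(val_scores) / len(val_scores)
    return train_mean - val_mean


def syntax_score(code: str) -> float:
    """1.0 if the code parses as Python, otherwise 0.0."""
    try:
        ast.parse(code)
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def semantic_score(code: str, func_name: str, cases: list[tuple[tuple, object]]) -> float:
    """Fraction of (args, expected) cases that the generated function passes.

    WARNING: exec runs arbitrary code; sandbox this for real model output.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed / len(cases)


def multi_criteria_grade(code: str, func_name: str, cases, syntax_weight: float = 0.2) -> float:
    """Deterministic grade: syntax gate plus weighted semantic score (weights are assumptions)."""
    syn = syntax_score(code)
    if syn == 0.0:
        return 0.0  # unparseable code earns nothing
    sem = semantic_score(code, func_name, cases)
    return syntax_weight * syn + (1 - syntax_weight) * sem


cases = [((1, 2), 3), ((0, 0), 0)]
correct = "def add(a, b):\n    return a + b"
wrong = "def add(a, b):\n    return a - b"
broken = "def add(a, b) return a + b"

for label, code in [("correct", correct), ("wrong", wrong), ("broken", broken)]:
    print(label, round(multi_criteria_grade(code, "add", cases), 2))

# Flag possible grader gaming when the gap exceeds the 0.2 threshold from the table.
gap = grader_gap([0.9, 0.8], [0.5, 0.6])
print("possible grader gaming" if gap > 0.2 else "gap ok")
```

Because every score here comes from parsing and executing the output, the same input always gets the same grade, which addresses the noisy-grader row as well as the too-easy one.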

If nothing above helps:

Try a different base model — some fine-tune better for certain tasks
Increase dataset 2x-5x with synthetic data
Simplify the task — fine-tune for a narrower sub-task first
Try prompt engineering instead — sometimes a well-crafted system prompt beats fine-tuning
Combine approaches — prompt engineering + fine-tuning together
