Platform Gotchas — Top 10
- OSS models require `"trainingType": "globalStandard"` in the request body. This is undocumented, and all OSS FT jobs fail without it.
- Model catalog `fine_tune` flag is wrong for OSS models. The API returns `fine_tune = false` for all OSS models even though they support fine-tuning. Hardcode the supported list.
- Older SDK versions may fail on `/v1/` project endpoints. `client.files.create()` throws "API version not supported" with older `openai` package versions. Upgrade to `openai>=1.0` and use the `/v1/` project endpoint (preferred). If you must use an older SDK, fall back to the REST API with the non-project `/openai/` endpoint.
- ARM "Succeeded" doesn't mean the deployment is ready. ARM reports `provisioningState: Succeeded`, but the data plane returns `DeploymentNotReady` indefinitely. Delete and recreate the deployment, then wait ~5 minutes.
- OSS FT deployments may fail with InternalServerError. Use the correct provider-specific `model.format` (e.g., `"Mistral AI"`, not `"OpenAI"`) and try `capacity=100`.
- OSS FT inference hits "Failed to load LoRA" intermittently — deploy with capacity ≥ 100, use 8+ retries with exponential backoff, and wait 2+ minutes after deployment before first call.
- ARM REST and `az cognitiveservices` use different format strings for OSS models. ARM uses provider names (`"Microsoft"`, `"Meta"`), while the CLI uses `"OpenAI-OSS"` for all OSS models. Mixing them produces HTTP 500.
- Content safety false positives on entity extraction data — PII-dense data (medical records, legal docs, resumes) can trigger "Hate/Fairness" blocks at deployment time. Remove problematic document types.
- FT deployments at capacity=1 are severely rate-limited (~1 RPM) — evaluating 10 samples takes ~10 minutes. Use capacity ≥ 100 for eval workloads and exponential backoff.
- Wrong resource endpoint is a silent killer — jobs submitted to the wrong Foundry resource succeed via API but don't appear in the portal. Always verify the endpoint matches your Foundry project.
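
Three of the workarounds above (the `trainingType` field, the hardcoded supported list, and retries with exponential backoff) can be sketched as small helpers. This is a minimal sketch, not a reference implementation: the function names, the contents of `OSS_FT_SUPPORTED`, and the `base`/`cap` delay values are all hypothetical. Only the `"trainingType": "globalStandard"` pair and the figure of 8 retries come from the list itself.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Hypothetical allow-list. Replace the placeholder with the OSS models you
# have verified as fine-tunable, since the catalog flag reports false for them.
OSS_FT_SUPPORTED = frozenset({"example-oss-model"})


def supports_fine_tuning(model_name: str, catalog_flag: bool) -> bool:
    """Trust a true catalog flag, but override a false one for listed models."""
    return catalog_flag or model_name in OSS_FT_SUPPORTED


def with_training_type(body: dict) -> dict:
    """Return a copy of a job request body with the field OSS jobs need."""
    return {**body, "trainingType": "globalStandard"}


def backoff_delays(retries: int = 8, base: float = 2.0, cap: float = 60.0) -> List[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    retries: int = 8,
    base: float = 2.0,
    cap: float = 60.0,
    is_retryable: Callable[[Exception], bool] = lambda exc: True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error, wait and try again up to `retries` times."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors we should not retry.
            if attempt == retries or not is_retryable(exc):
                raise
            # Jitter (50-100% of the nominal delay) spreads out concurrent retries.
            sleep(delays[attempt] * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

To retry only the intermittent LoRA failure, pass something like `is_retryable=lambda exc: "Failed to load LoRA" in str(exc)` and wrap whatever inference call you use in a zero-argument function. The `sleep` parameter exists so the helper can be exercised in tests without actually waiting.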