Provisioned Throughput Units (PTU) Guide
Table of Contents: Understanding PTU vs Standard TPM · When to Use PTU · PTU Capacity Planning · Deploy Model with PTU · Request PTU Quota Increase · Understanding Region and Deployment Quotas · External Resources
Understanding PTU vs Standard TPM
Microsoft Foundry offers two quota types:
Standard TPM (Tokens Per Minute)
- Pay-as-you-go model, charged per token
- Each deployment consumes capacity units (e.g., 10K TPM, 50K TPM)
- Total regional quota shared across all deployments
- Subject to rate limiting during high demand (429 errors possible)
- Best for: Variable workloads, development, testing, bursty traffic
Provisioned Throughput Units (PTU)
- Monthly commitment for guaranteed throughput
- No rate limiting, consistent latency
- Measured in PTU units (not TPM)
- Best for: Predictable, high-volume production workloads
- More cost-effective when consistent token usage justifies monthly commitment
When to Use PTU
| Factor | Standard (TPM) | Provisioned (PTU) |
|---|---|---|
| Best For | Variable workloads, development, testing | Predictable production workloads |
| Pricing | Pay-per-token | Monthly commitment (hourly rate per PTU) |
| Rate Limits | Yes (429 errors possible) | No (guaranteed throughput) |
| Latency | Variable | Consistent |
| Cost Decision | Lower upfront commitment | More economical for consistent, high-volume usage |
| Flexibility | Scale up/down instantly | Requires planning and commitment |
| Use Case | Prototyping, bursty traffic | Production apps, high-volume APIs |
Use PTU when:
- You have consistent, predictable token usage that makes the monthly commitment cost-effective
- You need guaranteed throughput (no 429 rate limit errors)
- You require consistent latency backed by a performance SLA
- You run high-volume production workloads with stable traffic patterns
Decision Guidance: Compare your current pay-as-you-go costs with PTU pricing. PTU may be more economical when consistent usage justifies the monthly commitment.
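As a rough sketch of that comparison, the back-of-the-envelope check below contrasts a month of pay-as-you-go spend with a PTU commitment. All figures (token volume, per-token rate, PTU count, hourly PTU rate) are hypothetical placeholders, not actual Azure prices; substitute your own numbers from the Azure pricing page.

```shell
# Break-even sketch: pay-as-you-go vs. PTU commitment.
# Every number below is a HYPOTHETICAL placeholder -- use your real pricing.
monthly_tokens=2000000000        # 2B tokens/month (hypothetical volume)
paygo_per_1k_tokens=0.005        # $/1K tokens (hypothetical blended rate)
ptu_count=100                    # PTUs from the capacity calculator
ptu_hourly_rate=1.00             # $/PTU/hour (hypothetical)
hours_per_month=730

# Compute both monthly costs with awk (shell arithmetic is integer-only)
paygo_cost=$(awk -v t="$monthly_tokens" -v p="$paygo_per_1k_tokens" \
  'BEGIN { printf "%.2f", t / 1000 * p }')
ptu_cost=$(awk -v n="$ptu_count" -v r="$ptu_hourly_rate" -v h="$hours_per_month" \
  'BEGIN { printf "%.2f", n * r * h }')

echo "Pay-as-you-go: \$${paygo_cost}/month"
echo "PTU commitment: \$${ptu_cost}/month"
awk -v a="$paygo_cost" -v b="$ptu_cost" 'BEGIN {
  if (a + 0 < b + 0) print "Pay-as-you-go is cheaper at this volume";
  else print "PTU is cheaper at this volume" }'
```

With these placeholder numbers pay-as-you-go wins; the crossover point depends entirely on your real traffic volume and the rates on your agreement.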
PTU Capacity Planning
Official Calculation Methods
Agent Instruction: Only present official Azure capacity calculator methods below. Do NOT generate or suggest estimated PTU formulas, TPM-per-PTU conversion tables, or reference deprecated calculators (oai.azure.com/portal/calculator).
Calculate PTU requirements using these official methods:
Method 1: Microsoft Foundry Portal
- Navigate to Microsoft Foundry portal
- Go to Operate → Quota
- Select Provisioned throughput unit tab
- Click Capacity calculator button
- Enter workload parameters (model, tokens/call, RPM, latency target)
- Calculator returns exact PTU count needed
Method 2: Using Azure REST API
```shell
# Calculate required PTU capacity
curl -X POST "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CognitiveServices/calculateModelCapacity?api-version=2024-10-01" \
  -H "Authorization: Bearer <access-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": {
      "format": "OpenAI",
      "name": "gpt-4o",
      "version": "2024-05-13"
    },
    "workload": {
      "requestPerMin": 100,
      "tokensPerMin": 50000,
      "peakRequestsPerMin": 150
    }
  }'
```
Deploy Model with PTU
Step 1: Calculate PTU Requirements
Use the official capacity calculator methods above to determine required PTU capacity.
Step 2: Deploy with PTU
```shell
# Deploy model with calculated PTU capacity
az cognitiveservices account deployment create \
  --name <resource-name> \
  --resource-group <rg> \
  --deployment-name gpt-4o-ptu-deployment \
  --model-name gpt-4o \
  --model-version "2024-05-13" \
  --model-format OpenAI \
  --sku-name ProvisionedManaged \
  --sku-capacity 100

# Check PTU deployment status
az cognitiveservices account deployment show \
  --name <resource-name> \
  --resource-group <rg> \
  --deployment-name gpt-4o-ptu-deployment
```
Key Differences from Standard TPM:
- SKU name: ProvisionedManaged (not Standard)
- Capacity: Measured in PTU units (not K TPM)
- Billing: Monthly commitment regardless of usage
- No rate limiting (guaranteed throughput)
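PTU deployments can take a while to provision. The loop below is a convenience sketch (assuming the `az` CLI is logged in and the deployment names used above) that polls the deployment's provisioning state until it leaves `Creating`:

```shell
# Poll the PTU deployment until provisioning finishes (sketch; assumes az login)
while true; do
  state=$(az cognitiveservices account deployment show \
    --name <resource-name> \
    --resource-group <rg> \
    --deployment-name gpt-4o-ptu-deployment \
    --query "properties.provisioningState" -o tsv)
  echo "Provisioning state: ${state}"
  [ "$state" != "Creating" ] && break
  sleep 15
done
```

A terminal state of `Succeeded` means the deployment is ready; `Failed` usually indicates insufficient PTU quota in the region.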
Request PTU Quota Increase
PTU quota is separate from TPM quota and requires specific justification:
- Navigate to Azure Portal → Foundry resource → Quotas
- Select Provisioned throughput unit tab
- Identify model needing PTU increase (e.g., "GPT-4o PTU")
- Click Request quota increase
- Fill form:
- Model name
- Requested PTU quota
- Include capacity calculator results in business justification
- Explain workload characteristics (volume, latency requirements)
- Submit and monitor status
Processing Time: Typically 3-5 business days (longer than standard quota requests)
Note: PTU quota requests typically require stronger business justification due to the commitment involved
Alternative: Deploy to a different region with available PTU quota
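Before filing a request (or picking an alternative region), you can inspect current quota consumption from the CLI. The sketch below uses `az cognitiveservices usage list`; the exact quota names returned vary by model and subscription, so the `Provisioned` filter is an assumption to adapt:

```shell
# List quota usage for a candidate region
az cognitiveservices usage list --location eastus -o table

# Filter to entries mentioning provisioned capacity (name patterns vary by model)
az cognitiveservices usage list --location eastus \
  --query "[?contains(name.value, 'Provisioned')].{quota:name.value, used:currentValue, limit:limit}" \
  -o table
```

A region where `used` is well below `limit` is a candidate for deploying without a quota increase.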
Understanding Region and Deployment Quotas
Region Quota
- Maximum PTU capacity available in an Azure region
- Varies by model type (GPT-4, GPT-4o, etc.)
- Shared across subscription resources in same region
- Separate from TPM quota (you have both TPM and PTU quotas)
Deployment Slots
- Number of concurrent model deployments allowed
- Typically 10-20 slots per resource
- Each PTU deployment uses one slot (same as TPM deployments)
- Deployment count limit is independent of capacity
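To see how many of a resource's deployment slots are already in use, you can count its deployments with a JMESPath query (a sketch; the slot limit itself is not exposed by this command):

```shell
# Count existing deployments on a resource (each one consumes a slot)
az cognitiveservices account deployment list \
  --name <resource-name> \
  --resource-group <rg> \
  --query "length(@)" -o tsv
```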