Provisioned Throughput Units (PTU) Guide
Table of Contents: Understanding PTU vs Standard TPM · When to Use PTU · PTU Capacity Planning · Deploy Model with PTU · Request PTU Quota Increase · Understanding Region and Deployment Quotas · External Resources
Understanding PTU vs Standard TPM
Microsoft Foundry offers two quota types:
Standard TPM (Tokens Per Minute)
- Pay-as-you-go model, charged per token
- Each deployment consumes capacity units (e.g., 10K TPM, 50K TPM)
- Total regional quota shared across all deployments
- Subject to rate limiting during high demand (429 errors possible)
- Best for: Variable workloads, development, testing, bursty traffic
Provisioned Throughput Units (PTU)
- Monthly commitment for guaranteed throughput
- No rate limiting, consistent latency
- Measured in PTU units (not TPM)
- Best for: Predictable, high-volume production workloads
- More cost-effective when consistent token usage justifies monthly commitment
When to Use PTU
| Factor | Standard (TPM) | Provisioned (PTU) |
|---|---|---|
| Best For | Variable workloads, development, testing | Predictable production workloads |
| Pricing | Pay-per-token | Monthly commitment (hourly rate per PTU) |
| Rate Limits | Yes (429 errors possible) | No (guaranteed throughput) |
| Latency | Variable | Consistent |
| Cost Decision | Lower upfront commitment | More economical for consistent, high-volume usage |
| Flexibility | Scale up/down instantly | Requires planning and commitment |
| Use Case | Prototyping, bursty traffic | Production apps, high-volume APIs |
Use PTU when:
- You have consistent, predictable token usage that makes the monthly commitment cost-effective
- You need guaranteed throughput (no 429 rate limit errors)
- You require consistent latency backed by a performance SLA
- You run high-volume production workloads with stable traffic patterns
Decision Guidance: Compare your current pay-as-you-go costs with PTU pricing. PTU may be more economical when consistent usage justifies the monthly commitment.
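As a rough sketch of that comparison, the back-of-the-envelope check below contrasts a month of pay-as-you-go spend with a PTU commitment. All figures (token volume, per-token rate, PTU count, hourly PTU rate) are hypothetical placeholders, not actual Azure prices; substitute your own numbers from the Azure pricing page.

```shell
# Break-even sketch: pay-as-you-go vs. PTU commitment.
# Every number below is a HYPOTHETICAL placeholder -- use your real pricing.
monthly_tokens=2000000000        # 2B tokens/month (hypothetical volume)
paygo_per_1k_tokens=0.005        # $/1K tokens (hypothetical blended rate)
ptu_count=100                    # PTUs from the capacity calculator
ptu_hourly_rate=1.00             # $/PTU/hour (hypothetical)
hours_per_month=730

# Compute both monthly costs with awk (shell arithmetic is integer-only)
paygo_cost=$(awk -v t="$monthly_tokens" -v p="$paygo_per_1k_tokens" \
  'BEGIN { printf "%.2f", t / 1000 * p }')
ptu_cost=$(awk -v n="$ptu_count" -v r="$ptu_hourly_rate" -v h="$hours_per_month" \
  'BEGIN { printf "%.2f", n * r * h }')

echo "Pay-as-you-go: \$${paygo_cost}/month"
echo "PTU commitment: \$${ptu_cost}/month"
awk -v a="$paygo_cost" -v b="$ptu_cost" 'BEGIN {
  if (a + 0 < b + 0) print "Pay-as-you-go is cheaper at this volume";
  else print "PTU is cheaper at this volume" }'
```

With these placeholder numbers pay-as-you-go wins; the crossover point depends entirely on your real traffic volume and the rates on your agreement.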
PTU Capacity Planning
Official Calculation Methods
Agent Instruction: Only present official Azure capacity calculator methods below. Do NOT generate or suggest estimated PTU formulas, TPM-per-PTU conversion tables, or reference deprecated calculators (oai.azure.com/portal/calculator).
Calculate PTU requirements using these official methods:
Method 1: Microsoft Foundry Portal
- Navigate to Microsoft Foundry portal
- Go to Operate → Quota
- Select Provisioned throughput unit tab
- Click Capacity calculator button
- Enter workload parameters (model, tokens/call, RPM, latency target)
- Calculator returns exact PTU count needed
Method 2: Using Azure REST API
```shell
# Calculate required PTU capacity
curl -X POST "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CognitiveServices/calculateModelCapacity?api-version=2024-10-01" \
  -H "Authorization: Bearer <access-token>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": {
      "format": "OpenAI",
      "name": "gpt-4o",
      "version": "2024-05-13"
    },
    "workload": {
      "requestPerMin": 100,
      "tokensPerMin": 50000,
      "peakRequestsPerMin": 150
    }
  }'
```
Deploy Model with PTU
Step 1: Calculate PTU Requirements
Use the official capacity calculator methods above to determine required PTU capacity.
Step 2: Deploy with PTU
```shell
# Deploy model with calculated PTU capacity
az cognitiveservices account deployment create \
  --name <resource-name> \
  --resource-group <rg> \
  --deployment-name gpt-4o-ptu-deployment \
  --model-name gpt-4o \
  --model-version "2024-05-13" \
  --model-format OpenAI \
  --sku-name ProvisionedManaged \
  --sku-capacity 100

# Check PTU deployment status
az cognitiveservices account deployment show \
  --name <resource-name> \
  --resource-group <rg> \
  --deployment-name gpt-4o-ptu-deployment
```
Key Differences from Standard TPM:
- SKU name: ProvisionedManaged (not Standard)
- Capacity: Measured in PTU units (not K TPM)
- Billing: Monthly commitment regardless of usage
- No rate limiting (guaranteed throughput)
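PTU deployments can take a while to provision. The loop below is a convenience sketch (assuming the `az` CLI is logged in and the deployment names used above) that polls the deployment's provisioning state until it leaves `Creating`:

```shell
# Poll the PTU deployment until provisioning finishes (sketch; assumes az login)
while true; do
  state=$(az cognitiveservices account deployment show \
    --name <resource-name> \
    --resource-group <rg> \
    --deployment-name gpt-4o-ptu-deployment \
    --query "properties.provisioningState" -o tsv)
  echo "Provisioning state: ${state}"
  [ "$state" != "Creating" ] && break
  sleep 15
done
```

A terminal state of `Succeeded` means the deployment is ready; `Failed` usually indicates insufficient PTU quota in the region.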
Request PTU Quota Increase
PTU quota is separate from TPM quota and requires specific justification:
- Navigate to Azure Portal → Foundry resource → Quotas
- Select Provisioned throughput unit tab
- Identify model needing PTU increase (e.g., "GPT-4o PTU")
- Click Request quota increase
- Fill form:
- Model name
- Requested PTU quota
- Include capacity calculator results in business justification
- Explain workload characteristics (volume, latency requirements)
- Submit and monitor status
Processing Time: Typically 3-5 business days (longer than standard quota requests)
Note: PTU quota requests typically require stronger business justification due to the commitment involved
Alternative: Deploy to a different region with available PTU quota
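Before filing a request (or picking an alternative region), you can inspect current quota consumption from the CLI. The sketch below uses `az cognitiveservices usage list`; the exact quota names returned vary by model and subscription, so the `Provisioned` filter is an assumption to adapt:

```shell
# List quota usage for a candidate region
az cognitiveservices usage list --location eastus -o table

# Filter to entries mentioning provisioned capacity (name patterns vary by model)
az cognitiveservices usage list --location eastus \
  --query "[?contains(name.value, 'Provisioned')].{quota:name.value, used:currentValue, limit:limit}" \
  -o table
```

A region where `used` is well below `limit` is a candidate for deploying without a quota increase.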
Understanding Region and Deployment Quotas
Region Quota
- Maximum PTU capacity available in an Azure region
- Varies by model type (GPT-4, GPT-4o, etc.)
- Shared across subscription resources in same region
- Separate from TPM quota (you have both TPM and PTU quotas)
Deployment Slots
- Number of concurrent model deployments allowed
- Typically 10-20 slots per resource
- Each PTU deployment uses one slot (same as TPM deployments)
- Deployment count limit is independent of capacity
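To see how many of a resource's deployment slots are already in use, you can count its deployments with a JMESPath query (a sketch; the slot limit itself is not exposed by this command):

```shell
# Count existing deployments on a resource (each one consumes a slot)
az cognitiveservices account deployment list \
  --name <resource-name> \
  --resource-group <rg> \
  --query "length(@)" -o tsv
```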