Capacity Planning Guide

Comprehensive guide for planning Azure AI Foundry capacity, including cost analysis, model selection, and workload calculations.

Table of Contents: Cost Comparison: TPM vs PTU · Production Workload Examples · Model Selection and Deployment Type Guidance

Cost Comparison: TPM vs PTU

Official Pricing Sources:
Azure OpenAI Service Pricing - Official pay-per-token rates
PTU Costs and Billing Guide - PTU hourly rates and capacity planning

TPM (Standard) Pricing:

Pay-per-token for input/output
No upfront commitment
Rates: See Azure OpenAI Pricing
GPT-4o: ~$0.0025-$0.01/1K tokens
GPT-4 Turbo: ~$0.01-$0.03/1K
GPT-3.5 Turbo: ~$0.0005-$0.0015/1K
Best for: Variable workloads, unpredictable traffic

PTU (Provisioned) Pricing:

Hourly billing: $/PTU/hr × PTUs × 730 hrs/month
Monthly commitment with Reservations discounts
Rates: See PTU Billing Guide
Use PTU calculator to determine requirements (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
Best for: High-volume (>1M tokens/day), predictable traffic, guaranteed throughput

Cost Decision Framework (Analytical Guidance):

Step 1: Calculate monthly TPM cost
  Monthly TPM cost = (Daily tokens × 30 days × $price per 1K tokens) / 1000

Step 2: Calculate monthly PTU cost
  Monthly PTU cost = Required PTUs × 730 hours/month × $PTU-hour rate
  (Get Required PTUs from Azure AI Foundry portal: Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

Step 3: Compare
  Use PTU when: Monthly PTU cost < (Monthly TPM cost × 0.7)
  (Use 70% threshold to account for commitment risk)

Example Calculation (Analytical):

Scenario: 1M requests/day, average 1,000 tokens per request

Daily tokens: 1,000,000 × 1,000 = 1B tokens/day
TPM Cost (using GPT-4o at $0.005/1K avg): (1B × 30 × $0.005) / 1000 = ~$150,000/month
PTU Cost (estimated 100 PTU at ~$5/PTU-hour): 100 PTU × 730 hours × $5 = ~$365,000/month
Decision: Use TPM (significantly lower cost for this workload)

Important: Always use the official Azure Pricing Calculator and Azure AI Foundry portal PTU calculator (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab) for exact pricing by model, region, and workload. Prices vary by region and are subject to change.

Production Workload Examples

To estimate quota requirements, use real-world production scenarios with capacity calculations for gpt-4, version 0613 (from Azure Foundry Portal calculator):

Workload Type	Calls/Min	Prompt Tokens	Response Tokens	Cache Hit %	Total Tokens/Min	PTU Required	TPM Equivalent
RAG Chat	10	3,500	300	20%	38,000	100	38K TPM
Basic Chat	10	500	100	20%	6,000	100	6K TPM
Summarization	10	5,000	300	20%	53,000	100	53K TPM
Classification	10	3,800	10	20%	38,100	100	38K TPM

How to Estimate Your Production Quota Requirements:

To calculate your quota needs for production deployments, follow these steps:

Determine your peak calls per minute: Monitor or estimate maximum concurrent requests
Measure token usage: Average prompt size + response size
Account for cache hits: Prompt caching can reduce effective token count by 20-50%
Calculate total tokens/min: (Calls/min × (Prompt tokens + Response tokens)) × (1 - Cache %)
Choose deployment type:

TPM (Standard): Allocate 1.5-2× your calculated tokens/min for headroom
PTU (Provisioned): Use Azure AI Foundry portal PTU calculator for exact PTU count (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

Example Calculation (RAG Chat Production):

Peak: 10 calls/min
Prompt: 3,500 tokens (context + question)
Response: 300 tokens (answer)
Cache: 20% hit rate (reduces prompt tokens by 20%)
Total TPM needed: (10 × (3,500 × 0.8 + 300)) = 31,000 TPM
With 50% headroom: 46,500 TPM → Round to 50K TPM deployment

PTU Recommendation: For the combined workload (40 calls/min, 135K tokens/min total), use 200 PTU (from calculator above).

Model Selection and Deployment Type Guidance

Official Documentation:
Choose the Right AI Model for Your Workload - Microsoft Architecture Center
Azure OpenAI Models - Model capabilities, regions, and quotas
Understanding Deployment Types - Standard vs Provisioned guidance

Model Characteristics (from official Azure OpenAI documentation):

Model	Key Characteristics	Best For
GPT-4o	Matches GPT-4 Turbo performance in English text/coding, superior in non-English and vision tasks. Cheaper and faster than GPT-4 Turbo.	Multimodal tasks, cost-effective general purpose, high-volume production workloads
GPT-4 Turbo	Superior reasoning capabilities, larger context window (128K tokens)	Complex reasoning tasks, long-context analysis
GPT-3.5 Turbo	Most cost-effective, optimized for chat and completions, fast response time	Simple tasks, customer service, high-volume low-cost scenarios
GPT-4o mini	Fastest response time, low latency	Latency-sensitive applications requiring immediate responses
text-embedding-3-large	Purpose-built for vector embeddings	RAG applications, semantic search, document similarity

Deployment Type Selection (from official deployment types guide):

Traffic Pattern	Recommended Deployment Type	Reason
Variable, bursty traffic	Standard or Global Standard (pay-per-token)	No commitment, pay only for usage
Consistent high volume	Provisioned types (PTU)	Reserved capacity, predictable costs
Large batch jobs (non-time-sensitive)	Global Batch or DataZone Batch	50% cost savings vs Standard
Low latency variance required	Provisioned types	Guaranteed throughput, no rate limits
No regional restrictions	Global Standard or Global Provisioned	Access to best available capacity

Capacity Planning Approach (from PTU onboarding guide):

To calculate and estimate your capacity requirements:

Calculate your TPM requirements: Determine required tokens per minute based on your expected workload
Use the built-in capacity planner: Available in Azure AI Foundry portal (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
Input your metrics: Enter input TPM and output TPM based on your workload characteristics
Get PTU recommendation: The calculator provides PTU allocation recommendation
Compare costs: Evaluate Standard (TPM) vs Provisioned (PTU) using the official pricing calculator

Note: Microsoft does not publish specific "X requests/day = Y TPM" recommendations as capacity requirements vary significantly based on prompt size, response length, cache hit rates, and model choice. Use the built-in capacity planner with your actual workload characteristics.

Preparing the source view

Microsoft Foundry Skill

quota/references/capacity-planning.md

Capacity Planning Guide

Cost Comparison: TPM vs PTU

Production Workload Examples

Model Selection and Deployment Type Guidance