Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
quota/references/capacity-planning.md
1# Capacity Planning Guide23Comprehensive guide for planning Azure AI Foundry capacity, including cost analysis, model selection, and workload calculations.45**Table of Contents:** [Cost Comparison: TPM vs PTU](#cost-comparison-tpm-vs-ptu) · [Production Workload Examples](#production-workload-examples) · [Model Selection and Deployment Type Guidance](#model-selection-and-deployment-type-guidance)67## Cost Comparison: TPM vs PTU89> **Official Pricing Sources:**10> - [Azure OpenAI Service Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) - Official pay-per-token rates11> - [PTU Costs and Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding) - PTU hourly rates and capacity planning1213**TPM (Standard) Pricing:**14- Pay-per-token for input/output15- No upfront commitment16- **Rates**: See [Azure OpenAI Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/)17- GPT-4o: ~$0.0025-$0.01/1K tokens18- GPT-4 Turbo: ~$0.01-$0.03/1K19- GPT-3.5 Turbo: ~$0.0005-$0.0015/1K20- **Best for**: Variable workloads, unpredictable traffic2122**PTU (Provisioned) Pricing:**23- Hourly billing: `$/PTU/hr × PTUs × 730 hrs/month`24- Monthly commitment with Reservations discounts25- **Rates**: See [PTU Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)26- Use PTU calculator to determine requirements (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)27- **Best for**: High-volume (>1M tokens/day), predictable traffic, guaranteed throughput2829**Cost Decision Framework** (Analytical Guidance):3031```32Step 1: Calculate monthly TPM cost33Monthly TPM cost = (Daily tokens × 30 days × $price per 1K tokens) / 10003435Step 2: Calculate monthly PTU cost36Monthly PTU cost = Required PTUs × 730 hours/month × $PTU-hour rate37(Get Required PTUs from Azure AI Foundry portal: Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)3839Step 3: Compare40Use PTU when: Monthly PTU cost < (Monthly TPM cost × 0.7)41(Use 70% threshold to account for commitment risk)42```4344**Example Calculation** (Analytical):4546Scenario: 1M requests/day, average 1,000 tokens per request4748- **Daily tokens**: 1,000,000 × 1,000 = 1B tokens/day49- **TPM Cost** (using GPT-4o at $0.005/1K avg): (1B × 30 × $0.005) / 1000 = ~$150,000/month50- **PTU Cost** (estimated 100 PTU at ~$5/PTU-hour): 100 PTU × 730 hours × $5 = ~$365,000/month51- **Decision**: Use TPM (significantly lower cost for this workload)5253> **Important**: Always use the official [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator/) and Azure AI Foundry portal PTU calculator (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab) for exact pricing by model, region, and workload. Prices vary by region and are subject to change.5455---5657## Production Workload Examples5859To estimate quota requirements, use real-world production scenarios with capacity calculations for gpt-4, version 0613 (from Azure Foundry Portal calculator):6061| Workload Type | Calls/Min | Prompt Tokens | Response Tokens | Cache Hit % | Total Tokens/Min | PTU Required | TPM Equivalent |62|---------------|-----------|---------------|-----------------|-------------|------------------|--------------|----------------|63| **RAG Chat** | 10 | 3,500 | 300 | 20% | 38,000 | 100 | 38K TPM |64| **Basic Chat** | 10 | 500 | 100 | 20% | 6,000 | 100 | 6K TPM |65| **Summarization** | 10 | 5,000 | 300 | 20% | 53,000 | 100 | 53K TPM |66| **Classification** | 10 | 3,800 | 10 | 20% | 38,100 | 100 | 38K TPM |6768**How to Estimate Your Production Quota Requirements:**6970To calculate your quota needs for production deployments, follow these steps:71721. **Determine your peak calls per minute**: Monitor or estimate maximum concurrent requests732. **Measure token usage**: Average prompt size + response size743. **Account for cache hits**: Prompt caching can reduce effective token count by 20-50%754. **Calculate total tokens/min**: (Calls/min × (Prompt tokens + Response tokens)) × (1 - Cache %)765. **Choose deployment type**:77- **TPM (Standard)**: Allocate 1.5-2× your calculated tokens/min for headroom78- **PTU (Provisioned)**: Use Azure AI Foundry portal PTU calculator for exact PTU count (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)7980**Example Calculation (RAG Chat Production):**81- Peak: 10 calls/min82- Prompt: 3,500 tokens (context + question)83- Response: 300 tokens (answer)84- Cache: 20% hit rate (reduces prompt tokens by 20%)85- **Total TPM needed**: (10 × (3,500 × 0.8 + 300)) = 31,000 TPM86- **With 50% headroom**: 46,500 TPM → Round to **50K TPM deployment**8788**PTU Recommendation:**89For the combined workload (40 calls/min, 135K tokens/min total), use **200 PTU** (from calculator above).9091---9293## Model Selection and Deployment Type Guidance9495> **Official Documentation:**96> - [Choose the Right AI Model for Your Workload](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/choose-ai-model) - Microsoft Architecture Center97> - [Azure OpenAI Models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - Model capabilities, regions, and quotas98> - [Understanding Deployment Types](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types) - Standard vs Provisioned guidance99100**Model Characteristics** (from [official Azure OpenAI documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)):101102| Model | Key Characteristics | Best For |103|-------|---------------------|----------|104| **GPT-4o** | Matches GPT-4 Turbo performance in English text/coding, superior in non-English and vision tasks. Cheaper and faster than GPT-4 Turbo. | Multimodal tasks, cost-effective general purpose, high-volume production workloads |105| **GPT-4 Turbo** | Superior reasoning capabilities, larger context window (128K tokens) | Complex reasoning tasks, long-context analysis |106| **GPT-3.5 Turbo** | Most cost-effective, optimized for chat and completions, fast response time | Simple tasks, customer service, high-volume low-cost scenarios |107| **GPT-4o mini** | Fastest response time, low latency | Latency-sensitive applications requiring immediate responses |108| **text-embedding-3-large** | Purpose-built for vector embeddings | RAG applications, semantic search, document similarity |109110**Deployment Type Selection** (from [official deployment types guide](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types)):111112| Traffic Pattern | Recommended Deployment Type | Reason |113|-----------------|---------------------------|---------|114| **Variable, bursty traffic** | Standard or Global Standard (pay-per-token) | No commitment, pay only for usage |115| **Consistent high volume** | Provisioned types (PTU) | Reserved capacity, predictable costs |116| **Large batch jobs (non-time-sensitive)** | Global Batch or DataZone Batch | 50% cost savings vs Standard |117| **Low latency variance required** | Provisioned types | Guaranteed throughput, no rate limits |118| **No regional restrictions** | Global Standard or Global Provisioned | Access to best available capacity |119120**Capacity Planning Approach** (from [PTU onboarding guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)):121122To calculate and estimate your capacity requirements:1231241. **Calculate your TPM requirements**: Determine required tokens per minute based on your expected workload1252. **Use the built-in capacity planner**: Available in Azure AI Foundry portal (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)1263. **Input your metrics**: Enter input TPM and output TPM based on your workload characteristics1274. **Get PTU recommendation**: The calculator provides PTU allocation recommendation1285. **Compare costs**: Evaluate Standard (TPM) vs Provisioned (PTU) using the official pricing calculator129130> **Note**: Microsoft does not publish specific "X requests/day = Y TPM" recommendations as capacity requirements vary significantly based on prompt size, response length, cache hit rates, and model choice. Use the built-in capacity planner with your actual workload characteristics.131