Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
quota/references/capacity-planning.md
1# Capacity Planning Guide23Comprehensive guide for planning Azure AI Foundry capacity, including cost analysis, model selection, and workload calculations.45**Table of Contents:** [Cost Comparison: TPM vs PTU](#cost-comparison-tpm-vs-ptu) · [Production Workload Examples](#production-workload-examples) · [Model Selection and Deployment Type Guidance](#model-selection-and-deployment-type-guidance)67## Cost Comparison: TPM vs PTU89> **Official Pricing Sources:**10> - [Azure OpenAI Service Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) - Official pay-per-token rates11> - [PTU Costs and Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding) - PTU hourly rates and capacity planning1213**TPM (Standard) Pricing:**14- Pay-per-token for input/output15- No upfront commitment16- **Rates**: See [Azure OpenAI Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/)17- GPT-4o: ~$0.0025-$0.01/1K tokens18- GPT-4 Turbo: ~$0.01-$0.03/1K19- GPT-3.5 Turbo: ~$0.0005-$0.0015/1K20- **Best for**: Variable workloads, unpredictable traffic2122**PTU (Provisioned) Pricing:**23- Hourly billing: `$/PTU/hr × PTUs × 730 hrs/month`24- Monthly commitment with Reservations discounts25- **Rates**: See [PTU Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)26- Use PTU calculator to determine requirements (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)27- **Best for**: High-volume (>1M tokens/day), predictable traffic, guaranteed throughput2829**Cost Decision Framework** (Analytical Guidance):3031```32Step 1: Calculate monthly TPM cost33Monthly TPM cost = (Daily tokens × 30 days × $price per 1K tokens) / 10003435Step 2: Calculate monthly PTU cost36Monthly PTU cost = Required PTUs × 730 hours/month × $PTU-hour rate37(Get Required PTUs from Azure AI Foundry portal: Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)3839Step 3: Compare40Use PTU when: Monthly PTU cost < (Monthly TPM cost × 0.7)41(Use 70% threshold to account for commitment risk)42```4344**Example Calculation** (Analytical):4546Scenario: 1M requests/day, average 1,000 tokens per request4748- **Daily tokens**: 1,000,000 × 1,000 = 1B tokens/day49- **TPM Cost** (using GPT-4o at $0.005/1K avg): (1B × 30 × $0.005) / 1000 = ~$150,000/month50- **PTU Cost** (estimated 100 PTU at ~$5/PTU-hour): 100 PTU × 730 hours × $5 = ~$365,000/month51- **Decision**: Use TPM (significantly lower cost for this workload)5253> **Important**: Always use the official [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator/) and Azure AI Foundry portal PTU calculator (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab) for exact pricing by model, region, and workload. Prices vary by region and are subject to change.5455---5657## Production Workload Examples5859To estimate quota requirements, use real-world production scenarios with capacity calculations for gpt-4, version 0613 (from Azure Foundry Portal calculator):6061| Workload Type | Calls/Min | Prompt Tokens | Response Tokens | Cache Hit % | Total Tokens/Min | PTU Required | TPM Equivalent |62|---------------|-----------|---------------|-----------------|-------------|------------------|--------------|----------------|63| **RAG Chat** | 10 | 3,500 | 300 | 20% | 38,000 | 100 | 38K TPM |64| **Basic Chat** | 10 | 500 | 100 | 20% | 6,000 | 100 | 6K TPM |65| **Summarization** | 10 | 5,000 | 300 | 20% | 53,000 | 100 | 53K TPM |66| **Classification** | 10 | 3,800 | 10 | 20% | 38,100 | 100 | 38K TPM |6768**How to Estimate Your Production Quota Requirements:**6970To calculate your quota needs for production deployments, follow these steps:71721. **Determine your peak calls per minute**: Monitor or estimate maximum concurrent requests732. **Measure token usage**: Average prompt size + response size743. **Account for cache hits**: Prompt caching can reduce effective token count by 20-50%754. **Calculate total tokens/min**: (Calls/min × (Prompt tokens + Response tokens)) × (1 - Cache %)765. **Choose deployment type**:77- **TPM (Standard)**: Allocate 1.5-2× your calculated tokens/min for headroom78- **PTU (Provisioned)**: Use Azure AI Foundry portal PTU calculator for exact PTU count (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)7980**Example Calculation (RAG Chat Production):**81- Peak: 10 calls/min82- Prompt: 3,500 tokens (context + question)83- Response: 300 tokens (answer)84- Cache: 20% hit rate (reduces prompt tokens by 20%)85- **Total TPM needed**: (10 × (3,500 × 0.8 + 300)) = 31,000 TPM86- **With 50% headroom**: 46,500 TPM → Round to **50K TPM deployment**8788**PTU Recommendation:**89For the combined workload (40 calls/min, 135K tokens/min total), use **200 PTU** (from calculator above).9091---9293## Model Selection and Deployment Type Guidance9495> **Official Documentation:**96> - [Choose the Right AI Model for Your Workload](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/choose-ai-model) - Microsoft Architecture Center97> - [Azure OpenAI Models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - Model capabilities, regions, and quotas98> - [Understanding Deployment Types](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types) - Standard vs Provisioned guidance99100**Model Characteristics** (from [official Azure OpenAI documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)):101102| Model | Key Characteristics | Best For |103|-------|---------------------|----------|104| **GPT-4o** | Matches GPT-4 Turbo performance in English text/coding, superior in non-English and vision tasks. Cheaper and faster than GPT-4 Turbo. | Multimodal tasks, cost-effective general purpose, high-volume production workloads |105| **GPT-4 Turbo** | Superior reasoning capabilities, larger context window (128K tokens) | Complex reasoning tasks, long-context analysis |106| **GPT-3.5 Turbo** | Most cost-effective, optimized for chat and completions, fast response time | Simple tasks, customer service, high-volume low-cost scenarios |107| **GPT-4o mini** | Fastest response time, low latency | Latency-sensitive applications requiring immediate responses |108| **text-embedding-3-large** | Purpose-built for vector embeddings | RAG applications, semantic search, document similarity |109110**Deployment Type Selection** (from [official deployment types guide](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types)):111112| Traffic Pattern | Recommended Deployment Type | Reason |113|-----------------|---------------------------|---------|114| **Variable, bursty traffic** | Standard or Global Standard (pay-per-token) | No commitment, pay only for usage |115| **Consistent high volume** | Provisioned types (PTU) | Reserved capacity, predictable costs |116| **Large batch jobs (non-time-sensitive)** | Global Batch or DataZone Batch | 50% cost savings vs Standard |117| **Low latency variance required** | Provisioned types | Guaranteed throughput, no rate limits |118| **No regional restrictions** | Global Standard or Global Provisioned | Access to best available capacity |119120**Capacity Planning Approach** (from [PTU onboarding guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)):121122To calculate and estimate your capacity requirements:1231241. **Calculate your TPM requirements**: Determine required tokens per minute based on your expected workload1252. **Use the built-in capacity planner**: Available in Azure AI Foundry portal (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)1263. **Input your metrics**: Enter input TPM and output TPM based on your workload characteristics1274. **Get PTU recommendation**: The calculator provides PTU allocation recommendation1285. **Compare costs**: Evaluate Standard (TPM) vs Provisioned (PTU) using the official pricing calculator129130> **Note**: Microsoft does not publish specific "X requests/day = Y TPM" recommendations as capacity requirements vary significantly based on prompt size, response length, cache hit rates, and model choice. Use the built-in capacity planner with your actual workload characteristics.131