Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
quota/references/capacity-planning.md

# Capacity Planning Guide

Comprehensive guide for planning Azure AI Foundry capacity, including cost analysis, model selection, and workload calculations.

**Table of Contents:** [Cost Comparison: TPM vs PTU](#cost-comparison-tpm-vs-ptu) · [Production Workload Examples](#production-workload-examples) · [Model Selection and Deployment Type Guidance](#model-selection-and-deployment-type-guidance)

## Cost Comparison: TPM vs PTU

> **Official Pricing Sources:**
> - [Azure OpenAI Service Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) - Official pay-per-token rates
> - [PTU Costs and Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding) - PTU hourly rates and capacity planning

**TPM (Standard) Pricing:**
- Pay-per-token for input/output
- No upfront commitment
- **Rates**: See [Azure OpenAI Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/)
  - GPT-4o: ~$0.0025-$0.01/1K tokens
  - GPT-4 Turbo: ~$0.01-$0.03/1K
  - GPT-3.5 Turbo: ~$0.0005-$0.0015/1K
- **Best for**: Variable workloads, unpredictable traffic

**PTU (Provisioned) Pricing:**
- Hourly billing: `$/PTU/hr × PTUs × 730 hrs/month`
- Monthly commitment with Reservations discounts
- **Rates**: See [PTU Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)
- Use PTU calculator to determine requirements (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
- **Best for**: High-volume (>1M tokens/day), predictable traffic, guaranteed throughput

**Cost Decision Framework** (Analytical Guidance):

```
Step 1: Calculate monthly TPM cost
  Monthly TPM cost = (Daily tokens × 30 days × $price per 1K tokens) / 1000

Step 2: Calculate monthly PTU cost
  Monthly PTU cost = Required PTUs × 730 hours/month × $PTU-hour rate
  (Get Required PTUs from Azure AI Foundry portal: Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

Step 3: Compare
  Use PTU when: Monthly PTU cost < (Monthly TPM cost × 0.7)
  (Use 70% threshold to account for commitment risk)
```

**Example Calculation** (Analytical):

Scenario: 1M requests/day, average 1,000 tokens per request

- **Daily tokens**: 1,000,000 × 1,000 = 1B tokens/day
- **TPM Cost** (using GPT-4o at $0.005/1K avg): (1B × 30 × $0.005) / 1000 = ~$150,000/month
- **PTU Cost** (estimated 100 PTU at ~$5/PTU-hour): 100 PTU × 730 hours × $5 = ~$365,000/month
- **Decision**: Use TPM. The PTU cost (~$365,000) is well above the 70% threshold of ~$105,000 (0.7 × $150,000).

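The three framework steps and the example scenario can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and not part of any Azure SDK, and the $0.005/1K and $5/PTU-hour rates are the example's estimates, not official prices.

```python
HOURS_PER_MONTH = 730  # billing hours per month, as used in the framework
DAYS_PER_MONTH = 30


def monthly_tpm_cost(daily_tokens: float, price_per_1k: float) -> float:
    """Step 1: pay-per-token cost for one month of traffic."""
    return daily_tokens * DAYS_PER_MONTH * price_per_1k / 1000


def monthly_ptu_cost(ptus: int, ptu_hour_rate: float) -> float:
    """Step 2: provisioned cost for one month at a flat hourly rate."""
    return ptus * HOURS_PER_MONTH * ptu_hour_rate


def prefer_ptu(tpm_cost: float, ptu_cost: float, threshold: float = 0.7) -> bool:
    """Step 3: pick PTU only if it costs less than `threshold` x the TPM cost."""
    return ptu_cost < tpm_cost * threshold


# Example scenario: 1M requests/day at 1,000 tokens each (assumed rates).
daily_tokens = 1_000_000 * 1_000
tpm = monthly_tpm_cost(daily_tokens, price_per_1k=0.005)
ptu = monthly_ptu_cost(ptus=100, ptu_hour_rate=5.0)
decision = "PTU" if prefer_ptu(tpm, ptu) else "TPM"
print(f"TPM ${tpm:,.0f}/month, PTU ${ptu:,.0f}/month -> use {decision}")
```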
> **Important**: Always use the official [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator/) and Azure AI Foundry portal PTU calculator (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab) for exact pricing by model, region, and workload. Prices vary by region and are subject to change.

---

## Production Workload Examples

The scenarios below illustrate quota estimation for typical production workloads. The PTU figures come from the Azure AI Foundry portal calculator for gpt-4, version 0613. Total Tokens/Min is the raw total, Calls/Min × (Prompt Tokens + Response Tokens), before any cache savings.

| Workload Type | Calls/Min | Prompt Tokens | Response Tokens | Cache Hit % | Total Tokens/Min | PTU Required | TPM Equivalent |
|---------------|-----------|---------------|-----------------|-------------|------------------|--------------|----------------|
| **RAG Chat** | 10 | 3,500 | 300 | 20% | 38,000 | 100 | 38K TPM |
| **Basic Chat** | 10 | 500 | 100 | 20% | 6,000 | 100 | 6K TPM |
| **Summarization** | 10 | 5,000 | 300 | 20% | 53,000 | 100 | 53K TPM |
| **Classification** | 10 | 3,800 | 10 | 20% | 38,100 | 100 | 38K TPM |

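As a quick check, the Total Tokens/Min column can be reproduced from the other columns. The sketch below assumes the column is the uncached total (calls × (prompt + response)); the `Workload` class is a hypothetical helper for illustration, and the PTU figures are not computed here because they come from the portal calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    calls_per_min: int
    prompt_tokens: int
    response_tokens: int

    @property
    def tokens_per_min(self) -> int:
        # Raw total before cache savings, matching the table column.
        return self.calls_per_min * (self.prompt_tokens + self.response_tokens)


# Rows copied from the table above.
WORKLOADS = [
    Workload("RAG Chat", 10, 3_500, 300),
    Workload("Basic Chat", 10, 500, 100),
    Workload("Summarization", 10, 5_000, 300),
    Workload("Classification", 10, 3_800, 10),
]

for w in WORKLOADS:
    print(f"{w.name:<15} {w.tokens_per_min:>7,} tokens/min")

total_calls = sum(w.calls_per_min for w in WORKLOADS)
total_tokens = sum(w.tokens_per_min for w in WORKLOADS)
print(f"Combined: {total_calls} calls/min, {total_tokens:,} tokens/min")
```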
**How to Estimate Your Production Quota Requirements:**

To calculate your quota needs for production deployments, follow these steps:

1. **Determine your peak calls per minute**: Monitor or estimate your maximum request rate
2. **Measure token usage**: Average prompt size and average response size, in tokens
3. **Account for cache hits**: Prompt caching can reduce effective token count by 20-50%
4. **Calculate total tokens/min**: Calls/min × (Prompt tokens × (1 - Cache hit %) + Response tokens). Caching reduces prompt tokens only.
5. **Choose deployment type**:
   - **TPM (Standard)**: Allocate 1.5-2× your calculated tokens/min for headroom
   - **PTU (Provisioned)**: Use Azure AI Foundry portal PTU calculator for exact PTU count (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

**Example Calculation (RAG Chat Production):**
- Peak: 10 calls/min
- Prompt: 3,500 tokens (context + question)
- Response: 300 tokens (answer)
- Cache: 20% hit rate (reduces prompt tokens by 20%)
- **Total TPM needed**: 10 × (3,500 × 0.8 + 300) = 31,000 TPM
- **With 50% headroom**: 31,000 × 1.5 = 46,500 TPM → Round to **50K TPM deployment**

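The RAG Chat sizing above can be sketched as a small helper. The function names are hypothetical, and rounding up to the next 10K TPM is an assumption chosen to match the example's "round to 50K"; actual quota increments may differ by model and region.

```python
import math


def effective_tpm(calls_per_min: int, prompt_tokens: int,
                  response_tokens: int, cache_hit_rate: float) -> float:
    """Tokens/min after caching; the discount applies to prompt tokens only."""
    cached_prompt = prompt_tokens * (1 - cache_hit_rate)
    return calls_per_min * (cached_prompt + response_tokens)


def standard_quota(tpm: float, headroom: float = 0.5,
                   round_to: int = 10_000) -> int:
    """Add headroom, then round up to a multiple of `round_to` (assumed increment)."""
    return math.ceil(tpm * (1 + headroom) / round_to) * round_to


# RAG Chat example: 10 calls/min, 3,500 prompt + 300 response tokens, 20% cache hits.
needed = effective_tpm(10, 3_500, 300, cache_hit_rate=0.20)
quota = standard_quota(needed)
print(f"needed ~{needed:,.0f} TPM, deploy {quota:,} TPM")
```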
**PTU Recommendation:**
For the four workloads in the table combined (40 calls/min, ~135K tokens/min total), use **200 PTU** (from the portal calculator).

---

## Model Selection and Deployment Type Guidance

> **Official Documentation:**
> - [Choose the Right AI Model for Your Workload](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/choose-ai-model) - Microsoft Architecture Center
> - [Azure OpenAI Models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - Model capabilities, regions, and quotas
> - [Understanding Deployment Types](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types) - Standard vs Provisioned guidance

**Model Characteristics** (from [official Azure OpenAI documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)):

| Model | Key Characteristics | Best For |
|-------|---------------------|----------|
| **GPT-4o** | Matches GPT-4 Turbo performance in English text/coding, superior in non-English and vision tasks. Cheaper and faster than GPT-4 Turbo. | Multimodal tasks, cost-effective general purpose, high-volume production workloads |
| **GPT-4 Turbo** | Superior reasoning capabilities, larger context window (128K tokens) | Complex reasoning tasks, long-context analysis |
| **GPT-3.5 Turbo** | Most cost-effective, optimized for chat and completions, fast response time | Simple tasks, customer service, high-volume low-cost scenarios |
| **GPT-4o mini** | Fastest response time, low latency | Latency-sensitive applications requiring immediate responses |
| **text-embedding-3-large** | Purpose-built for vector embeddings | RAG applications, semantic search, document similarity |

**Deployment Type Selection** (from [official deployment types guide](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types)):

| Traffic Pattern | Recommended Deployment Type | Reason |
|-----------------|---------------------------|---------|
| **Variable, bursty traffic** | Standard or Global Standard (pay-per-token) | No commitment, pay only for usage |
| **Consistent high volume** | Provisioned types (PTU) | Reserved capacity, predictable costs |
| **Large batch jobs (non-time-sensitive)** | Global Batch or DataZone Batch | 50% cost savings vs Standard |
| **Low latency variance required** | Provisioned types | Guaranteed throughput, no rate limits |
| **No regional restrictions** | Global Standard or Global Provisioned | Access to best available capacity |

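If the deployment-type guidance needs to live in tooling (for example, a sizing script), one possible encoding is a plain lookup table. The enum and function below are hypothetical names for illustration; the recommendations and reasons are copied from the table above, not derived from any Azure API.

```python
from enum import Enum


class TrafficPattern(Enum):
    VARIABLE_BURSTY = "variable, bursty traffic"
    CONSISTENT_HIGH_VOLUME = "consistent high volume"
    LARGE_BATCH = "large batch jobs (non-time-sensitive)"
    LOW_LATENCY_VARIANCE = "low latency variance required"
    NO_REGIONAL_RESTRICTIONS = "no regional restrictions"


# (recommended deployment type, reason), one entry per table row.
RECOMMENDATIONS: dict[TrafficPattern, tuple[str, str]] = {
    TrafficPattern.VARIABLE_BURSTY: (
        "Standard or Global Standard (pay-per-token)",
        "No commitment, pay only for usage"),
    TrafficPattern.CONSISTENT_HIGH_VOLUME: (
        "Provisioned types (PTU)",
        "Reserved capacity, predictable costs"),
    TrafficPattern.LARGE_BATCH: (
        "Global Batch or DataZone Batch",
        "50% cost savings vs Standard"),
    TrafficPattern.LOW_LATENCY_VARIANCE: (
        "Provisioned types",
        "Guaranteed throughput, no rate limits"),
    TrafficPattern.NO_REGIONAL_RESTRICTIONS: (
        "Global Standard or Global Provisioned",
        "Access to best available capacity"),
}


def recommend(pattern: TrafficPattern) -> str:
    """Return the table's recommendation and reason for a traffic pattern."""
    deployment, reason = RECOMMENDATIONS[pattern]
    return f"{deployment}: {reason}"


print(recommend(TrafficPattern.LARGE_BATCH))
```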
**Capacity Planning Approach** (from [PTU onboarding guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)):

To estimate your capacity requirements:

1. **Calculate your TPM requirements**: Determine required tokens per minute based on your expected workload
2. **Use the built-in capacity planner**: Available in Azure AI Foundry portal (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
3. **Input your metrics**: Enter input TPM and output TPM based on your workload characteristics
4. **Get PTU recommendation**: The calculator returns a recommended PTU allocation
5. **Compare costs**: Evaluate Standard (TPM) vs Provisioned (PTU) using the official pricing calculator

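Steps 1 and 3 ask for separate input and output TPM figures. A minimal sketch of deriving them from per-call averages follows, using the RAG Chat numbers from this guide. The function and key names are hypothetical, and the PTU count itself still has to come from the portal's capacity planner; this code does not call it.

```python
def planner_inputs(calls_per_min: int, prompt_tokens: int,
                   response_tokens: int) -> dict[str, int]:
    """Split a workload into input TPM and output TPM for the capacity planner."""
    return {
        "input_tpm": calls_per_min * prompt_tokens,     # tokens sent per minute
        "output_tpm": calls_per_min * response_tokens,  # tokens generated per minute
    }


# RAG Chat: 10 calls/min, 3,500 prompt tokens, 300 response tokens.
rag_chat = planner_inputs(calls_per_min=10, prompt_tokens=3_500, response_tokens=300)
print(rag_chat)
```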
> **Note**: Microsoft does not publish specific "X requests/day = Y TPM" recommendations as capacity requirements vary significantly based on prompt size, response length, cache hit rates, and model choice. Use the built-in capacity planner with your actual workload characteristics.