Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
quota/references/capacity-planning.md

# Capacity Planning Guide

Comprehensive guide for planning Azure AI Foundry capacity, including cost analysis, model selection, and workload calculations.

**Table of Contents:** [Cost Comparison: TPM vs PTU](#cost-comparison-tpm-vs-ptu) · [Production Workload Examples](#production-workload-examples) · [Model Selection and Deployment Type Guidance](#model-selection-and-deployment-type-guidance)

## Cost Comparison: TPM vs PTU

> **Official Pricing Sources:**
> - [Azure OpenAI Service Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/) - Official pay-per-token rates
> - [PTU Costs and Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding) - PTU hourly rates and capacity planning

**TPM (Standard) Pricing:**
- Pay-per-token for input/output
- No upfront commitment
- **Rates**: See [Azure OpenAI Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/)
  - GPT-4o: ~$0.0025-$0.01/1K tokens
  - GPT-4 Turbo: ~$0.01-$0.03/1K tokens
  - GPT-3.5 Turbo: ~$0.0005-$0.0015/1K tokens
- **Best for**: Variable workloads, unpredictable traffic

**PTU (Provisioned) Pricing:**
- Hourly billing: `$/PTU/hr × PTUs × 730 hrs/month`
- Optional monthly or yearly commitment through Azure Reservations for discounted rates
- **Rates**: See [PTU Billing Guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)
- Use the PTU calculator to determine requirements (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
- **Best for**: High-volume (>1M tokens/day), predictable traffic, guaranteed throughput

**Cost Decision Framework** (Analytical Guidance):

```
Step 1: Calculate monthly TPM cost
  Monthly TPM cost = (Daily tokens × 30 days × $price per 1K tokens) / 1000

Step 2: Calculate monthly PTU cost
  Monthly PTU cost = Required PTUs × 730 hours/month × $PTU-hour rate
  (Get Required PTUs from Azure AI Foundry portal: Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

Step 3: Compare
  Use PTU when: Monthly PTU cost < (Monthly TPM cost × 0.7)
  (Use 70% threshold to account for commitment risk)
```

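The three steps above can be sketched in Python as follows. This is a minimal illustration, not an official calculator: the function names are hypothetical, and the sample inputs (50M tokens/day, $0.005 per 1K tokens, 10 PTUs at $1 per PTU-hour) are assumptions chosen only to exercise the formulas. Real rates and PTU counts must come from the pricing pages and the portal calculator.

```python
HOURS_PER_MONTH = 730
DAYS_PER_MONTH = 30
COMMITMENT_THRESHOLD = 0.7  # the 70% threshold from Step 3


def monthly_tpm_cost(daily_tokens: float, price_per_1k_tokens: float) -> float:
    """Step 1: pay-per-token cost for one month."""
    return daily_tokens * DAYS_PER_MONTH * price_per_1k_tokens / 1000


def monthly_ptu_cost(required_ptus: int, ptu_hour_rate: float) -> float:
    """Step 2: provisioned cost for one month."""
    return required_ptus * HOURS_PER_MONTH * ptu_hour_rate


def prefer_ptu(tpm_cost: float, ptu_cost: float) -> bool:
    """Step 3: choose PTU only if it undercuts 70% of the TPM cost."""
    return ptu_cost < tpm_cost * COMMITMENT_THRESHOLD


# Hypothetical inputs, for illustration only.
tpm = monthly_tpm_cost(daily_tokens=50_000_000, price_per_1k_tokens=0.005)
ptu = monthly_ptu_cost(required_ptus=10, ptu_hour_rate=1.0)
print(f"TPM: ${tpm:,.0f}/month, PTU: ${ptu:,.0f}/month")
print("Use PTU" if prefer_ptu(tpm, ptu) else "Use TPM")
```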
**Example Calculation** (Analytical):

Scenario: 1M requests/day, average 1,000 tokens per request

- **Daily tokens**: 1,000,000 × 1,000 = 1B tokens/day
- **TPM Cost** (using GPT-4o at $0.005/1K avg): (1B × 30 × $0.005) / 1000 = ~$150,000/month
- **PTU Cost** (estimated 100 PTU at ~$5/PTU-hour): 100 PTU × 730 hours × $5 = ~$365,000/month
- **Decision**: Use TPM. The PTU cost ($365,000) is well above the 70% threshold of the TPM cost ($150,000 × 0.7 = $105,000).

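Rearranging the Step 3 inequality gives the daily token volume at which PTU would start to win for a fixed monthly PTU cost. The sketch below uses the example's illustrative figures ($0.005 per 1K tokens, 100 PTU at ~$5 per PTU-hour). It assumes the PTU count stays fixed as volume grows, which is a simplification: a real deployment needs more PTUs at higher throughput, so treat the result as a rough lower bound. The function name is hypothetical.

```python
def break_even_daily_tokens(
    ptu_monthly_cost: float,
    price_per_1k_tokens: float,
    threshold: float = 0.7,
    days_per_month: int = 30,
) -> float:
    """Daily tokens above which PTU cost < threshold × TPM cost.

    Derived from: ptu_cost < threshold × (daily × days × price / 1000).
    """
    return ptu_monthly_cost * 1000 / (threshold * days_per_month * price_per_1k_tokens)


ptu_monthly_cost = 100 * 730 * 5  # $365,000, as in the example
break_even = break_even_daily_tokens(ptu_monthly_cost, price_per_1k_tokens=0.005)
print(f"{break_even / 1e9:.2f}B tokens/day")  # 3.48B tokens/day
```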
> **Important**: Always use the official [Azure Pricing Calculator](https://azure.microsoft.com/pricing/calculator/) and Azure AI Foundry portal PTU calculator (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab) for exact pricing by model, region, and workload. Prices vary by region and are subject to change.

---

## Production Workload Examples

The following production scenarios show capacity calculations for gpt-4, version 0613, taken from the Azure AI Foundry portal calculator. Use them as reference points when estimating quota requirements. Total Tokens/Min is Calls/Min × (Prompt Tokens + Response Tokens), before any cache reduction:

| Workload Type | Calls/Min | Prompt Tokens | Response Tokens | Cache Hit % | Total Tokens/Min | PTU Required | TPM Equivalent |
|---------------|-----------|---------------|-----------------|-------------|------------------|--------------|----------------|
| **RAG Chat** | 10 | 3,500 | 300 | 20% | 38,000 | 100 | 38K TPM |
| **Basic Chat** | 10 | 500 | 100 | 20% | 6,000 | 100 | 6K TPM |
| **Summarization** | 10 | 5,000 | 300 | 20% | 53,000 | 100 | 53K TPM |
| **Classification** | 10 | 3,800 | 10 | 20% | 38,100 | 100 | 38K TPM |

**How to Estimate Your Production Quota Requirements:**

To calculate your quota needs for production deployments, follow these steps:

1. **Determine your peak calls per minute**: Monitor or estimate the maximum number of requests per minute
2. **Measure token usage**: Average prompt size + response size
3. **Account for cache hits**: Prompt caching can reduce the effective prompt token count by 20-50%
4. **Calculate total tokens/min**: Calls/min × (Prompt tokens × (1 - Cache %) + Response tokens)
5. **Choose deployment type**:
   - **TPM (Standard)**: Allocate 1.5-2× your calculated tokens/min for headroom
   - **PTU (Provisioned)**: Use Azure AI Foundry portal PTU calculator for exact PTU count (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)

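Steps 1 through 5 for a Standard (TPM) deployment can be sketched as below. The helper names are hypothetical. Two assumptions are baked in: the cache hit rate is applied to prompt tokens only (as in the guide's RAG Chat example), and the final allocation is rounded up to the next 10K TPM, which is a convenience choice rather than a platform rule. The inputs are the Basic Chat row from the table.

```python
import math


def required_tpm(
    calls_per_min: int,
    prompt_tokens: int,
    response_tokens: int,
    cache_hit_rate: float = 0.0,
) -> float:
    """Tokens/min after discounting cached prompt tokens (assumption: cache affects prompts only)."""
    effective_prompt = prompt_tokens * (1 - cache_hit_rate)
    return calls_per_min * (effective_prompt + response_tokens)


def standard_allocation(tpm: float, headroom: float = 1.5, round_to: int = 10_000) -> int:
    """Apply headroom (1.5-2x per the guide) and round up to a tidy quota value."""
    return math.ceil(tpm * headroom / round_to) * round_to


# Basic Chat: 10 calls/min, 500 prompt tokens, 100 response tokens, 20% cache hits.
tpm = required_tpm(10, 500, 100, cache_hit_rate=0.2)
print(f"Calculated: {tpm:,.0f} TPM")
print(f"Allocate:   {standard_allocation(tpm):,} TPM")
```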
**Example Calculation (RAG Chat Production):**
- Peak: 10 calls/min
- Prompt: 3,500 tokens (context + question)
- Response: 300 tokens (answer)
- Cache: 20% hit rate (reduces prompt tokens by 20%)
- **Total TPM needed**: 10 × (3,500 × 0.8 + 300) = 31,000 TPM
- **With 50% headroom**: 31,000 × 1.5 = 46,500 TPM → Round to **50K TPM deployment**

**PTU Recommendation:**
For the four workloads in the table combined (40 calls/min, ~135K tokens/min total), the portal PTU calculator recommends **200 PTU**.

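The combined figures can be checked by summing the rows of the workload table. This short sketch only reproduces the arithmetic behind "40 calls/min, 135K tokens/min"; it does not compute the PTU count, which still has to come from the portal calculator.

```python
# (calls/min, prompt tokens, response tokens), copied from the workload table.
workloads = {
    "RAG Chat": (10, 3_500, 300),
    "Basic Chat": (10, 500, 100),
    "Summarization": (10, 5_000, 300),
    "Classification": (10, 3_800, 10),
}

total_calls = sum(calls for calls, _, _ in workloads.values())
total_tokens = sum(
    calls * (prompt + response) for calls, prompt, response in workloads.values()
)

print(total_calls)   # 40
print(total_tokens)  # 135100
```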
---

## Model Selection and Deployment Type Guidance

> **Official Documentation:**
> - [Choose the Right AI Model for Your Workload](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/choose-ai-model) - Microsoft Architecture Center
> - [Azure OpenAI Models](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models) - Model capabilities, regions, and quotas
> - [Understanding Deployment Types](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types) - Standard vs Provisioned guidance

**Model Characteristics** (from [official Azure OpenAI documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)):

| Model | Key Characteristics | Best For |
|-------|---------------------|----------|
| **GPT-4o** | Matches GPT-4 Turbo performance in English text/coding, superior in non-English and vision tasks. Cheaper and faster than GPT-4 Turbo. | Multimodal tasks, cost-effective general purpose, high-volume production workloads |
| **GPT-4 Turbo** | Superior reasoning capabilities, larger context window (128K tokens) | Complex reasoning tasks, long-context analysis |
| **GPT-3.5 Turbo** | Most cost-effective, optimized for chat and completions, fast response time | Simple tasks, customer service, high-volume low-cost scenarios |
| **GPT-4o mini** | Fastest response time, low latency | Latency-sensitive applications requiring immediate responses |
| **text-embedding-3-large** | Purpose-built for vector embeddings | RAG applications, semantic search, document similarity |

**Deployment Type Selection** (from [official deployment types guide](https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/concepts/deployment-types)):

| Traffic Pattern | Recommended Deployment Type | Reason |
|-----------------|---------------------------|---------|
| **Variable, bursty traffic** | Standard or Global Standard (pay-per-token) | No commitment, pay only for usage |
| **Consistent high volume** | Provisioned types (PTU) | Reserved capacity, predictable costs |
| **Large batch jobs (non-time-sensitive)** | Global Batch or DataZone Batch | 50% cost savings vs Standard |
| **Low latency variance required** | Provisioned types | Reserved throughput with consistent latency |
| **No regional restrictions** | Global Standard or Global Provisioned | Access to best available capacity |

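For scripted planning, the table above can be encoded as a simple lookup. This is only a sketch: the dictionary keys are hypothetical labels invented for this example, and the values are copied from the "Recommended Deployment Type" column.

```python
# Keys are hypothetical labels; values come from the deployment type table.
DEPLOYMENT_BY_TRAFFIC = {
    "variable_bursty": "Standard or Global Standard (pay-per-token)",
    "consistent_high_volume": "Provisioned types (PTU)",
    "large_batch": "Global Batch or DataZone Batch",
    "low_latency_variance": "Provisioned types",
    "no_regional_restrictions": "Global Standard or Global Provisioned",
}


def recommend_deployment(pattern: str) -> str:
    """Return the table's recommendation for a known traffic pattern label."""
    try:
        return DEPLOYMENT_BY_TRAFFIC[pattern]
    except KeyError:
        raise ValueError(f"Unknown traffic pattern: {pattern!r}") from None


print(recommend_deployment("large_batch"))  # Global Batch or DataZone Batch
```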
**Capacity Planning Approach** (from [PTU onboarding guide](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/provisioned-throughput-onboarding)):

To estimate your capacity requirements:

1. **Calculate your TPM requirements**: Determine required tokens per minute based on your expected workload
2. **Use the built-in capacity planner**: Available in Azure AI Foundry portal (Microsoft Foundry → Operate → Quota → Provisioned Throughput Unit tab)
3. **Input your metrics**: Enter input TPM and output TPM based on your workload characteristics
4. **Get PTU recommendation**: The calculator provides a PTU allocation recommendation
5. **Compare costs**: Evaluate Standard (TPM) vs Provisioned (PTU) using the official pricing calculator

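Step 3 asks for input TPM and output TPM separately. A minimal sketch of how those two numbers might be derived from per-call averages is shown below; the function name and dictionary keys are hypothetical, and the sample values are the RAG Chat figures from this guide (10 calls/min, 3,500 prompt tokens, 300 response tokens).

```python
def calculator_inputs(calls_per_min: int, prompt_tokens: int, response_tokens: int) -> dict:
    """Split a workload into the input and output tokens-per-minute figures."""
    return {
        "input_tpm": calls_per_min * prompt_tokens,
        "output_tpm": calls_per_min * response_tokens,
    }


rag_chat = calculator_inputs(calls_per_min=10, prompt_tokens=3_500, response_tokens=300)
print(rag_chat)  # {'input_tpm': 35000, 'output_tpm': 3000}
```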
> **Note**: Microsoft does not publish specific "X requests/day = Y TPM" recommendations as capacity requirements vary significantly based on prompt size, response length, cache hit rates, and model choice. Use the built-in capacity planner with your actual workload characteristics.