Microsoft Foundry Skill

Deploy, evaluate, and manage AI agents end-to-end on Microsoft Azure AI Foundry
quota/references/optimization.md

# Quota Optimization Strategies

Comprehensive strategies for optimizing Azure AI Foundry quota allocation and reducing costs.

**Table of Contents:** [1. Identify and Delete Unused Deployments](#1-identify-and-delete-unused-deployments) · [2. Right-Size Over-Provisioned Deployments](#2-right-size-over-provisioned-deployments) · [3. Consolidate Multiple Small Deployments](#3-consolidate-multiple-small-deployments) · [4. Cost Optimization Strategies](#4-cost-optimization-strategies) · [5. Regional Quota Rebalancing](#5-regional-quota-rebalancing)

## 1. Identify and Delete Unused Deployments

**Step 1: Discovery with Quota Context**

Get quota limits first to understand how close you are to capacity:

```bash
# Check current quota usage vs. limits (run this first)
subId=$(az account show --query id -o tsv)
region="eastus"  # Change to your region
# az --query (JMESPath) has no arithmetic operators, so jq computes Available = Limit - Used
az rest --method get \
  --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
  --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o json \
  | jq -r '["Model","Used","Limit","Available"], (.[] | [.Model, .Used, .Limit, (.Limit - .Used)]) | @tsv'
```
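
The Used and Limit values from the quota check are enough to work out headroom and utilization for a single model. A minimal sketch, assuming the two numbers are copied from that output; the function name `quota_headroom` is hypothetical and not part of the Azure CLI:

```bash
# Hypothetical helper (not an Azure CLI command): print available quota and
# utilization percent from the Used and Limit values of one usage row.
# awk handles values such as 30.0 that the API may return as decimals.
quota_headroom() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) exit 1          # a zero or negative limit is not meaningful
    printf "%d %.1f\n", limit - used, used * 100 / limit
  }'
}

quota_headroom 80 100   # prints "20 80.0"
```

A utilization close to 100% suggests cleaning up deployments before creating new ones in that region.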

**Step 2: Parallel Deployment Enumeration**

List all deployments across resources efficiently:

```bash
# Get all Foundry resources
resources=$(az cognitiveservices account list --query "[?kind=='AIServices'].{name:name,rg:resourceGroup}" -o json)

# Parallel deployment enumeration (faster than sequential).
# The loop reads from process substitution so it runs in the current shell and `wait` sees its jobs.
while read -r name rg; do
  {
    out=$(az cognitiveservices account deployment list --name "$name" --resource-group "$rg" \
      --query "[].{Deployment:name,Model:properties.model.name,Capacity:sku.capacity,Created:systemData.createdAt}" -o table)
    # Print header and table together so output from parallel jobs does not interleave
    printf '=== %s (%s) ===\n%s\n' "$name" "$rg" "$out"
  } < /dev/null &
done < <(echo "$resources" | jq -r '.[] | "\(.name) \(.rg)"')
wait  # Wait for all background jobs to complete
```

**Step 3: Identify Stale Deployments**

Criteria for deletion candidates:

- **Test/temporary naming**: Deployment name contains "test", "demo", "temp", or "dev"
- **Old timestamps**: Created more than 90 days ago with timestamp-based naming (e.g., "gpt4-20231015")
- **High-capacity consumers**: Deployments with more than 100K TPM capacity that have not been referenced in recent logs
- **Duplicate models**: Multiple deployments of the same model and version in the same region

**Example pattern matching for stale deployments:**
```bash
# Find deployments with test/temp naming (JMESPath contains() is case-sensitive)
az cognitiveservices account deployment list --name <resource> --resource-group <rg> \
  --query "[?contains(name,'test') || contains(name,'demo') || contains(name,'temp') || contains(name,'dev')].{Name:name,Capacity:sku.capacity}" -o table
```
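
The naming filter above matches lowercase substrings only, so a deployment called `GPT4-Test` would be missed. A small sketch of a case-insensitive check that can be applied to names from the deployment listing; `looks_temporary` is a hypothetical helper, and the sample names are invented for illustration:

```bash
# Hypothetical helper: succeed if a deployment name looks like test/temporary naming.
# The name is lowercased first, so the match is case-insensitive.
looks_temporary() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *test*|*demo*|*temp*|*dev*) return 0 ;;
    *) return 1 ;;
  esac
}

# Invented sample names, for illustration only
for d in gpt4-Test-01 prod-chat DEMO-embeddings; do
  if looks_temporary "$d"; then
    echo "$d: deletion candidate"
  else
    echo "$d: keep"
  fi
done
```

Substring matching can produce false positives (for example, a name containing "device" matches "dev"), so treat the result as a list of candidates to review rather than a deletion list.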

**Step 4: Delete and Verify Quota Recovery**

```bash
# Delete unused deployment (its quota is released once the deletion completes)
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> --deployment-name <deployment>

# Verify quota freed (re-run the Step 1 quota check)
# You should see "Used" decrease by the deployment's capacity
```

**Cost Impact Analysis:**

| Deployment Type | Capacity | Quota Freed | Cost Impact (Standard, TPM) | Cost Impact (Provisioned, PTU) |
|-----------------|----------|-------------|-----------------------------|--------------------------------|
| Test deployment | 10K TPM | 10K TPM | $0 (pay-per-use) | N/A |
| Unused production | 100K TPM | 100K TPM | $0 (pay-per-use) | N/A |
| Abandoned PTU deployment | 100 PTU | ~40K TPM equivalent | N/A | **$3,650/month saved** (100 PTU × 730 h × $0.05 per PTU-hour) |
| High-capacity test | 450K TPM | 450K TPM | $0 (pay-per-use) | N/A |

**Key Insight:** For TPM (Standard) deployments, deletion frees quota but has no direct cost impact, because you pay per token used. For PTU (Provisioned) deployments, deletion **immediately stops hourly charges** and can save thousands per month. The $0.05 per PTU-hour figure in the table is an example rate; substitute the rate you are actually billed.
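
The PTU saving in the table is a straight multiplication of PTU count, hourly rate, and hours per month. A minimal sketch of that arithmetic, assuming a flat hourly rate per PTU; `ptu_monthly_cost` is a hypothetical helper and the $0.05 rate is the table's example figure, not a quoted price:

```bash
# Hypothetical helper: monthly cost of a provisioned (PTU) deployment.
# Arguments: PTU count, hourly rate per PTU in USD, hours per month (default 730).
ptu_monthly_cost() {
  awk -v ptu="$1" -v rate="$2" -v hours="${3:-730}" \
    'BEGIN { printf "%.2f\n", ptu * rate * hours }'
}

ptu_monthly_cost 100 0.05   # prints 3650.00, the figure used in the table
```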

---

## 2. Right-Size Over-Provisioned Deployments

**Identify over-provisioned deployments:**
- Check Azure Monitor metrics for actual token usage
- Compare allocated TPM vs. peak usage
- Look for deployments whose peak usage stays below 50% of allocated capacity

**Right-sizing example:**
```bash
# Update deployment to lower capacity.
# --sku-capacity is in units of 1,000 TPM: this reduces 50 (50K TPM) to 30 (30K TPM).
az cognitiveservices account deployment update --name <resource> --resource-group <rg> \
  --deployment-name <deployment> --sku-capacity 30
```

**Cost Optimization:**
- **TPM (Standard)**: Reduces regional quota consumption (no direct cost savings, because billing is pay-per-token)
- **PTU (Provisioned)**: Direct cost reduction (a 40% capacity reduction means a 40% cost reduction)
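
Right-sizing comes down to comparing peak usage with allocated capacity and leaving some headroom. A rough sketch of that calculation, assuming you have already read the peak usage from Azure Monitor; `suggest_capacity` and the 20% default headroom are illustrative assumptions, not an Azure recommendation:

```bash
# Hypothetical helper: suggest a capacity from observed peak usage plus headroom.
# Arguments: allocated capacity (> 0), peak usage (same unit), headroom percent (default 20).
# Prints the suggested capacity (rounded up) and the percent reduction from the allocation.
suggest_capacity() {
  awk -v alloc="$1" -v peak="$2" -v headroom="${3:-20}" 'BEGIN {
    target = peak * (100 + headroom) / 100
    suggested = (target == int(target)) ? target : int(target) + 1   # round up
    if (suggested > alloc) suggested = alloc                         # never suggest an increase
    printf "%d %.0f\n", suggested, (alloc - suggested) * 100 / alloc
  }'
}

suggest_capacity 50 25   # prints "30 40": 50K -> 30K TPM, a 40% reduction
```

For a PTU deployment the second number is also the approximate percentage cost reduction; for a Standard deployment it only describes how much quota is returned.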
95 
96---
97 
98## 3. Consolidate Multiple Small Deployments
99 
100**Pattern:** Multiple 10K TPM deployments → One 30-50K TPM deployment
101 
102**Benefits:**
103- Fewer deployment slots consumed
104- Simpler management
105- Same total capacity, better utilization
106 
107**Example:**
108- **Before**: 3 deployments @ 10K TPM each = 30K TPM total, 3 slots used
109- **After**: 1 deployment @ 30K TPM = 30K TPM total, 1 slot used
110- **Savings**: 2 deployment slots freed for other models
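
The before/after numbers in the example can be reproduced with a short calculation. A minimal sketch, assuming all capacities are whole numbers in the same unit (K TPM here); `consolidate` is a hypothetical helper:

```bash
# Hypothetical helper: total capacity and slots freed when merging deployments into one.
# Arguments: the capacity of each existing deployment, as integers in the same unit.
consolidate() {
  [ "$#" -ge 1 ] || return 1          # need at least one deployment
  local total=0 cap
  for cap in "$@"; do
    total=$(( total + cap ))
  done
  # One deployment remains after merging, so the rest of the slots are freed
  echo "$total $(( $# - 1 ))"
}

consolidate 10 10 10   # prints "30 2": one 30K TPM deployment, 2 slots freed
```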

---

## 4. Cost Optimization Strategies

> **Official Documentation**: [Plan to manage costs for Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs) and [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)

**A. Use Fine-Tuned Smaller Models** (from [Microsoft Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/openai/transparency-note)):

You can reduce costs or latency by replacing a more general-purpose model (e.g., GPT-4) with a fine-tuned version of a smaller, faster model (e.g., fine-tuned GPT-3.5-Turbo).

```bash
# Deploy fine-tuned GPT-3.5 Turbo as a cost-effective alternative to GPT-4
az cognitiveservices account deployment create --name <resource> --resource-group <rg> \
  --deployment-name gpt-35-tuned --model-name <your-fine-tuned-model> --model-version <model-version> \
  --model-format OpenAI --sku-name Standard --sku-capacity 10
```

**B. Remove Unused Fine-Tuned Deployments** (from [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)):

Fine-tuned model deployments incur **hourly hosting costs** even when not in use. Remove unused deployments promptly to control costs.

- Fine-tuned deployments that stay inactive for more than **15 consecutive days** are automatically deleted
- Delete unused fine-tuned deployments proactively rather than waiting for auto-deletion, because hourly charges accrue in the meantime

```bash
# Delete unused fine-tuned deployment
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> \
  --deployment-name <unused-fine-tuned-deployment>
```

**C. Batch Multiple Requests** (from [Cost optimization Q&A](https://learn.microsoft.com/en-us/answers/questions/1689253/how-to-optimize-costs-per-request-azure-openai-gpt)):

Batch multiple requests together to reduce the total number of API calls and lower overall costs.

**D. Use Commitment Tiers for Predictable Costs** (from [Managing costs guide](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs)):

- **Pay-as-you-go**: Bills according to usage (variable costs)
- **Commitment tiers**: Commit to using service features for a fixed fee (predictable costs, potential savings for consistent usage)

---

## 5. Regional Quota Rebalancing

If your quota is spread across several regions but you deploy in only some of them, compare usage per region:

```bash
# Check quota across regions
subId=$(az account show --query id -o tsv)  # Look up the subscription once, outside the loop
for region in eastus westus uksouth; do
  echo "=== $region ==="
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
    --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o table
done
```

**Optimization:** Concentrate deployments in fewer regions to maximize quota utilization per region.
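
To decide which regions to consolidate, it helps to reduce the per-region output to one utilization figure each. A small sketch, assuming you have summarized the results as `region used limit` rows; `region_utilization` is a hypothetical helper and the sample numbers are invented for illustration:

```bash
# Hypothetical helper: read "region used limit" rows on stdin and print each
# region's utilization, flagging regions whose quota is entirely unused.
# Rows with a zero limit are skipped.
region_utilization() {
  awk '$3 > 0 {
    pct = $2 * 100 / $3
    printf "%s %.0f%%%s\n", $1, pct, (pct == 0 ? " (idle)" : "")
  }'
}

# Invented sample data, for illustration only
printf '%s\n' "eastus 90 120" "westus 0 120" "uksouth 30 120" | region_utilization
```

Regions flagged as idle hold quota that no deployment is using, while lightly used regions are candidates for moving their deployments elsewhere.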