Source from repo

Microsoft Foundry Skill

Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services

Publisher: microsoft (official), source on GitHub

Files

152

Skill

n/a

Size

941.0 KB

Entrypoint

SKILL.md

Format

git-repo

quota/references/optimization.md

# Quota Optimization Strategies

Strategies for optimizing Azure AI Foundry quota allocation and reducing costs.

**Table of Contents:** [1. Identify and Delete Unused Deployments](#1-identify-and-delete-unused-deployments) · [2. Right-Size Over-Provisioned Deployments](#2-right-size-over-provisioned-deployments) · [3. Consolidate Multiple Small Deployments](#3-consolidate-multiple-small-deployments) · [4. Cost Optimization Strategies](#4-cost-optimization-strategies) · [5. Regional Quota Rebalancing](#5-regional-quota-rebalancing)

## 1. Identify and Delete Unused Deployments

**Step 1: Discovery with Quota Context**

Get quota limits FIRST to understand how close you are to capacity:

```bash
# Check current quota usage vs limits (run this FIRST)
subId=$(az account show --query id -o tsv)
region="eastus"  # Change to your region
# JMESPath has no arithmetic, so Available (limit - used) is computed with jq
az rest --method get \
  --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
  -o json |
  jq -r '["Model","Used","Limit","Available"],
         (.value[] | select(.name.value | contains("OpenAI"))
          | [.name.value, .currentValue, .limit, (.limit - .currentValue)]) | @tsv'
```
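
To make "how close you are to capacity" concrete, the sketch below turns used/limit pairs into a percentage and flags rows near the limit. `quota_pressure` is a hypothetical helper, the 80% default threshold is an assumption, and the sample rows use made-up numbers rather than real quota data.

```bash
# Hypothetical helper (not an Azure CLI feature): reads tab-separated
# "model<TAB>used<TAB>limit" rows on stdin and prints percent used,
# marking rows at or above a threshold (default 80%, an assumed value).
quota_pressure() {
  awk -F'\t' -v threshold="${1:-80}" '
    $3 > 0 {
      pct = 100 * $2 / $3
      flag = (pct >= threshold) ? "NEAR LIMIT" : "ok"
      printf "%s\t%.1f%%\t%s\n", $1, pct, flag
    }'
}

# Sample rows with made-up numbers
printf 'OpenAI.Standard.gpt-4o\t400\t450\nOpenAI.Standard.gpt-35-turbo\t120\t300\n' |
  quota_pressure 80
```

Feeding it real data means emitting name, used, and limit as tab-separated values, for example by selecting those three fields and using `-o tsv`.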

**Step 2: Parallel Deployment Enumeration**

List all deployments across resources efficiently:

```bash
# Get all Foundry resources
resources=$(az cognitiveservices account list --query "[?kind=='AIServices'].{name:name,rg:resourceGroup}" -o json)

# Parallel deployment enumeration (faster than sequential).
# Process substitution keeps the loop in the current shell so `wait` can see
# the background jobs; each job buffers its output so tables do not interleave.
while read -r name rg; do
  (
    out=$(az cognitiveservices account deployment list --name "$name" --resource-group "$rg" \
      --query "[].{Deployment:name,Model:properties.model.name,Capacity:sku.capacity,Created:systemData.createdAt}" -o table)
    printf '=== %s (%s) ===\n%s\n' "$name" "$rg" "$out"
  ) &
done < <(echo "$resources" | jq -r '.[] | "\(.name) \(.rg)"')
wait  # Wait for all background jobs to complete
```

**Step 3: Identify Stale Deployments**

Criteria for deletion candidates:

- **Test/temporary naming**: Deployment name contains "test", "demo", "temp", or "dev"
- **Old timestamps**: Created more than 90 days ago, with timestamp-based naming (e.g., "gpt4-20231015")
- **High capacity consumers**: Deployments with more than 100K TPM capacity that do not appear in recent logs
- **Duplicate models**: Multiple deployments of the same model and version in the same region

**Example pattern matching for stale deployments:**
```bash
# Find deployments with test/temp naming (JMESPath contains() is case-sensitive)
az cognitiveservices account deployment list --name <resource> --resource-group <rg> \
  --query "[?contains(name,'test') || contains(name,'demo') || contains(name,'temp') || contains(name,'dev')].{Name:name,Capacity:sku.capacity}" -o table
```
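
The naming and timestamp criteria can also be checked locally once deployment names are in hand. The sketch below is one possible approach: `is_stale_candidate` is a hypothetical helper, the cutoff date is supplied by the caller as `YYYYMMDD`, and a trailing 8-digit suffix is assumed to be a date stamp. It matches case-insensitively, which the JMESPath `contains()` filter does not.

```bash
# Hypothetical helper: succeeds (exit 0) when a deployment name looks stale.
# $1 = deployment name, $2 = cutoff date as YYYYMMDD (chosen by the caller).
is_stale_candidate() {
  _name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$_name" in
    *test*|*demo*|*temp*|*dev*) return 0 ;;
  esac
  # A trailing 8-digit suffix such as "gpt4-20231015" is read as a date stamp
  _stamp=$(printf '%s' "$_name" | sed -n 's/.*-\([0-9]\{8\}\)$/\1/p')
  [ -n "$_stamp" ] && [ "$_stamp" -lt "$2" ]
}

# Made-up deployment names for illustration
for d in gpt4-Test-eu gpt4-20231015 gpt-4o-prod; do
  if is_stale_candidate "$d" 20240101; then
    echo "$d: candidate"
  else
    echo "$d: keep"
  fi
done
```

A match only nominates a deployment for review; confirm against usage logs before deleting anything.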

**Step 4: Delete and Verify Quota Recovery**

```bash
# Delete unused deployment (quota freed IMMEDIATELY)
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> --deployment-name <deployment>

# Verify quota freed (re-run Step 1 quota check)
# You should see "Used" decrease by the deployment's capacity
```

**Cost Impact Analysis:**

| Deployment Type | Capacity | Quota Freed | Cost Impact (TPM) | Cost Impact (PTU) |
|-----------------|----------|-------------|-------------------|-------------------|
| Test deployment | 10K TPM | 10K TPM | $0 (pay-per-use) | N/A |
| Unused production | 100K TPM | 100K TPM | $0 (pay-per-use) | N/A |
| Abandoned PTU deployment | 100 PTU | 100 PTU (PTU quota is tracked separately from TPM quota) | $0 | **$3,650/month saved** (100 PTU × 730 h × $0.05 per PTU-hour; illustrative rate, check current pricing) |
| High-capacity test | 450K TPM | 450K TPM | $0 (pay-per-use) | N/A |

**Key Insight:** For TPM (Standard) deployments, deletion frees quota but has no direct cost impact, because you pay per token used. For PTU (Provisioned) deployments, deletion **immediately stops hourly charges** and can save thousands per month.
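
The PTU figure in the table is plain multiplication, sketched below so other deployment sizes can be plugged in. `ptu_monthly_cost` is a hypothetical helper, and the $0.05 hourly rate is the table's illustrative value, not a quoted Azure price.

```bash
# Hypothetical helper: monthly cost = PTUs x hours per month x hourly rate per PTU.
# $1 = PTU count, $2 = hours per month, $3 = hourly rate per PTU (placeholder).
ptu_monthly_cost() {
  awk -v ptu="$1" -v hours="$2" -v rate="$3" \
    'BEGIN { printf "%.2f\n", ptu * hours * rate }'
}

ptu_monthly_cost 100 730 0.05   # prints 3650.00
```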

---

## 2. Right-Size Over-Provisioned Deployments

**Identify over-provisioned deployments:**
- Check Azure Monitor metrics for actual token usage
- Compare allocated TPM against peak usage
- Look for deployments whose peak usage stays below 50% of allocated capacity
**Right-sizing example:**
```bash
# Update deployment to lower capacity (--sku-capacity is in units of 1K TPM).
# Run `az cognitiveservices account deployment --help` to confirm your CLI version includes `update`.
az cognitiveservices account deployment update --name <resource> --resource-group <rg> \
  --deployment-name <deployment> --sku-capacity 30  # Reduce from 50K to 30K TPM
```

**Cost Optimization:**
- **TPM (Standard)**: Reduces regional quota consumption (no direct cost savings, since billing is per token)
- **PTU (Provisioned)**: Direct cost reduction (a 40% capacity reduction gives a 40% cost reduction)

---

## 3. Consolidate Multiple Small Deployments

**Pattern:** Multiple 10K TPM deployments → One 30-50K TPM deployment

**Benefits:**
- Fewer deployment slots consumed
- Simpler management
- Same total capacity, better utilization

**Example:**
- **Before**: 3 deployments @ 10K TPM each = 30K TPM total, 3 slots used
- **After**: 1 deployment @ 30K TPM = 30K TPM total, 1 slot used
- **Savings**: 2 deployment slots freed for other models
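
The before/after arithmetic generalizes to any set of small deployments of the same model. In the sketch below, `consolidation_summary` is a hypothetical helper and the deployment names are made up; capacities are in units of 1K TPM.

```bash
# Hypothetical helper: reads "deployment capacity" rows (capacity in K TPM) and
# prints the combined capacity and the slots freed by merging into one deployment.
consolidation_summary() {
  awk '{ total += $2; n++ }
       END {
         if (n > 0)
           printf "deployments=%d total_capacity=%dK slots_freed=%d\n", n, total, n - 1
       }'
}

# Made-up deployments matching the example above
printf 'chat-a 10\nchat-b 10\nchat-c 10\n' | consolidation_summary
# prints: deployments=3 total_capacity=30K slots_freed=2
```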

---

## 4. Cost Optimization Strategies

> **Official Documentation**: [Plan to manage costs for Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs) and [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)

**A. Use Fine-Tuned Smaller Models** (from [Microsoft Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/openai/transparency-note)):

You can reduce costs or latency by replacing a more general-purpose model (e.g., GPT-4) with a fine-tuned version of a smaller, faster model (e.g., fine-tuned GPT-3.5-Turbo).

```bash
# Deploy fine-tuned GPT-3.5 Turbo as cost-effective alternative to GPT-4
az cognitiveservices account deployment create --name <resource> --resource-group <rg> \
  --deployment-name gpt-35-tuned --model-name <your-fine-tuned-model> \
  --model-format OpenAI --sku-name Standard --sku-capacity 10
```

**B. Remove Unused Fine-Tuned Deployments** (from [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)):

Fine-tuned model deployments incur **hourly hosting costs** even when not in use. Remove unused deployments promptly to control costs.

- Fine-tuned deployments that receive no calls for **15 consecutive days** are automatically deleted
- Delete unused fine-tuned deployments yourself rather than waiting, since hourly charges accrue until the deployment is removed

```bash
# Delete unused fine-tuned deployment
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> \
  --deployment-name <unused-fine-tuned-deployment>
```

**C. Batch Multiple Requests** (from [Cost optimization Q&A](https://learn.microsoft.com/en-us/answers/questions/1689253/how-to-optimize-costs-per-request-azure-openai-gpt)):

Where the workload allows, combine several inputs into one request to reduce the total number of API calls. Because Standard billing is per token, the savings come mainly from not repeating shared prompt text (such as a common system prompt) in every call.
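
A quick way to size the effect is to count calls and the shared prompt tokens that are no longer repeated. Both helpers below are hypothetical, the figures are made up, and the estimate assumes your application can pack several inputs into one request behind a single shared prompt.

```bash
# Hypothetical helper: API calls needed to send $1 inputs when each request
# carries up to $2 inputs (ceiling division).
calls_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helper: shared prompt tokens no longer repeated.
# $1 = inputs, $2 = inputs per request, $3 = tokens in the shared prompt.
shared_prompt_tokens_saved() {
  _calls=$(calls_needed "$1" "$2")
  echo $(( ($1 - _calls) * $3 ))
}

calls_needed 250 1                    # prints 250 (one input per call)
calls_needed 250 20                   # prints 13
shared_prompt_tokens_saved 250 20 300 # prints 71100
```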

**D. Use Commitment Tiers for Predictable Costs** (from [Managing costs guide](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs)):

- **Pay-as-you-go**: Bills according to usage (variable costs)
- **Commitment tiers**: Commit to using service features for a fixed fee (predictable costs, potential savings for consistent usage)
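
Whether a commitment pays off depends on where expected usage sits relative to the break-even point. The sketch below is a simplified comparison: `cheaper_plan` is a hypothetical helper, all prices are placeholders, and it ignores overage charges beyond the committed amount.

```bash
# Hypothetical helper: $1 = expected monthly usage units, $2 = pay-as-you-go
# price per unit, $3 = fixed monthly commitment fee. All values are placeholders.
cheaper_plan() {
  awk -v usage="$1" -v unit="$2" -v fee="$3" 'BEGIN {
    payg = usage * unit
    choice = (payg > fee) ? "commitment" : "pay-as-you-go"
    printf "payg=%.2f commitment=%.2f breakeven_units=%.0f cheaper=%s\n", payg, fee, fee / unit, choice
  }'
}

cheaper_plan 1000 2.00 1500   # usage above break-even
cheaper_plan 500 2.00 1500    # usage below break-even
```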

---

## 5. Regional Quota Rebalancing

If you have quota spread across multiple regions but only use some of them:

```bash
# Check quota across regions (look up the subscription ID once, outside the loop)
subId=$(az account show --query id -o tsv)
for region in eastus westus uksouth; do
  echo "=== $region ==="
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
    --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o table
done
```

**Optimization:** Concentrate deployments in fewer regions to maximize quota utilization per region.
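
Once per-region figures are collected, idle regions are easy to spot mechanically. In the sketch below, `idle_regions` is a hypothetical helper and the sample rows are made-up numbers; it sums rows per region and reports regions that hold quota but use none of it.

```bash
# Hypothetical helper: reads "region used limit" rows on stdin, lists regions
# with quota but zero usage, then prints the total unused quota overall.
idle_regions() {
  awk '{ used[$1] += $2; limit[$1] += $3 }
       END {
         for (r in limit) {
           unused += limit[r] - used[r]
           if (limit[r] > 0 && used[r] == 0) print "idle: " r
         }
         print "total_unused=" unused
       }'
}

# Made-up per-region totals
printf 'eastus 300 450\nwestus 50 450\nuksouth 0 300\n' | idle_regions
```

With several idle regions the order of the `idle:` lines is not guaranteed, so pipe through `sort` if stable output matters.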
