quota/references/optimization.md
# Quota Optimization Strategies

Comprehensive strategies for optimizing Azure AI Foundry quota allocation and reducing costs.

**Table of Contents:** [1. Identify and Delete Unused Deployments](#1-identify-and-delete-unused-deployments) · [2. Right-Size Over-Provisioned Deployments](#2-right-size-over-provisioned-deployments) · [3. Consolidate Multiple Small Deployments](#3-consolidate-multiple-small-deployments) · [4. Cost Optimization Strategies](#4-cost-optimization-strategies) · [5. Regional Quota Rebalancing](#5-regional-quota-rebalancing)

## 1. Identify and Delete Unused Deployments

**Step 1: Discovery with Quota Context**

Get quota limits FIRST to understand how close you are to capacity:

```bash
# Check current quota usage vs limits (run this FIRST)
subId=$(az account show --query id -o tsv)
region="eastus" # Change to your region
# Available quota = Limit - Used. JMESPath has no arithmetic operators,
# so subtract the two columns yourself after the query returns.
az rest --method get \
  --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
  --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o table
```

**Step 2: Parallel Deployment Enumeration**

List all deployments across resources efficiently:

```bash
# Get all Foundry resources
resources=$(az cognitiveservices account list --query "[?kind=='AIServices'].{name:name,rg:resourceGroup}" -o json)

# Parallel deployment enumeration (faster than sequential).
# The loop reads from process substitution rather than a pipe so that it runs in
# the current shell; otherwise `wait` would not see the background jobs.
# Output from different resources may interleave.
while read -r name rg; do
  {
    echo "=== $name ($rg) ==="
    az cognitiveservices account deployment list --name "$name" --resource-group "$rg" \
      --query "[].{Deployment:name,Model:properties.model.name,Capacity:sku.capacity,Created:systemData.createdAt}" -o table
  } </dev/null &
done < <(echo "$resources" | jq -r '.[] | "\(.name) \(.rg)"')
wait # Wait for all background jobs to complete
```

**Step 3: Identify Stale Deployments**

Criteria for deletion candidates:

- **Test/temporary naming**: Deployment name contains "test", "demo", "temp", or "dev"
- **Old timestamps**: Created >90 days ago with timestamp-based naming (e.g., "gpt4-20231015")
- **High capacity consumers**: Deployments with >100K TPM capacity that haven't been referenced in recent logs
- **Duplicate models**: Multiple deployments of the same model/version in the same region

**Example pattern matching for stale deployments:**
```bash
# Find deployments with test/demo/temp/dev naming (contains() is case-sensitive)
az cognitiveservices account deployment list --name <resource> --resource-group <rg> \
  --query "[?contains(name,'test') || contains(name,'demo') || contains(name,'temp') || contains(name,'dev')].{Name:name,Capacity:sku.capacity}" -o table
```

**Step 4: Delete and Verify Quota Recovery**

```bash
# Delete unused deployment (quota freed IMMEDIATELY)
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> --deployment-name <deployment>

# Verify quota freed (re-run Step 1 quota check)
# You should see "Used" decrease by the deployment's capacity
```

**Cost Impact Analysis:**

| Deployment Type | Capacity | Quota Freed | Cost Impact (TPM) | Cost Impact (PTU) |
|-----------------|----------|-------------|-------------------|-------------------|
| Test deployment | 10K TPM | 10K TPM | $0 (pay-per-use) | N/A |
| Unused production | 100K TPM | 100K TPM | $0 (pay-per-use) | N/A |
| Abandoned PTU deployment | 100 PTU | ~40K TPM equivalent | $0 TPM | **$3,650/month saved** (100 PTU × 730 h × $0.05 per PTU-hour, as assumed in this example) |
| High-capacity test | 450K TPM | 450K TPM | $0 (pay-per-use) | N/A |

**Key Insight:** For TPM (Standard) deployments, deletion frees quota but has no direct cost impact (you pay per token used). For PTU (Provisioned) deployments, deletion **immediately stops hourly charges** and can save thousands per month.

---

## 2. Right-Size Over-Provisioned Deployments

**Identify over-provisioned deployments:**
- Check Azure Monitor metrics for actual token usage
- Compare allocated TPM vs. peak usage
- Look for deployments with <50% utilization

**Right-sizing example:**
```bash
# Update deployment to lower capacity
az cognitiveservices account deployment update --name <resource> --resource-group <rg> \
  --deployment-name <deployment> --sku-capacity 30 # Reduce from 50K to 30K TPM
```

**Cost Optimization:**
- **TPM (Standard)**: Reduces regional quota consumption (no direct cost savings, pay-per-token)
- **PTU (Provisioned)**: Direct cost reduction (40% capacity reduction = 40% cost reduction)

---

## 3. Consolidate Multiple Small Deployments

**Pattern:** Multiple 10K TPM deployments → One 30-50K TPM deployment

**Benefits:**
- Fewer deployment slots consumed
- Simpler management
- Same total capacity, better utilization

**Example:**
- **Before**: 3 deployments @ 10K TPM each = 30K TPM total, 3 slots used
- **After**: 1 deployment @ 30K TPM = 30K TPM total, 1 slot used
- **Savings**: 2 deployment slots freed for other models

---

## 4. Cost Optimization Strategies

> **Official Documentation**: [Plan to manage costs for Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs) and [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)

**A. Use Fine-Tuned Smaller Models** (from [Microsoft Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/openai/transparency-note)):

You can reduce costs or latency by swapping a fine-tuned version of a smaller/faster model (e.g., fine-tuned GPT-3.5-Turbo) for a more general-purpose model (e.g., GPT-4).

```bash
# Deploy fine-tuned GPT-3.5 Turbo as cost-effective alternative to GPT-4
az cognitiveservices account deployment create --name <resource> --resource-group <rg> \
  --deployment-name gpt-35-tuned --model-name <your-fine-tuned-model> \
  --model-format OpenAI --sku-name Standard --sku-capacity 10
```

**B. Remove Unused Fine-Tuned Deployments** (from [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)):

Fine-tuned model deployments incur **hourly hosting costs** even when not in use. Remove unused deployments promptly to control costs.

- Inactive deployments unused for **15 consecutive days** are automatically deleted
- Proactively delete unused fine-tuned deployments to avoid hourly charges

```bash
# Delete unused fine-tuned deployment
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> \
  --deployment-name <unused-fine-tuned-deployment>
```

**C. Batch Multiple Requests** (from [Cost optimization Q&A](https://learn.microsoft.com/en-us/answers/questions/1689253/how-to-optimize-costs-per-request-azure-openai-gpt)):

Batch multiple requests together to reduce the total number of API calls and lower overall costs.

**D. Use Commitment Tiers for Predictable Costs** (from [Managing costs guide](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs)):

- **Pay-as-you-go**: Bills according to usage (variable costs)
- **Commitment tiers**: Commit to using service features for a fixed fee (predictable costs, potential savings for consistent usage)

---

## 5. Regional Quota Rebalancing

If you have quota spread across multiple regions but only use some:

```bash
# Check quota across regions
subId=$(az account show --query id -o tsv) # Look up the subscription once, outside the loop
for region in eastus westus uksouth; do
  echo "=== $region ==="
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
    --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o table
done
```

**Optimization:** Concentrate deployments in fewer regions to maximize quota utilization per region.
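
To compare regions, it helps to see available quota and utilization next to the raw Used and Limit numbers. The following is a minimal sketch, not part of the original guide: `summarize_quota` is a hypothetical helper name, and it assumes the usage rows arrive as tab-separated `name`, `used`, `limit` columns (for example, from the same `az rest` query with a list projection and `-o tsv`).

```bash
#!/usr/bin/env bash
# summarize_quota (hypothetical helper): reads tab-separated rows of
#   <name> <used> <limit>
# on stdin and prints name, used, limit, available (limit - used) and
# utilization percent. Rows with a zero limit are skipped.
summarize_quota() {
  awk -F'\t' '
    $3 > 0 {
      printf "%s\t%d\t%d\t%d\t%.0f%%\n", $1, $2, $3, $3 - $2, 100 * $2 / $3
    }'
}

# Assumed usage with the quota query from this guide (requires az login):
#   az rest --method get --url "<usages URL for the region>" \
#     --query "value[?contains(name.value,'OpenAI')].[name.value,currentValue,limit]" -o tsv \
#     | summarize_quota
```

Regions where most rows show low utilization are candidates for moving deployments elsewhere, which matches the consolidation advice above.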