Build and deploy AI applications on Azure AI Foundry using Microsoft's model catalog and AI services
quota/references/optimization.md
# Quota Optimization Strategies

Comprehensive strategies for optimizing Azure AI Foundry quota allocation and reducing costs.

**Table of Contents:** [1. Identify and Delete Unused Deployments](#1-identify-and-delete-unused-deployments) · [2. Right-Size Over-Provisioned Deployments](#2-right-size-over-provisioned-deployments) · [3. Consolidate Multiple Small Deployments](#3-consolidate-multiple-small-deployments) · [4. Cost Optimization Strategies](#4-cost-optimization-strategies) · [5. Regional Quota Rebalancing](#5-regional-quota-rebalancing)

## 1. Identify and Delete Unused Deployments

**Step 1: Discovery with Quota Context**

Get quota limits FIRST to understand how close you are to capacity:

```bash
# Check current quota usage vs limits (run this FIRST)
subId=$(az account show --query id -o tsv)
region="eastus" # Change to your region
# The CLI's JMESPath --query has no arithmetic, so jq computes Available = Limit - Used
az rest --method get \
  --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
  --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o json \
  | jq -r '["Model","Used","Limit","Available"], (.[] | [.Model, .Used, .Limit, (.Limit - .Used)]) | @tsv'
```

**Step 2: Parallel Deployment Enumeration**

List all deployments across resources efficiently:

```bash
# Get all Foundry resources
resources=$(az cognitiveservices account list --query "[?kind=='AIServices'].{name:name,rg:resourceGroup}" -o json)

# Parallel deployment enumeration (faster than sequential).
# Process substitution keeps the loop in the current shell, so `wait` sees the background jobs.
while read -r name rg; do
  {
    # Capture the table first so each header prints together with its own output
    out=$(az cognitiveservices account deployment list --name "$name" --resource-group "$rg" \
      --query "[].{Deployment:name,Model:properties.model.name,Capacity:sku.capacity,Created:systemData.createdAt}" -o table)
    printf '=== %s (%s) ===\n%s\n' "$name" "$rg" "$out"
  } &
done < <(echo "$resources" | jq -r '.[] | "\(.name) \(.rg)"')
wait # Wait for all background jobs to complete
```

**Step 3: Identify Stale Deployments**

Criteria for deletion candidates:

- **Test/temporary naming**: Deployment name contains "test", "demo", "temp", or "dev"
- **Old timestamps**: Created >90 days ago with timestamp-based naming (e.g., "gpt4-20231015")
- **High capacity consumers**: Deployments with >100K TPM capacity that haven't been referenced in recent logs
- **Duplicate models**: Multiple deployments of the same model/version in the same region

**Example pattern matching for stale deployments:**
```bash
# Find deployments with test/demo/temp/dev naming
az cognitiveservices account deployment list --name <resource> --resource-group <rg> \
  --query "[?contains(name,'test') || contains(name,'demo') || contains(name,'temp') || contains(name,'dev')].{Name:name,Capacity:sku.capacity}" -o table
```

**Step 4: Delete and Verify Quota Recovery**

```bash
# Delete unused deployment (quota freed IMMEDIATELY)
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> --deployment-name <deployment>

# Verify quota freed (re-run Step 1 quota check)
# You should see "Used" decrease by the deployment's capacity
```

**Cost Impact Analysis:**

| Deployment Type | Capacity | Quota Freed | Cost Impact (TPM) | Cost Impact (PTU) |
|-----------------|----------|-------------|-------------------|-------------------|
| Test deployment | 10K TPM | 10K TPM | $0 (pay-per-use) | N/A |
| Unused production | 100K TPM | 100K TPM | $0 (pay-per-use) | N/A |
| Abandoned PTU deployment | 100 PTU | ~40K TPM equivalent | $0 TPM | **$3,650/month saved** (100 PTU × 730 h × an example rate of $0.05 per PTU-hour) |
| High-capacity test | 450K TPM | 450K TPM | $0 (pay-per-use) | N/A |

**Key Insight:** For TPM (Standard) deployments, deletion frees quota but has no direct cost impact (you pay per token used). For PTU (Provisioned) deployments, deletion **immediately stops hourly charges** and can save thousands per month.

---

## 2. Right-Size Over-Provisioned Deployments

**Identify over-provisioned deployments:**
- Check Azure Monitor metrics for actual token usage
- Compare allocated TPM vs.
peak usage
- Look for deployments with <50% utilization

**Right-sizing example:**
```bash
# Update deployment to lower capacity
az cognitiveservices account deployment update --name <resource> --resource-group <rg> \
  --deployment-name <deployment> --sku-capacity 30 # Reduce from 50K to 30K TPM
```

**Cost Optimization:**
- **TPM (Standard)**: Reduces regional quota consumption (no direct cost savings, pay-per-token)
- **PTU (Provisioned)**: Direct cost reduction (40% capacity reduction = 40% cost reduction)

---

## 3. Consolidate Multiple Small Deployments

**Pattern:** Multiple 10K TPM deployments → One 30-50K TPM deployment

**Benefits:**
- Fewer deployment slots consumed
- Simpler management
- Same total capacity, better utilization

**Example:**
- **Before**: 3 deployments @ 10K TPM each = 30K TPM total, 3 slots used
- **After**: 1 deployment @ 30K TPM = 30K TPM total, 1 slot used
- **Savings**: 2 deployment slots freed for other models

---

## 4. Cost Optimization Strategies

> **Official Documentation**: [Plan to manage costs for Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs) and [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)

**A.
Use Fine-Tuned Smaller Models** (from [Microsoft Transparency Note](https://learn.microsoft.com/en-us/azure/ai-foundry/responsible-ai/openai/transparency-note)):

You can reduce costs or latency by replacing a more general-purpose model (e.g., GPT-4) with a fine-tuned version of a smaller, faster model (e.g., fine-tuned GPT-3.5-Turbo).

```bash
# Deploy fine-tuned GPT-3.5 Turbo as cost-effective alternative to GPT-4
az cognitiveservices account deployment create --name <resource> --resource-group <rg> \
  --deployment-name gpt-35-tuned --model-name <your-fine-tuned-model> \
  --model-version <model-version> \
  --model-format OpenAI --sku-name Standard --sku-capacity 10
```

**B. Remove Unused Fine-Tuned Deployments** (from [Fine-tuning cost management](https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/fine-tuning-cost-management)):

Fine-tuned model deployments incur **hourly hosting costs** even when not in use. Remove unused deployments promptly to control costs.

- Deployments that remain inactive for **15 consecutive days** are automatically deleted
- Proactively delete unused fine-tuned deployments to avoid hourly charges

```bash
# Delete unused fine-tuned deployment
az cognitiveservices account deployment delete --name <resource> --resource-group <rg> \
  --deployment-name <unused-fine-tuned-deployment>
```

**C. Batch Multiple Requests** (from [Cost optimization Q&A](https://learn.microsoft.com/en-us/answers/questions/1689253/how-to-optimize-costs-per-request-azure-openai-gpt)):

Batch multiple requests together to reduce the total number of API calls and lower overall costs.

**D.
Use Commitment Tiers for Predictable Costs** (from [Managing costs guide](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/manage-costs)):

- **Pay-as-you-go**: Bills according to usage (variable costs)
- **Commitment tiers**: Commit to using service features for a fixed fee (predictable costs, potential savings for consistent usage)

---

## 5. Regional Quota Rebalancing

If you have quota spread across multiple regions but only use some:

```bash
# Check quota across regions
subId=$(az account show --query id -o tsv) # Look up the subscription once, outside the loop
for region in eastus westus uksouth; do
  echo "=== $region ==="
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$subId/providers/Microsoft.CognitiveServices/locations/$region/usages?api-version=2023-05-01" \
    --query "value[?contains(name.value,'OpenAI')].{Model:name.value, Used:currentValue, Limit:limit}" -o table
done
```

**Optimization:** Concentrate deployments in fewer regions to maximize quota utilization per region.
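
To shortlist regions for rebalancing, the per-region numbers can be reduced to a utilization percentage. The following is a minimal sketch, not part of the Azure CLI: `quota_utilization` is a hypothetical helper, the tab-separated input layout (`region`, `model`, `used`, `limit`) is an assumption, and the sample rows are made-up values rather than real quota data. The `LOW` threshold reuses the <50% utilization rule from section 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads tab-separated rows "region<TAB>model<TAB>used<TAB>limit"
# on stdin and prints used/limit, percent utilization, and a flag per row.
#   IDLE = quota allocated but nothing deployed against it
#   LOW  = under 50% utilization
#   OK   = 50% or more
quota_utilization() {
  awk -F'\t' '
    $4 > 0 {                      # skip rows with no quota limit
      pct  = 100 * $3 / $4
      flag = ($3 == 0) ? "IDLE" : (pct < 50 ? "LOW" : "OK")
      printf "%s\t%s\t%d/%d\t%.0f%%\t%s\n", $1, $2, $3, $4, pct, flag
    }
  '
}

# Example with made-up numbers (not real quota data)
printf 'eastus\tOpenAI.Standard.gpt-4o\t120\t150\nwestus\tOpenAI.Standard.gpt-4o\t0\t150\n' \
  | quota_utilization
```

Rows flagged `IDLE` or `LOW` are candidates for moving deployments out of that region. One way to produce the input, under the same assumptions, is to switch the loop above to `-o tsv` with a list projection and prefix each row with the region name.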