cost-optimization/azure-aks-anomalies.md
# AKS Cost Anomaly Investigation

Investigate user-reported cost or utilization spikes by correlating Azure Monitor metrics, scaling events, and Cost Management data.

## Step 1 - Confirm Timeframe

Ask the user: "When did you notice the spike? (e.g., 'last Tuesday', 'between 2 AM and 4 AM yesterday')"

## Step 2 - Pull Cost Data

```bash
az rest --method post \
  --url "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CostManagement/query?api-version=2023-11-01" \
  --headers "ClientType=GitHubCopilotForAzure" \
  --body '{
    "type": "ActualCost",
    "timeframe": "Custom",
    "timePeriod": { "from": "<start-date>", "to": "<end-date>" },
    "dataset": {
      "granularity": "Daily",
      "aggregation": { "totalCost": { "name": "Cost", "function": "Sum" } },
      "grouping": [{ "type": "Dimension", "name": "ResourceId" }]
    }
  }'
```

## Step 3 - Pull Node Count and Scaling Events

```bash
# First, verify available metrics on your AKS resource
az monitor metrics list-definitions \
  --resource "<aks-resource-id>" \
  --output table

# Node count over the anomaly window (use metric name from list-definitions output)
az monitor metrics list \
  --resource "<aks-resource-id>" \
  --metrics "<verified-node-count-metric>" \
  --interval PT5M --aggregation Count \
  --start-time "<start-date>" --end-time "<end-date>"

# HPA scaling events
kubectl get events --all-namespaces \
  --field-selector reason=SuccessfulRescale \
  --sort-by='.lastTimestamp'
```

## Step 4 - Top Consumers

```bash
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu
```

## Common Causes

| Symptom | Likely Cause | Action |
|---------|-------------|--------|
| Node count surged off-peak | HPA/VPA misconfiguration | Review HPA min replicas |
| Single pod consuming all CPU | Memory leak or runaway process | Check logs, add resource limits |
| Cost spike on specific day | Batch job ran unexpectedly | Review CronJob schedule |
| Persistent high node count | CAS scale-down blocked | Check PodDisruptionBudgets, system pods |
| Sudden namespace cost jump | New deployment with no resource limits | Add requests/limits |

## Set Up Budget Alert

```bash
az consumption budget create \
  --budget-name "aks-monthly-budget" \
  --amount <budget-amount> \
  --time-grain Monthly \
  --start-date "<YYYY-MM-01>" \
  --end-date "<YYYY-MM-01>" \
  --resource-group "<resource-group>" \
  --threshold 80 \
  --contact-emails "<contact-email>"
```
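
Before correlating with scaling events, it can help to narrow the daily cost rows from Step 2 down to the days that actually stand out. The following is a minimal sketch, not part of the original skill: `flag_cost_spikes` is a hypothetical helper, and it assumes the query response has already been flattened into tab-separated `date<TAB>cost` lines, one per resource per day. The flattening step is left out because the column order should be checked against the column list in the response. The default factor of 1.5 is an arbitrary illustrative threshold.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not an Azure CLI command): flag days whose total cost
# exceeds FACTOR times the mean daily cost over the window.
# Input (stdin): tab-separated lines "date<TAB>cost", one per resource per day.
# Output: "date<TAB>day_total<TAB>mean_daily_cost" for each flagged day, sorted by date.
flag_cost_spikes() {
  local factor="${1:-1.5}"   # assumed default; tune for your workload
  awk -F'\t' -v factor="$factor" '
    NF >= 2 { total[$1] += $2 }            # sum cost per day across resources
    END {
      n = 0; sum = 0
      for (d in total) { n++; sum += total[d] }
      if (n == 0) exit                     # no input, nothing to report
      mean = sum / n
      for (d in total)
        if (total[d] > factor * mean)
          printf "%s\t%.2f\t%.2f\n", d, total[d], mean
    }
  ' | sort
}

# Example with made-up numbers: 2024-05-03 totals 60 against a mean of 22.5
printf '2024-05-01\t10\n2024-05-02\t12\n2024-05-03\t50\n2024-05-03\t10\n2024-05-04\t8\n' \
  | flag_cost_spikes 1.5
```

A mean-based threshold is easy to reason about but is pulled upward by the spike itself, so for short windows a lower factor or a median-based baseline may work better. The flagged dates can then be used as the `--start-time` and `--end-time` window in Step 3.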