Configure Azure API Management as an AI Gateway with caching, token limits, and content safety
`references/policies.md`
# AI Gateway Policies

Complete reference for Azure API Management AI governance policies.

---

## Policy Placement Order

Recommended order in the `<inbound>` section:

```
1. Authentication (managed identity)
2. Semantic Cache Lookup
3. Token Rate Limiting
4. Content Safety
5. Backend Selection / Load Balancing
6. Token Metrics
```

---

## Model Policies

### Token Rate Limiting

Control costs by limiting token consumption per minute.

```xml
<azure-openai-token-limit
    tokens-per-minute="50000"
    counter-key="@(context.Subscription.Id)"
    estimate-prompt-tokens="true"
    tokens-consumed-header-name="x-tokens-consumed"
    remaining-tokens-header-name="x-tokens-remaining" />
```

| Attribute | Purpose | Default |
|-----------|---------|---------|
| `tokens-per-minute` | Max tokens per counter window | Required |
| `counter-key` | Grouping key (subscription, IP, custom) | Required |
| `estimate-prompt-tokens` | Count estimated prompt tokens toward the limit | Required |
| `tokens-consumed-header-name` | Response header with consumed count | — |
| `remaining-tokens-header-name` | Response header with remaining count | — |

**Usage tiers example:**

```xml
<!-- Free tier: 5K TPM -->
<azure-openai-token-limit tokens-per-minute="5000"
    counter-key="@(&quot;free-&quot; + context.Subscription.Id)"
    estimate-prompt-tokens="true" />

<!-- Premium tier: 100K TPM -->
<azure-openai-token-limit tokens-per-minute="100000"
    counter-key="@(&quot;premium-&quot; + context.Subscription.Id)"
    estimate-prompt-tokens="true" />
```

---

### Semantic Caching

Cache AI responses for semantically similar prompts. Saves 60-80% on repeated queries.

**Lookup** (in `<inbound>`):

```xml
<azure-openai-semantic-cache-lookup
    score-threshold="0.8"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned" />
```

**Store** (in `<outbound>`):

```xml
<azure-openai-semantic-cache-store duration="3600" />
```

| Attribute | Purpose | Recommended |
|-----------|---------|-------------|
| `score-threshold` | Similarity threshold (0-1) | 0.8 (lower = more cache hits) |
| `embeddings-backend-id` | Backend for embedding generation | Required |
| `embeddings-backend-auth` | Auth to embeddings backend | `system-assigned` |
| `duration` | Cache TTL in seconds | 3600 (1 hour) |

**Prerequisites:**

- An embeddings model deployed (e.g., `text-embedding-ada-002`)
- A separate backend pointing to the embeddings endpoint
- Azure Cache for Redis Enterprise with the RediSearch module (for vector storage)

```bash
# Create embeddings backend
az apim backend create --service-name <apim> --resource-group <rg> \
  --backend-id embeddings-backend --protocol http \
  --url "https://<aoai>.openai.azure.com/openai"
```

> **Note**: Semantic caching is NOT compatible with streaming responses (`"stream": true`).

---

### Token Metrics

Emit token usage metrics for monitoring and chargeback.

```xml
<azure-openai-emit-token-metric namespace="ai-gateway">
    <dimension name="Subscription" value="@(context.Subscription.Id)" />
    <dimension name="API" value="@(context.Api.Name)" />
    <dimension name="Model" value="@(context.Request.Headers.GetValueOrDefault(&quot;x-model&quot;, &quot;unknown&quot;))" />
    <dimension name="Operation" value="@(context.Operation.Id)" />
</azure-openai-emit-token-metric>
```

Emits to Azure Monitor with these metrics:

- `Total Tokens` — prompt + completion combined
- `Prompt Tokens` — input tokens
- `Completion Tokens` — output tokens

**Query token usage (KQL):**

```kql
customMetrics
| where name == "Total Tokens"
| extend Subscription = tostring(customDimensions["Subscription"])
| summarize TotalTokens = sum(value) by Subscription, bin(timestamp, 1h)
| order by TotalTokens desc
```

---

## Agent Policies

### Content Safety

Filter harmful, violent, or inappropriate content from AI inputs and outputs.

```xml
<!-- In <inbound> -->
<llm-content-safety backend-id="contentsafety-backend">
    <categories output-type="FourSeverityLevels">
        <category name="Hate" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
```

| Category | Description | Threshold Range |
|----------|-------------|-----------------|
| Hate | Discrimination, slurs | 0 (block all) - 6 (allow most) |
| Sexual | Explicit content | 0-6 |
| SelfHarm | Self-injury content | 0-6 |
| Violence | Violent content | 0-6 |

**Prerequisites:**

- Azure AI Content Safety resource deployed
- Backend configured for the Content Safety endpoint:

```bash
az apim backend create --service-name <apim> --resource-group <rg> \
  --backend-id contentsafety-backend --protocol http \
  --url "https://<contentsafety>.cognitiveservices.azure.com"
```

---

### Jailbreak Detection

Block prompt injection attacks that attempt to bypass AI safety guardrails.

```xml
<!-- shield-prompt="true" enables Prompt Shields jailbreak detection on inputs -->
<llm-content-safety backend-id="contentsafety-backend" shield-prompt="true">
    <categories>
        <category name="Hate" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
```

Custom response for blocked content:

```xml
<on-error>
    <base />
    <choose>
        <when condition="@(context.LastError.Source == &quot;llm-content-safety&quot;)">
            <return-response>
                <set-status code="400" reason="Content Filtered" />
                <set-body>{"error": "Request blocked by content safety policy"}</set-body>
            </return-response>
        </when>
    </choose>
</on-error>
```

---

## Tool Policies

### Request Rate Limiting

Protect MCP tools and API endpoints from excessive requests.

```xml
<!-- Per-agent rate limiting -->
<rate-limit-by-key calls="30" renewal-period="60"
    counter-key="@(context.Request.Headers.GetValueOrDefault(&quot;X-Agent-Id&quot;, &quot;anonymous&quot;))"
    remaining-calls-header-name="x-ratelimit-remaining"
    retry-after-header-name="Retry-After" />
```

```xml
<!-- Per-subscription rate limiting -->
<rate-limit-by-key calls="100" renewal-period="60"
    counter-key="@(context.Subscription.Id)" />
```

---

## Combining Policies

Complete policy example with all governance layers:

```xml
<policies>
    <inbound>
        <base />

        <!-- 1. Authentication -->
        <authentication-managed-identity resource="https://cognitiveservices.azure.com" />

        <!-- 2. Semantic Cache Lookup -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />

        <!-- 3. Token Rate Limiting -->
        <azure-openai-token-limit
            tokens-per-minute="50000"
            counter-key="@(context.Subscription.Id)"
            estimate-prompt-tokens="true" />

        <!-- 4. Content Safety -->
        <llm-content-safety backend-id="contentsafety-backend">
            <categories>
                <category name="Hate" threshold="4" />
                <category name="Sexual" threshold="4" />
                <category name="SelfHarm" threshold="4" />
                <category name="Violence" threshold="4" />
            </categories>
        </llm-content-safety>

        <!-- 5. Backend Selection -->
        <set-backend-service backend-id="openai-backend" />

        <!-- 6. Token Metrics -->
        <azure-openai-emit-token-metric namespace="ai-gateway">
            <dimension name="Subscription" value="@(context.Subscription.Id)" />
            <dimension name="API" value="@(context.Api.Name)" />
        </azure-openai-emit-token-metric>
    </inbound>

    <backend>
        <forward-request timeout="120" />
    </backend>

    <outbound>
        <base />
        <!-- Cache store (after successful response) -->
        <azure-openai-semantic-cache-store duration="3600" />
    </outbound>

    <on-error>
        <base />
        <choose>
            <when condition="@(context.LastError.Source == &quot;azure-openai-token-limit&quot;)">
                <return-response>
                    <set-status code="429" reason="Token Limit Exceeded" />
                    <set-header name="Retry-After" exists-action="override">
                        <value>60</value>
                    </set-header>
                    <set-body>{"error": "Token rate limit exceeded. Try again later."}</set-body>
                </return-response>
            </when>
        </choose>
    </on-error>
</policies>
```

---

## Policy Quick-Decision Table

| Need | Policy | Section |
|------|--------|---------|
| Control token spend | `azure-openai-token-limit` | `<inbound>` |
| Cache similar prompts | `azure-openai-semantic-cache-lookup/store` | `<inbound>` / `<outbound>` |
| Track token usage | `azure-openai-emit-token-metric` | `<inbound>` |
| Block harmful content | `llm-content-safety` | `<inbound>` |
| Rate limit API calls | `rate-limit-by-key` | `<inbound>` |
| Authenticate to backend | `authentication-managed-identity` | `<inbound>` |
| Load balance backends | `set-backend-service` + retry | `<inbound>` |

---

## References

- [GenAI Gateway Capabilities](https://learn.microsoft.com/azure/api-management/genai-gateway-capabilities)
- [APIM Policy Reference](https://learn.microsoft.com/azure/api-management/api-management-policies)
- [AI-Gateway Samples](https://github.com/Azure-Samples/AI-Gateway)
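The placement order and quick-decision table mention load balancing via `set-backend-service` plus retry, but no example appears in the document. A minimal sketch of the retry half, assuming `openai-backend` resolves to a load-balanced backend pool created outside the policy (e.g., in Bicep or the portal):

```xml
<backend>
    <!-- Retry on throttling or server errors; buffer the body so it can be re-sent -->
    <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500))"
           count="2" interval="1" first-fast-retry="true">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

The intent of pairing retry with a backend pool is that a retried request can land on a different pool member, so a throttled deployment is skipped on the next attempt.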
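`authentication-managed-identity` only attaches a token; the gateway's identity also needs an RBAC grant on the Azure OpenAI resource. A sketch of that grant, reusing the placeholder names from the CLI snippets above and assuming the built-in `Cognitive Services OpenAI User` role fits your scenario:

```bash
# Principal ID of the APIM instance's system-assigned identity
principal_id=$(az apim show --name <apim> --resource-group <rg> \
  --query identity.principalId -o tsv)

# Grant it access to the Azure OpenAI account
az role assignment create --assignee "$principal_id" \
  --role "Cognitive Services OpenAI User" \
  --scope "$(az cognitiveservices account show --name <aoai> \
    --resource-group <rg> --query id -o tsv)"
```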