Configure Azure API Management as an AI Gateway with caching, token limits, and content safety
`references/policies.md`
# AI Gateway Policies

Complete reference for Azure API Management AI governance policies.

---

## Policy Placement Order

Recommended order in the `<inbound>` section:

```
1. Authentication (managed identity)
2. Semantic Cache Lookup
3. Token Rate Limiting
4. Content Safety
5. Backend Selection / Load Balancing
6. Token Metrics
```

---

## Model Policies

### Token Rate Limiting

Control costs by limiting token consumption per minute.

```xml
<azure-openai-token-limit
    tokens-per-minute="50000"
    counter-key="@(context.Subscription.Id)"
    estimate-prompt-tokens="true"
    tokens-consumed-header-name="x-tokens-consumed"
    remaining-tokens-header-name="x-tokens-remaining" />
```

| Attribute | Purpose | Default |
|-----------|---------|---------|
| `tokens-per-minute` | Max tokens per counter window | Required |
| `counter-key` | Grouping key (subscription, IP, custom) | Required |
| `estimate-prompt-tokens` | Count estimated prompt tokens toward the limit | Required |
| `tokens-consumed-header-name` | Response header with consumed count | — |
| `remaining-tokens-header-name` | Response header with remaining count | — |

**Usage tiers example:**

```xml
<!-- Free tier: 5K TPM -->
<azure-openai-token-limit tokens-per-minute="5000"
    counter-key="@(&quot;free-&quot; + context.Subscription.Id)"
    estimate-prompt-tokens="true" />

<!-- Premium tier: 100K TPM -->
<azure-openai-token-limit tokens-per-minute="100000"
    counter-key="@(&quot;premium-&quot; + context.Subscription.Id)"
    estimate-prompt-tokens="true" />
```

---

### Semantic Caching

Cache AI responses for semantically similar prompts. Saves 60-80% on repeated queries.

**Lookup** (in `<inbound>`):

```xml
<azure-openai-semantic-cache-lookup
    score-threshold="0.8"
    embeddings-backend-id="embeddings-backend"
    embeddings-backend-auth="system-assigned" />
```

**Store** (in `<outbound>`):

```xml
<azure-openai-semantic-cache-store duration="3600" />
```

| Attribute | Purpose | Recommended |
|-----------|---------|-------------|
| `score-threshold` | Similarity threshold (0-1) | 0.8 (lower = more cache hits) |
| `embeddings-backend-id` | Backend for embedding generation | Required |
| `embeddings-backend-auth` | Auth to embeddings backend | `system-assigned` |
| `duration` | Cache TTL in seconds | 3600 (1 hour) |

**Prerequisites:**

- An embeddings model deployed (e.g., `text-embedding-ada-002`)
- A separate backend pointing to the embeddings endpoint
- Azure Cache for Redis Enterprise with the RediSearch module (for vector storage)

```bash
# Create embeddings backend
az apim backend create --service-name <apim> --resource-group <rg> \
  --backend-id embeddings-backend --protocol http \
  --url "https://<aoai>.openai.azure.com/openai"
```

> **Note**: Semantic caching is NOT compatible with streaming responses (`"stream": true`).

---

### Token Metrics

Emit token usage metrics for monitoring and chargeback.

```xml
<azure-openai-emit-token-metric namespace="ai-gateway">
    <dimension name="Subscription" value="@(context.Subscription.Id)" />
    <dimension name="API" value="@(context.Api.Name)" />
    <dimension name="Model" value="@(context.Request.Headers.GetValueOrDefault(&quot;x-model&quot;, &quot;unknown&quot;))" />
    <dimension name="Operation" value="@(context.Operation.Id)" />
</azure-openai-emit-token-metric>
```

Emits to Azure Monitor with these metrics:

- `Total Tokens` — prompt + completion combined
- `Prompt Tokens` — input tokens
- `Completion Tokens` — output tokens

**Query token usage (KQL):**

```kql
customMetrics
| where name == "Total Tokens"
| extend Subscription = tostring(customDimensions["Subscription"])
| summarize TotalTokens = sum(value) by Subscription, bin(timestamp, 1h)
| order by TotalTokens desc
```

---

## Agent Policies

### Content Safety

Filter harmful, violent, or inappropriate content from AI inputs and outputs.

```xml
<!-- In <inbound> -->
<llm-content-safety backend-id="contentsafety-backend">
    <categories output-type="FourSeverityLevels">
        <category name="Hate" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
```

| Category | Description | Threshold Range |
|----------|-------------|-----------------|
| Hate | Discrimination, slurs | 0 (block all) - 6 (allow most) |
| Sexual | Explicit content | 0-6 |
| SelfHarm | Self-injury content | 0-6 |
| Violence | Violent content | 0-6 |

**Prerequisites:**

- Azure AI Content Safety resource deployed
- Backend configured for the Content Safety endpoint:

```bash
az apim backend create --service-name <apim> --resource-group <rg> \
  --backend-id contentsafety-backend --protocol http \
  --url "https://<contentsafety>.cognitiveservices.azure.com"
```

---

### Jailbreak Detection

Block prompt injection attacks that attempt to bypass AI safety guardrails.

```xml
<!-- shield-prompt="true" enables Prompt Shields jailbreak detection on inputs -->
<llm-content-safety backend-id="contentsafety-backend" shield-prompt="true">
    <categories>
        <category name="Hate" threshold="4" />
        <category name="Sexual" threshold="4" />
        <category name="SelfHarm" threshold="4" />
        <category name="Violence" threshold="4" />
    </categories>
</llm-content-safety>
```

Custom response for blocked content:

```xml
<on-error>
    <base />
    <choose>
        <when condition="@(context.LastError.Source == &quot;llm-content-safety&quot;)">
            <return-response>
                <set-status code="400" reason="Content Filtered" />
                <set-body>{"error": "Request blocked by content safety policy"}</set-body>
            </return-response>
        </when>
    </choose>
</on-error>
```

---

## Tool Policies

### Request Rate Limiting

Protect MCP tools and API endpoints from excessive requests.

```xml
<!-- Per-agent rate limiting -->
<rate-limit-by-key calls="30" renewal-period="60"
    counter-key="@(context.Request.Headers.GetValueOrDefault(&quot;X-Agent-Id&quot;, &quot;anonymous&quot;))"
    remaining-calls-header-name="x-ratelimit-remaining"
    retry-after-header-name="Retry-After" />
```

```xml
<!-- Per-subscription rate limiting -->
<rate-limit-by-key calls="100" renewal-period="60"
    counter-key="@(context.Subscription.Id)" />
```

---

## Combining Policies

Complete policy example with all governance layers:

```xml
<policies>
    <inbound>
        <base />

        <!-- 1. Authentication -->
        <authentication-managed-identity resource="https://cognitiveservices.azure.com" />

        <!-- 2. Semantic Cache Lookup -->
        <azure-openai-semantic-cache-lookup
            score-threshold="0.8"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned" />

        <!-- 3. Token Rate Limiting -->
        <azure-openai-token-limit
            tokens-per-minute="50000"
            counter-key="@(context.Subscription.Id)"
            estimate-prompt-tokens="true" />

        <!-- 4. Content Safety -->
        <llm-content-safety backend-id="contentsafety-backend">
            <categories>
                <category name="Hate" threshold="4" />
                <category name="Sexual" threshold="4" />
                <category name="SelfHarm" threshold="4" />
                <category name="Violence" threshold="4" />
            </categories>
        </llm-content-safety>

        <!-- 5. Backend Selection -->
        <set-backend-service backend-id="openai-backend" />

        <!-- 6. Token Metrics -->
        <azure-openai-emit-token-metric namespace="ai-gateway">
            <dimension name="Subscription" value="@(context.Subscription.Id)" />
            <dimension name="API" value="@(context.Api.Name)" />
        </azure-openai-emit-token-metric>
    </inbound>

    <backend>
        <forward-request timeout="120" />
    </backend>

    <outbound>
        <base />
        <!-- Cache store (after successful response) -->
        <azure-openai-semantic-cache-store duration="3600" />
    </outbound>

    <on-error>
        <base />
        <choose>
            <when condition="@(context.LastError.Source == &quot;azure-openai-token-limit&quot;)">
                <return-response>
                    <set-status code="429" reason="Token Limit Exceeded" />
                    <set-header name="Retry-After" exists-action="override">
                        <value>60</value>
                    </set-header>
                    <set-body>{"error": "Token rate limit exceeded. Try again later."}</set-body>
                </return-response>
            </when>
        </choose>
    </on-error>
</policies>
```

---

## Policy Quick-Decision Table

| Need | Policy | Section |
|------|--------|---------|
| Control token spend | `azure-openai-token-limit` | `<inbound>` |
| Cache similar prompts | `azure-openai-semantic-cache-lookup/store` | `<inbound>` / `<outbound>` |
| Track token usage | `azure-openai-emit-token-metric` | `<inbound>` |
| Block harmful content | `llm-content-safety` | `<inbound>` |
| Rate limit API calls | `rate-limit-by-key` | `<inbound>` |
| Authenticate to backend | `authentication-managed-identity` | `<inbound>` |
| Load balance backends | `set-backend-service` + retry | `<inbound>` |

---

## References

- [GenAI Gateway Capabilities](https://learn.microsoft.com/azure/api-management/genai-gateway-capabilities)
- [APIM Policy Reference](https://learn.microsoft.com/azure/api-management/api-management-policies)
- [AI-Gateway Samples](https://github.com/Azure-Samples/AI-Gateway)
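The placement order and quick-decision table mention load balancing via `set-backend-service` plus retry, but no example appears in the document. A minimal sketch of the retry half, assuming `openai-backend` resolves to a load-balanced backend pool created outside the policy (e.g., in Bicep or the portal):

```xml
<backend>
    <!-- Retry on throttling or server errors; buffer the body so it can be re-sent -->
    <retry condition="@(context.Response != null && (context.Response.StatusCode == 429 || context.Response.StatusCode >= 500))"
           count="2" interval="1" first-fast-retry="true">
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

The intent of pairing retry with a backend pool is that a retried request can land on a different pool member, so a throttled deployment is skipped on the next attempt.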
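`authentication-managed-identity` only attaches a token; the gateway's identity also needs an RBAC grant on the Azure OpenAI resource. A sketch of that grant, reusing the placeholder names from the CLI snippets above and assuming the built-in `Cognitive Services OpenAI User` role fits your scenario:

```bash
# Principal ID of the APIM instance's system-assigned identity
principal_id=$(az apim show --name <apim> --resource-group <rg> \
  --query identity.principalId -o tsv)

# Grant it access to the Azure OpenAI account
az role assignment create --assignee "$principal_id" \
  --role "Cognitive Services OpenAI User" \
  --scope "$(az cognitiveservices account show --name <aoai> \
    --resource-group <rg> --query id -o tsv)"
```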