Analyze Failures — Find and Cluster Failing Traces
Identify failing agent traces, group them by root cause, and produce a prioritized action table.
Step 1 — Find Failing Traces
⚠️ Hosted agents:
`gen_ai.agent.name` on `dependencies` holds the code-level class name (e.g., `BingSearchAgent`), NOT the Foundry agent name. To filter by Foundry agent name, use the Hosted Agent Variant below.
dependencies
| where timestamp > ago(24h)
| where success == false or toint(resultCode) >= 400
| extend
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    errorType = tostring(customDimensions["error.type"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    agentName = tostring(customDimensions["gen_ai.agent.name"]),
    conversationId = tostring(customDimensions["gen_ai.conversation.id"])
| project timestamp, name, duration, resultCode, errorType, operation, model,
    agentName, conversationId, operation_Id, id
| order by timestamp desc
| take 100

Step 2 — Cluster by Error Type
dependencies
| where timestamp > ago(24h)
| where success == false or toint(resultCode) >= 400
| extend
    errorType = tostring(customDimensions["error.type"]),
    operation = tostring(customDimensions["gen_ai.operation.name"])
| summarize
    count = count(),
    firstSeen = min(timestamp),
    lastSeen = max(timestamp),
    avgDuration = avg(duration),
    sampleOperationId = take_any(operation_Id)
    by errorType, operation, resultCode
| order by count desc

Step 3 — Prioritized Action Table
Present results as:
| Priority | Error Type | Operation | Count | Result Code | Suggested Action |
|---|---|---|---|---|---|
| P0 | timeout | invoke_agent | 15 | 504 | Check agent container health, increase timeout |
| P1 | rate_limited | chat | 8 | 429 | Check quota, add retry logic |
| P2 | content_filter | chat | 5 | 400 | Review prompt for policy violations |
| P3 | tool_error | execute_tool | 3 | 500 | Check tool implementation and permissions |
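Prioritization: P0 goes to the cluster with the highest count or the most severe result code (5xx). Rank the remaining clusters by count × recency, where recency is based on the lastSeen timestamp from Step 2.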
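
The ranking rule can also be approximated directly in KQL. The sketch below is one possible reading of the rule, not a definitive scoring method: it sorts 5xx clusters first, then orders by count × recency, with recency assumed to be 1 / (1 + hours since lastSeen). The column names failureCount, severity, recencyWeight, score, and priority are illustrative and are not used elsewhere in this document.

```kusto
dependencies
| where timestamp > ago(24h)
| where success == false or toint(resultCode) >= 400
| extend
    errorType = tostring(customDimensions["error.type"]),
    operation = tostring(customDimensions["gen_ai.operation.name"])
| summarize
    failureCount = count(),
    lastSeen = max(timestamp)
    by errorType, operation, resultCode
// Assumption: any 5xx result code counts as "most severe".
| extend severity = iff(toint(resultCode) >= 500, 1, 0)
// Assumption: recency decays with the hours elapsed since the last failure.
| extend recencyWeight = 1.0 / (1.0 + (now() - lastSeen) / 1h)
| extend score = failureCount * recencyWeight
| order by severity desc, score desc
// Label rows P0, P1, ... in ranked order (row_number() starts at 1).
| extend priority = strcat("P", row_number() - 1)
| project priority, errorType, operation, failureCount, resultCode, lastSeen, score
```

The Suggested Action column still has to be filled in by hand, since it depends on what each error type means for the agent in question.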
Step 4 — Drill Into Specific Failure
When the user selects a cluster, show individual failing traces:
dependencies
| where timestamp > ago(24h)
| where success == false or toint(resultCode) >= 400
| where tostring(customDimensions["error.type"]) == "<selected_error_type>"
| where tostring(customDimensions["gen_ai.operation.name"]) == "<selected_operation>"
| project timestamp, name, duration, resultCode,
    conversationId = tostring(customDimensions["gen_ai.conversation.id"]),
    responseId = tostring(customDimensions["gen_ai.response.id"]),
    operation_Id
| order by timestamp desc
| take 20

Also check the exceptions table for stack traces:
exceptions
| where timestamp > ago(24h)
| where operation_Id in ("<operation_id_1>", "<operation_id_2>")
| project timestamp, type, message, outerMessage, details, operation_Id
| order by timestamp desc

Offer to view the full conversation for any trace via Conversation Detail.
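
To avoid copying operation IDs by hand, the drill-down and the exceptions lookup can be chained into one query. This is a sketch that reuses the `let` plus `in` pattern shown elsewhere in this document; failingOps is an illustrative name, and the placeholders are the same ones used in the drill-down query.

```kusto
// Collect the operation IDs of the failing dependencies in the selected cluster.
let failingOps = dependencies
    | where timestamp > ago(24h)
    | where success == false or toint(resultCode) >= 400
    | where tostring(customDimensions["error.type"]) == "<selected_error_type>"
    | where tostring(customDimensions["gen_ai.operation.name"]) == "<selected_operation>"
    | distinct operation_Id;
// Return only the exceptions recorded under those operations.
exceptions
| where timestamp > ago(24h)
| where operation_Id in (failingOps)
| project timestamp, type, outerMessage, details, operation_Id
| order by timestamp desc
| take 20
```

An empty result does not necessarily mean the cluster is healthy; it may only mean that no exception telemetry was recorded for those operations.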
Hosted Agent Variant — Failures
For hosted agents, the Foundry agent name lives on requests, not dependencies. Use a two-step lookup: collect the matching request IDs, then filter dependencies by operation_ParentId:
let reqIds = requests
    | where timestamp > ago(24h)
    | where customDimensions["gen_ai.agent.name"] == "<foundry-agent-name>"
    | distinct id;
dependencies
| where timestamp > ago(24h)
| where operation_ParentId in (reqIds)
| where success == false or toint(resultCode) >= 400
| extend
    operation = tostring(customDimensions["gen_ai.operation.name"]),
    errorType = tostring(customDimensions["error.type"]),
    model = tostring(customDimensions["gen_ai.request.model"]),
    conversationId = tostring(customDimensions["gen_ai.conversation.id"])
| project timestamp, name, duration, resultCode, errorType, operation, model,
    conversationId, operation_ParentId, operation_Id
| order by timestamp desc
| take 100
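
The same request-ID filter can be combined with the Step 2 aggregation to cluster failures for a single hosted agent. The sketch below only recombines the two queries already shown; failureCount is an illustrative column name.

```kusto
// Request IDs that belong to the named Foundry agent.
let reqIds = requests
    | where timestamp > ago(24h)
    | where customDimensions["gen_ai.agent.name"] == "<foundry-agent-name>"
    | distinct id;
dependencies
| where timestamp > ago(24h)
| where operation_ParentId in (reqIds)
| where success == false or toint(resultCode) >= 400
| extend
    errorType = tostring(customDimensions["error.type"]),
    operation = tostring(customDimensions["gen_ai.operation.name"])
// Same grouping keys as Step 2, scoped to the hosted agent's requests.
| summarize
    failureCount = count(),
    firstSeen = min(timestamp),
    lastSeen = max(timestamp),
    sampleOperationId = take_any(operation_Id)
    by errorType, operation, resultCode
| order by failureCount desc
```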