foundry-agent/observe/references/analyze-results.md
# Steps 3–5 — Download Results, Cluster Failures, Dive Into Category

## Step 3 — Download Results

`evaluation_get` returns run metadata but **not** full per-row output. Write a Python script (save to `scripts/`) to download detailed results using the **Azure AI Projects Python SDK**.

### Prerequisites

```text
pip install azure-ai-projects>=2.0.0 azure-identity
```

### SDK Client Setup

```python
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_client = AIProjectClient(
    endpoint=project_endpoint,  # e.g. "https://<hub>.services.ai.azure.com/api/projects/<project>"
    credential=DefaultAzureCredential(),
)
# The evals API lives on the OpenAI sub-client, not on AIProjectClient directly
client = project_client.get_openai_client()
```

> ⚠️ **Common mistake:** Calling `project_client.evals` directly — the `evals` namespace is on the OpenAI client returned by `get_openai_client()`, not on `AIProjectClient` itself.

### Retrieve Run Status

```python
run = client.evals.runs.retrieve(run_id=run_id, eval_id=eval_id)
print(f"Status: {run.status} Report: {run.report_url}")
```

### Download Per-Row Output Items

The SDK handles pagination automatically — no manual `has_more` / `after` loop required.

```python
output_items = list(client.evals.runs.output_items.list(run_id=run_id, eval_id=eval_id))
all_items = [item.model_dump() for item in output_items]
```

> 💡 **Tip:** Use `model_dump()` to convert each SDK object to a plain dict for JSON serialization.

### Data Structure

Query/response data lives in `datasource_item.query` and `datasource_item['sample.output_text']`, **not** in `sample.input`/`sample.output` (which are empty arrays). Parse `datasource_item` fields when extracting queries and responses for analysis.

> ⚠️ **LLM judge knowledge cutoff:** When evaluating agents that use real-time data sources (web search, Bing Grounding, live APIs), the LLM judge may flag factually correct but temporally recent responses as "fabricated" or "unverifiable" because the judge's training data predates the agent's live results. Check failure reasons for phrases like "cannot verify," "beyond knowledge cutoff," or "no evidence" before treating them as real failures. See Behavioral Rule 13 in `observe.md` for mitigations.

### Custom Evaluator Dual-Entry Parsing

Custom evaluators produce **two** result entries per item in the `results` array:

| Entry | `metric` field | Has score? | Has reason/label/passed? |
|-------|----------------|------------|--------------------------|
| Entry 1 | `"custom_score"` | ✅ numeric score | ❌ null |
| Entry 2 | `"{evaluator_name}"` | ❌ null | ✅ real reason, label, passed |

To get the complete picture, merge both entries:

```python
def extract_evaluator_result(item, evaluator_name):
    """Merge the dual entries for a custom evaluator into one result."""
    score_entry = None
    detail_entry = None
    for r in item.get("results", []):
        metric = r.get("metric", "")
        if metric == "custom_score":
            score_entry = r
        elif metric == evaluator_name:
            detail_entry = r
    if not detail_entry:
        return None
    return {
        "score": score_entry.get("score") if score_entry else None,
        "passed": detail_entry.get("passed"),
        "reason": detail_entry.get("reason"),
        "label": detail_entry.get("label"),
    }
```

> ⚠️ **Common mistake:** Reading only the first matching result entry for a custom evaluator gives you the score but null reason (or vice versa). Always merge both entries. Built-in evaluators do **not** have this dual-entry pattern - they produce a single entry with all fields populated.

**Evidence from actual eval run** (item 1, `behavioral_adherence`):

```jsonc
// Entry 1: has score, null reason
{"name": "behavioral_adherence", "metric": "custom_score", "score": 1, "reason": null, "passed": null}

// Entry 2: has reason, null score
{"name": "behavioral_adherence", "metric": "behavioral_adherence", "score": null,
 "reason": "The response provides outdated and fabricated information...", "passed": false}
```

### Persist Results

Save results to `.foundry/results/<environment>/<eval-id>/<run-id>.json` (use `json.dump` with `default=str` for non-serializable fields). Print summary: total items, passed, failed, errored counts.
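A minimal sketch of this persistence step, assuming the `all_items`, `eval_id`, and `run_id` values from the snippets above; the `environment` label and the per-item `status` values counted here (`pass` / `fail` / `error`) are assumptions to adapt to what your downloaded items actually contain:

```python
import json
from pathlib import Path

def persist_results(all_items, environment, eval_id, run_id):
    """Save downloaded output items and print a pass/fail/error summary."""
    out_path = Path(".foundry/results") / environment / eval_id / f"{run_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        # default=str handles datetimes and other non-JSON-serializable fields
        json.dump(all_items, f, indent=2, default=str)

    # Assumed per-item status values; adjust to whatever your run actually reports
    statuses = [item.get("status") for item in all_items]
    print(f"Saved {len(all_items)} items to {out_path}")
    print(
        f"Passed: {statuses.count('pass')}  "
        f"Failed: {statuses.count('fail')}  "
        f"Errored: {statuses.count('error')}"
    )
    return out_path
```

Here `environment` (e.g. `"dev"`) is whatever label your workflow uses for the target environment, so the file lands under the path convention above.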
> ⚠️ **Show Data Viewer deeplinks (for VS Code runtime only):** Append a Data Viewer deeplink immediately after any reference to an evaluation result file in your response. Format: "[Open in Data Viewer](vscode://ms-windows-ai-studio.windows-ai-studio/open_data_viewer?file=<file_path>&source=microsoft-foundry-skill) for details and perform analysis".

## Step 4 — Cluster Failures by Root Cause

Analyze every row in the results and group failures into clusters (a minimal clustering sketch appears at the end of this page):

| Cluster | Description |
|---------|-------------|
| Incorrect / hallucinated answer | Agent gave a wrong or fabricated response |
| Incomplete answer | Agent missed key parts |
| Tool call failure | Agent failed to invoke or misused a tool |
| Safety / content violation | Flagged by safety evaluators |
| Runtime error | Agent crashed or returned an error |
| Off-topic / refusal | Agent refused or went off-topic |

Produce a prioritized action table:

| Focus | Cluster | Suggested Action |
|-------|---------|------------------|
| Runtime blockers | Runtime errors or failing suites tagged `tier=smoke` | Check container logs or fix blockers first |
| Key regressions | Incorrect answers on suites tagged `purpose=regression` or `tier=smoke` | Optimize prompt or tool instructions |
| Broader quality gaps | Incomplete answers or coverage-oriented suites | Optimize prompt or expand context |
| Tooling issues | Tool call failures | Fix tool definitions or instructions |
| Safety issues | Safety violations | Add guardrails to instructions |

**Rule:** Prioritize runtime errors first, then suites tagged `tier=smoke`, then suites tagged `purpose=regression`, then broader coverage suites by count × severity.

## Step 5 — Dive Into Category

When the user wants to inspect a specific cluster, display the individual rows: evaluation-suite ID, input query, the agent's original response, evaluator scores, and failure reason. Let the user confirm which category or evaluation suite to optimize.

## Next Steps

After clustering, proceed to [Step 6: Optimize Prompt](optimize-deploy.md).
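For reference, a minimal sketch of the Step 4 clustering pass over a persisted results file. The keyword heuristics and the fallback bucket are illustrative assumptions only; in practice, cluster assignment comes from reading each failure reason. `extract_evaluator_result` is the helper defined in Step 3.

```python
import json
from collections import Counter
from pathlib import Path

# Illustrative keyword heuristics; real clustering means reading each failure reason.
CLUSTER_KEYWORDS = {
    "Runtime error": ["error", "exception", "timeout", "crash"],
    "Tool call failure": ["tool", "function call"],
    "Safety / content violation": ["safety", "harmful", "content filter"],
    "Off-topic / refusal": ["refus", "off-topic", "cannot help"],
    "Incomplete answer": ["incomplete", "missing", "partial"],
    "Incorrect / hallucinated answer": ["incorrect", "fabricated", "hallucinat", "wrong", "outdated"],
}

def cluster_failures(results_path, evaluator_name):
    """Bucket failed rows from a persisted results file into the Step 4 clusters."""
    items = json.loads(Path(results_path).read_text(encoding="utf-8"))
    counts = Counter()
    for item in items:
        merged = extract_evaluator_result(item, evaluator_name)  # helper from Step 3
        if not merged or merged["passed"] is not False:
            continue  # only cluster rows that explicitly failed
        reason = (merged["reason"] or "").lower()
        cluster = next(
            (name for name, kws in CLUSTER_KEYWORDS.items() if any(k in reason for k in kws)),
            "Incorrect / hallucinated answer",  # fallback bucket; review these rows by hand
        )
        counts[cluster] += 1
    for cluster, n in counts.most_common():
        print(f"{n:4d}  {cluster}")
    return counts
```

Combine the resulting counts with the suite tags (`tier=smoke`, `purpose=regression`) to fill in the prioritized action table above.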