foundry-agent/observe/references/analyze-results.md
# Steps 3–5 — Download Results, Cluster Failures, Dive Into Category

## Step 3 — Download Results

`evaluation_get` returns run metadata but **not** full per-row output. Write a Python script (save to `scripts/`) to download detailed results using the **Azure AI Projects Python SDK**.

### Prerequisites

```text
pip install azure-ai-projects>=2.0.0 azure-identity
```

### SDK Client Setup

```python
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

project_client = AIProjectClient(
    endpoint=project_endpoint,  # e.g. "https://<hub>.services.ai.azure.com/api/projects/<project>"
    credential=DefaultAzureCredential(),
)
# The evals API lives on the OpenAI sub-client, not on AIProjectClient directly
client = project_client.get_openai_client()
```

> ⚠️ **Common mistake:** Calling `project_client.evals` directly — the `evals` namespace is on the OpenAI client returned by `get_openai_client()`, not on `AIProjectClient` itself.

### Retrieve Run Status

```python
run = client.evals.runs.retrieve(run_id=run_id, eval_id=eval_id)
print(f"Status: {run.status} Report: {run.report_url}")
```

### Download Per-Row Output Items

The SDK handles pagination automatically — no manual `has_more` / `after` loop required.

```python
output_items = list(client.evals.runs.output_items.list(run_id=run_id, eval_id=eval_id))
all_items = [item.model_dump() for item in output_items]
```

> 💡 **Tip:** Use `model_dump()` to convert each SDK object to a plain dict for JSON serialization.

### Data Structure

Query/response data lives in `datasource_item.query` and `datasource_item['sample.output_text']`, **not** in `sample.input`/`sample.output` (which are empty arrays). Parse `datasource_item` fields when extracting queries and responses for analysis.

> ⚠️ **LLM judge knowledge cutoff:** When evaluating agents that use real-time data sources (web search, Bing Grounding, live APIs), the LLM judge may flag factually correct but temporally recent responses as "fabricated" or "unverifiable" because the judge's training data predates the agent's live results. Check failure reasons for phrases like "cannot verify," "beyond knowledge cutoff," or "no evidence" before treating them as real failures. See Behavioral Rule 13 in `observe.md` for mitigations.

### Custom Evaluator Dual-Entry Parsing

Custom evaluators produce **two** result entries per item in the `results` array:

| Entry | `metric` field | Has score? | Has reason/label/passed? |
|-------|----------------|------------|--------------------------|
| Entry 1 | `"custom_score"` | ✅ numeric score | ❌ null |
| Entry 2 | `"{evaluator_name}"` | ❌ null | ✅ real reason, label, passed |

To get the complete picture, merge both entries:

```python
def extract_evaluator_result(item, evaluator_name):
    """Merge the dual entries for a custom evaluator into one result."""
    score_entry = None
    detail_entry = None
    for r in item.get("results", []):
        metric = r.get("metric", "")
        if metric == "custom_score":
            score_entry = r
        elif metric == evaluator_name:
            detail_entry = r
    if not detail_entry:
        return None
    return {
        "score": score_entry.get("score") if score_entry else None,
        "passed": detail_entry.get("passed"),
        "reason": detail_entry.get("reason"),
        "label": detail_entry.get("label"),
    }
```

> ⚠️ **Common mistake:** Reading only the first matching result entry for a custom evaluator gives you the score but null reason (or vice versa). Always merge both entries. Built-in evaluators do **not** have this dual-entry pattern - they produce a single entry with all fields populated.

**Evidence from actual eval run** (item 1, `behavioral_adherence`):

```jsonc
// Entry 1: has score, null reason
{"name": "behavioral_adherence", "metric": "custom_score", "score": 1, "reason": null, "passed": null}

// Entry 2: has reason, null score
{"name": "behavioral_adherence", "metric": "behavioral_adherence", "score": null,
 "reason": "The response provides outdated and fabricated information...", "passed": false}
```

### Persist Results

Save results to `.foundry/results/<environment>/<eval-id>/<run-id>.json` (use `json.dump` with `default=str` for non-serializable fields). Print summary: total items, passed, failed, errored counts.
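A minimal sketch of this persistence step, assuming the `all_items`, `eval_id`, and `run_id` values from the snippets above; the `environment` label and the per-item `status` values counted here (`pass` / `fail` / `error`) are assumptions to adapt to what your downloaded items actually contain:

```python
import json
from pathlib import Path

def persist_results(all_items, environment, eval_id, run_id):
    """Save downloaded output items and print a pass/fail/error summary."""
    out_path = Path(".foundry/results") / environment / eval_id / f"{run_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        # default=str handles datetimes and other non-JSON-serializable fields
        json.dump(all_items, f, indent=2, default=str)

    # Assumed per-item status values; adjust to whatever your run actually reports
    statuses = [item.get("status") for item in all_items]
    print(f"Saved {len(all_items)} items to {out_path}")
    print(
        f"Passed: {statuses.count('pass')}  "
        f"Failed: {statuses.count('fail')}  "
        f"Errored: {statuses.count('error')}"
    )
    return out_path
```

Here `environment` (e.g. `"dev"`) is whatever label your workflow uses for the target environment, so the file lands under the path convention above.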
> ⚠️ **Show Data Viewer deeplinks (for VS Code runtime only):** Append a Data Viewer deeplink immediately after any reference to an evaluation result file in your response. Format: "[Open in Data Viewer](vscode://ms-windows-ai-studio.windows-ai-studio/open_data_viewer?file=<file_path>&source=microsoft-foundry-skill) for details and perform analysis".

## Step 4 — Cluster Failures by Root Cause

Analyze every row in the results and group failures into clusters (a minimal clustering sketch appears at the end of this page):

| Cluster | Description |
|---------|-------------|
| Incorrect / hallucinated answer | Agent gave a wrong or fabricated response |
| Incomplete answer | Agent missed key parts |
| Tool call failure | Agent failed to invoke or misused a tool |
| Safety / content violation | Flagged by safety evaluators |
| Runtime error | Agent crashed or returned an error |
| Off-topic / refusal | Agent refused or went off-topic |

Produce a prioritized action table:

| Focus | Cluster | Suggested Action |
|-------|---------|------------------|
| Runtime blockers | Runtime errors or failing suites tagged `tier=smoke` | Check container logs or fix blockers first |
| Key regressions | Incorrect answers on suites tagged `purpose=regression` or `tier=smoke` | Optimize prompt or tool instructions |
| Broader quality gaps | Incomplete answers or coverage-oriented suites | Optimize prompt or expand context |
| Tooling issues | Tool call failures | Fix tool definitions or instructions |
| Safety issues | Safety violations | Add guardrails to instructions |

**Rule:** Prioritize runtime errors first, then suites tagged `tier=smoke`, then suites tagged `purpose=regression`, then broader coverage suites by count × severity.

## Step 5 — Dive Into Category

When the user wants to inspect a specific cluster, display the individual rows: evaluation-suite ID, input query, the agent's original response, evaluator scores, and failure reason. Let the user confirm which category or evaluation suite to optimize.

## Next Steps

After clustering, proceed to [Step 6: Optimize Prompt](optimize-deploy.md).
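For reference, a minimal sketch of the Step 4 clustering pass over a persisted results file. The keyword heuristics and the fallback bucket are illustrative assumptions only; in practice, cluster assignment comes from reading each failure reason. `extract_evaluator_result` is the helper defined in Step 3.

```python
import json
from collections import Counter
from pathlib import Path

# Illustrative keyword heuristics; real clustering means reading each failure reason.
CLUSTER_KEYWORDS = {
    "Runtime error": ["error", "exception", "timeout", "crash"],
    "Tool call failure": ["tool", "function call"],
    "Safety / content violation": ["safety", "harmful", "content filter"],
    "Off-topic / refusal": ["refus", "off-topic", "cannot help"],
    "Incomplete answer": ["incomplete", "missing", "partial"],
    "Incorrect / hallucinated answer": ["incorrect", "fabricated", "hallucinat", "wrong", "outdated"],
}

def cluster_failures(results_path, evaluator_name):
    """Bucket failed rows from a persisted results file into the Step 4 clusters."""
    items = json.loads(Path(results_path).read_text(encoding="utf-8"))
    counts = Counter()
    for item in items:
        merged = extract_evaluator_result(item, evaluator_name)  # helper from Step 3
        if not merged or merged["passed"] is not False:
            continue  # only cluster rows that explicitly failed
        reason = (merged["reason"] or "").lower()
        cluster = next(
            (name for name, kws in CLUSTER_KEYWORDS.items() if any(k in reason for k in kws)),
            "Incorrect / hallucinated answer",  # fallback bucket; review these rows by hand
        )
        counts[cluster] += 1
    for cluster, n in counts.most_common():
        print(f"{n:4d}  {cluster}")
    return counts
```

Combine the resulting counts with the suite tags (`tier=smoke`, `purpose=regression`) to fill in the prioritized action table above.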