# Continuous Evaluation

Enable, configure, disable, or remove continuous evaluation for a Foundry agent. Continuous evaluation automatically assesses agent responses on an ongoing basis using configured evaluators (e.g., groundedness, coherence, violence detection). This is typically the final step in the [observe loop](../observe.md) after deploying and batch-evaluating an agent — it keeps production quality visible without manual intervention.

## When to Use This Skill

USE FOR: enable continuous evaluation, disable continuous evaluation, configure continuous eval, set up monitoring evaluators, check continuous eval status, delete continuous eval, update evaluators, change sampling rate, change eval interval, production monitoring, ongoing agent quality.

DO NOT USE FOR: running a one-off batch evaluation (use [observe](../observe.md)), querying traces (use [trace](../../trace/trace.md)), creating evaluator definitions (use [observe](../observe.md) Step 1).

## Quick Reference

| Property | Value |
|----------|-------|
| MCP server | `azure` |
| Key MCP tools | `continuous_eval_create`, `continuous_eval_get`, `continuous_eval_delete`, `agent_get`, `evaluation_get` |
| Prerequisite | Agent must exist in the project |
| Local cache | `.foundry/agent-metadata.yaml` |

## Entry Points

| User Intent | Start At |
|-------------|----------|
| "Enable continuous eval" / "Set up monitoring evaluators" | [Before Starting](#before-starting--detect-current-state) → [Enable or Update](#enable-or-update) |
| "Is continuous eval running?" / "Check eval status" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) |
| "Change evaluators" / "Update sampling rate" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) → [Enable or Update](#enable-or-update) |
| "Pause evaluations" / "Disable continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Disable](#disable) |
| "Stop evaluating this agent" / "Delete continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Delete](#delete) |
| "Scores are dropping" / "Act on monitoring results" | [Before Starting](#before-starting--detect-current-state) → [Acting on Results](#acting-on-results) |

> ⚠️ **Important:** Always run [Before Starting](#before-starting--detect-current-state) to resolve the project endpoint and agent name before calling any MCP tools.

## Before Starting — Detect Current State

1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml` using the [Project Context Resolution](../../../SKILL.md#agent-project-context-resolution) workflow.
2. Extract `projectEndpoint` and `agentName` from the selected environment. If not available in metadata, use `ask_user` to collect them.
3. Use `agent_get` to verify the agent exists and note its kind (prompt or hosted).
4. Use `continuous_eval_get` to check for existing continuous evaluation configuration.
5. Jump to the appropriate entry point based on user intent.

## How It Works

The tool auto-detects the agent's kind and uses the appropriate backend:

- **Prompt agents** — evaluation runs are triggered automatically each time the agent produces a response. Parameters: `samplingRate` (percentage of responses to evaluate), `maxHourlyRuns`.
- **Hosted agents** — evaluation runs are triggered on an hourly schedule, pulling recent traces from App Insights. Parameters: `intervalHours` (hours between runs), `maxTraces` (max data points per run).

The user does not need to choose between these — the tool handles it based on agent kind.
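A hedged sketch of how the kind-specific arguments differ (names come from the **Optional parameters by agent kind** table in [Enable or Update](#enable-or-update); the values are illustrative, not recommendations):

```yaml
# Illustrative only. The remaining create arguments (evaluators, deployment)
# are the same for both kinds.

# Prompt agent: per-response evaluation
samplingRate: 25      # evaluate 25% of responses
maxHourlyRuns: 50     # cap evaluation runs per hour

# Hosted agent: scheduled evaluation over recent App Insights traces
intervalHours: 2      # run every 2 hours
maxTraces: 500        # cap data points per run
```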
## Behavioral Rules

1. **Always resolve context first.** Run [Before Starting](#before-starting--detect-current-state) before calling any MCP tool. Never assume a project endpoint or agent name.
2. **Check before creating.** Always call `continuous_eval_get` before `continuous_eval_create` to determine whether to create or update. Present existing configuration to the user.
3. **Confirm evaluator selection.** Present the evaluator list to the user before enabling. Distinguish quality evaluators (require `deploymentName`) from safety evaluators (do not).
4. **Prompt for next steps.** After each operation, present options. Never assume the path forward (e.g., after enabling, offer to check status or adjust parameters).
5. **Keep context visible.** Include the project endpoint, agent name, and environment in operation summaries.
6. **Use `continuous_eval_get` for IDs.** The `delete` tool requires a `configId` — always retrieve it from the `get` response rather than asking the user to provide it.
7. **Surface the remediation path.** When presenting continuous eval results that show score degradation, always offer to route into the [observe skill](../observe.md) for diagnosis and optimization. Monitoring without action is incomplete.
8. **Handle agent-not-found.** If `agent_get` returns a not-found error, stop the continuous eval flow. Offer to route to the [deploy skill](../../deploy/deploy.md) to create the agent first, or ask the user to verify the agent name and environment.
9. **Handle auth and endpoint errors.** If `agent_get` or `continuous_eval_create` returns a permission or authentication error, verify the project endpoint, environment, and user access. Do not suggest creating the agent — the issue is access, not existence.
10. **Validate `deploymentName` before enabling.** Do not assume `gpt-4o` exists. If quality evaluators are selected, verify a chat-capable deployment is available in the project. If none exists, stop and explain that quality evaluators cannot be enabled until a compatible deployment is provisioned.
11. **Handle invalid evaluator names.** If `continuous_eval_create` returns an invalid evaluator name error, call `evaluator_catalog_get` to list available evaluators and present valid options. Do not retry with the same arguments.
12. **Handle unexpected empty config.** If `continuous_eval_get` returns an empty list for an agent the user believes has continuous eval configured, verify the agent name and project endpoint match the intended environment in `.foundry/agent-metadata.yaml`. The configuration may exist under a different environment or resolved `agentName`.

## Operations

### Check Current State

Before enabling or modifying, check what's already configured:

```yaml
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
```

- Empty list → no continuous eval configured. Proceed to [Enable or Update](#enable-or-update).
- Non-empty list → agent already has continuous eval. Present the configuration and ask what the user wants to change.

> ⚠️ **Empty result is not proof of absence.** If the user expects a config to exist but the list is empty, verify the project endpoint and agent name match the intended environment before concluding it was never set up.
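When the list is non-empty, each entry is a `ContinuousEvalConfig` (see [Response Format](#response-format)). A hedged sketch of what one prompt-agent entry might contain; the values here are placeholders, not real output:

```yaml
- id: <configuration id>            # keep this for continuous_eval_delete
  displayName: <human-readable name>
  enabled: true
  evalId: <linked evaluation group>
  agentName: <agent name>
  scenario: standard
  samplingRate: 25
  maxHourlyRuns: 50
```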
### Enable or Update

**Replace Semantics**: `continuous_eval_create` always creates a new evaluation group with the provided evaluators and points the evaluation rule at it. Always pass the complete desired configuration on every call — omitted evaluators are dropped, not preserved.

> ⚠️ **Do not assume `gpt-4o` exists.** Before setting `deploymentName`, verify a chat-capable deployment is available in the project. If none exists, quality evaluators cannot be enabled — only safety evaluators (which do not require a deployment) will work.

```yaml
Tool: continuous_eval_create
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
  evaluatorNames: ["groundedness", "coherence", "fluency"] # Illustrative — align with your batch eval evaluators
  deploymentName: "gpt-4o" # Required for quality evaluators
  enabled: true # Set false to disable without deleting
```

**Evaluator selection guidance:**
- **Quality evaluators** (require `deploymentName`): coherence, fluency, relevance, groundedness, intent_resolution, task_adherence, tool_call_accuracy
- **Safety evaluators** (no `deploymentName` needed): violence, sexual, self_harm, hate_unfairness, indirect_attack, code_vulnerability, protected_material
- Custom evaluators from the project's evaluator catalog are also supported by name.

**Optional parameters by agent kind:**

| Parameter | Applies To | Description | Default |
|-----------|-----------|-------------|---------|
| `samplingRate` | Prompt | Percentage of responses to evaluate (1-100) | All responses |
| `maxHourlyRuns` | Prompt | Cap on evaluation runs per hour | No limit |
| `intervalHours` | Hosted | Hours between evaluation runs | 1 |
| `maxTraces` | Hosted | Max data points per evaluation run | 1000 |
| `scenario` | Prompt | Evaluation scenario: `standard` (quality and safety metrics, default) or `business` (business success metrics). An agent can have one of each simultaneously. | `standard` |

### Disable

To temporarily disable without changing configuration, pass the configuration currently in use along with `enabled: false`. Because `continuous_eval_create` has replace semantics, omitting parameters will change the configuration when re-enabled. The `continuous_eval_get` response does not include evaluator names directly — they are stored in the linked evaluation group — so retrieve them via `evaluation_get` first. If multiple configurations are returned in the `continuous_eval_get` response, present the list to the user and ask which to target.

```yaml
# Step 1: Get the evalId, then retrieve current evaluators from the eval group
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
# Note the evalId from the response
```

```yaml
Tool: evaluation_get
Arguments:
  projectEndpoint: <project endpoint>
  evalId: <evalId from above>
# Note the evaluator names from the evaluation group's testing criteria
```

```yaml
# Step 2: Disable with the same evaluators
Tool: continuous_eval_create
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
  evaluatorNames: ["groundedness", "coherence", "fluency"] # Must match current config
  deploymentName: "gpt-4o"
  enabled: false
```

### Delete

To permanently remove continuous evaluation configuration:

```yaml
Tool: continuous_eval_delete
Arguments:
  projectEndpoint: <project endpoint>
  configId: <id from continuous_eval_get>
  agentName: <agent name>
```

Always call `continuous_eval_get` first to retrieve the `id` field of the configuration to delete. If multiple configurations are returned, present the list to the user and ask which to target.

## Acting on Results

Continuous evaluation generates ongoing scores — but monitoring is only useful when you **act** on what it reveals. This section covers how to consume evaluation results and the remediation loop when scores degrade.

### Step 1: Read Evaluation Scores

The `continuous_eval_get` response includes an `evalId` that links to the evaluation group. Use this to retrieve actual run results:

```yaml
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
# Note the evalId from the response
```

```yaml
Tool: evaluation_get
Arguments:
  projectEndpoint: <project endpoint>
  evalId: <evalId from continuous_eval_get>
  isRequestForRuns: true
# Returns evaluation runs with per-evaluator scores
```

Review the run results for score trends. Each run contains scores for every configured evaluator. Look for:
- **Scores below threshold** — any evaluator consistently scoring below your acceptable baseline
- **Score degradation over time** — scores that were previously healthy but are trending downward
- **Safety flags** — any non-zero safety evaluator scores that indicate harmful content
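As a hedged illustration of a degrading trend (the run payload shape is not defined in this skill, so the field names below are placeholders; the per-evaluator scores are what matter, and the groundedness drop mirrors the 4.2 → 2.8 example in Step 2):

```yaml
# Illustrative only; not the actual evaluation run schema
runs:
  - completedAt: <day 1>
    scores: { groundedness: 4.2, coherence: 4.5, violence: 0 }
  - completedAt: <day 2>
    scores: { groundedness: 3.6, coherence: 4.4, violence: 0 }
  - completedAt: <day 3>
    scores: { groundedness: 2.8, coherence: 4.5, violence: 0 }
# groundedness trends down while coherence holds steady and no safety flags
# appear: this points to a grounding regression, not a general quality drop.
```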
### Step 2: Triage the Regression

1. **Identify the failing evaluators.** From the evaluation runs, note which specific evaluators are scoring low (e.g., `groundedness` dropping from 4.2 to 2.8).
2. **Correlate with traces.** Use the [trace skill](../../trace/trace.md) to search App Insights for the conversations that triggered low scores. Look for patterns: specific query types, tool-call failures, or grounding gaps.
3. **Compare to baseline.** If batch eval results exist in `.foundry/results/`, compare continuous eval scores against the last known-good batch run to determine whether this is a new regression or a pre-existing gap.

### Step 3: Remediate via the Observe Loop

Once you understand the failure pattern, use the [observe skill](../observe.md) to fix it:

| Symptom | Action |
|---------|--------|
| Quality scores dropping (coherence, relevance, task_adherence) | Run [Step 3: Analyze](analyze-results.md) to cluster failures, then [Step 4: Optimize](optimize-deploy.md) to improve the prompt |
| Safety evaluators flagging (violence, indirect_attack) | Review flagged traces via [trace skill](../../trace/trace.md), then update agent instructions or tool definitions to address the pattern |
| Grounding failures | Check whether the agent's data sources are still accessible and returning expected results; update knowledge index or tool configuration |
| Scores fluctuating after a deploy | Run [Step 5: Compare](compare-iterate.md) between the current and previous agent version to isolate the regression |

### Step 4: Verify the Fix

After deploying a fix through the observe loop:

1. **Re-run a batch eval** via [observe](../observe.md) Step 2 against the same test cases to confirm the fix.
2. **Read continuous eval scores** from the next evaluation cycle using `evaluation_get` with the `evalId` — verify scores have recovered.
3. **Adjust evaluators if needed.** If the regression exposed a gap in evaluator coverage, use `continuous_eval_create` to update the configuration with additional or refined evaluators.

> 💡 **Tip:** The continuous eval → observe → deploy → continuous eval cycle is the core production quality loop. Continuous eval detects; observe diagnoses and fixes; continuous eval verifies.

## Response Format

All tools return a unified `ContinuousEvalConfig` shape. The `get` tool returns a list; `create` returns a single object.

| Field | Description | Present For |
|-------|-------------|-------------|
| `id` | Configuration identifier (needed for delete) | All |
| `displayName` | Human-readable name | All |
| `enabled` | Whether evaluation is active | All |
| `evalId` | Linked evaluation group containing evaluator definitions | All |
| `agentName` | Target agent name | All |
| `status` | Provisioning status | Hosted only |
| `scenario` | Evaluation scenario (`standard` or `business`) | Prompt only |
| `samplingRate` | Percentage of responses evaluated | Prompt only |
| `maxHourlyRuns` | Cap on runs per hour | Prompt only |
| `intervalHours` | Hours between scheduled runs | Hosted only |
| `maxTraces` | Max data points per run | Hosted only |
| `createdAt` | Creation timestamp | All |
| `createdBy` | Creator identity | All |
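Complementing the prompt-agent sketch in [Check Current State](#check-current-state), a hedged example of a hosted-agent entry (placeholder values, fields taken from the table above):

```yaml
id: <configuration id>
displayName: <human-readable name>
enabled: true
evalId: <linked evaluation group>
agentName: <agent name>
status: <provisioning status>
intervalHours: 1
maxTraces: 1000
createdAt: <creation timestamp>
createdBy: <creator identity>
```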
## Related Skills

| User Intent | Skill |
|-------------|-------|
| "Evaluate my agent" / "Run a batch eval" | [observe skill](../observe.md) |
| "Scores are dropping" / "Diagnose and fix quality regression" | [observe skill](../observe.md) (Steps 3–5) |
| "Analyze production traces" / "Find flagged conversations" | [trace skill](../../trace/trace.md) |
| "Deploy my agent" / "Redeploy after fix" | [deploy skill](../../deploy/deploy.md) |