Continuous Evaluation
Enable, configure, disable, or remove continuous evaluation for a Foundry agent. Continuous evaluation automatically assesses agent responses on an ongoing basis using configured evaluators (e.g., groundedness, coherence, violence detection). This is typically the final step in the observe loop after deploying and batch-evaluating an agent — it keeps production quality visible without manual intervention.
When to Use This Skill
USE FOR: enable continuous evaluation, disable continuous evaluation, configure continuous eval, set up monitoring evaluators, check continuous eval status, delete continuous eval, update evaluators, change sampling rate, change eval interval, production monitoring, ongoing agent quality.
DO NOT USE FOR: running a one-off batch evaluation (use observe), querying traces (use trace), creating evaluator definitions (use observe Step 1).
Quick Reference
| Property | Value |
|---|---|
| MCP server | azure |
| Key MCP tools | continuous_eval_create, continuous_eval_get, continuous_eval_delete, agent_get, evaluation_get |
| Prerequisite | Agent must exist in the project |
| Local cache | .foundry/agent-metadata.yaml |
Entry Points
| User Intent | Start At |
|---|---|
| "Enable continuous eval" / "Set up monitoring evaluators" | Before Starting → Enable or Update |
| "Is continuous eval running?" / "Check eval status" | Before Starting → Check Current State |
| "Change evaluators" / "Update sampling rate" | Before Starting → Check Current State → Enable or Update |
| "Pause evaluations" / "Disable continuous eval" | Before Starting → Disable |
| "Stop evaluating this agent" / "Delete continuous eval" | Before Starting → Delete |
| "Scores are dropping" / "Act on monitoring results" | Before Starting → Acting on Results |
⚠️ Important: Always run Before Starting to resolve the project endpoint and agent name before calling any MCP tools.
Before Starting — Detect Current State
- Resolve the target agent root and environment from `.foundry/agent-metadata.yaml` using the Project Context Resolution workflow.
- Extract `projectEndpoint` and `agentName` from the selected environment. If not available in metadata, use `ask_user` to collect them.
- Use `agent_get` to verify the agent exists and note its kind (prompt or hosted).
- Use `continuous_eval_get` to check for existing continuous evaluation configuration.
- Jump to the appropriate entry point based on user intent.
How It Works
The tool auto-detects the agent's kind and uses the appropriate backend:
- Prompt agents — evaluation runs are triggered automatically each time the agent produces a response. Parameters: `samplingRate` (percentage of responses to evaluate), `maxHourlyRuns`.
- Hosted agents — evaluation runs are triggered on an hourly schedule, pulling recent traces from App Insights. Parameters: `intervalHours` (hours between runs), `maxTraces` (max data points per run).
The user does not need to choose between these — the tool handles it based on agent kind.
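To make the dispatch concrete, here is a minimal Python sketch of the kind-based parameter selection. It assumes a hypothetical `call_tool` helper for invoking the MCP tools, an assumed `kind` field on the `agent_get` response, and illustrative throttling values; the real `continuous_eval_create` tool performs this selection internally.

```python
# Sketch only: the continuous_eval_create tool does this dispatch itself.
# `call_tool` is a hypothetical MCP-client helper, not a real azure MCP API.

def enable_continuous_eval(call_tool, endpoint: str, agent: str,
                           evaluators: list[str], deployment: str | None = None):
    # "kind" is an assumption about the agent_get response shape.
    agent_info = call_tool("agent_get", projectEndpoint=endpoint, agentName=agent)

    args = {
        "projectEndpoint": endpoint,
        "agentName": agent,
        "evaluatorNames": evaluators,
        "enabled": True,
    }
    if deployment:
        args["deploymentName"] = deployment  # required for quality evaluators

    if agent_info["kind"] == "prompt":
        # Per-response triggering: throttle with sampling and an hourly cap.
        args.update(samplingRate=25, maxHourlyRuns=50)  # illustrative values
    else:
        # Hosted: scheduled runs pulling recent traces from App Insights.
        args.update(intervalHours=1, maxTraces=1000)    # documented defaults

    return call_tool("continuous_eval_create", **args)
```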
Behavioral Rules
- Always resolve context first. Run Before Starting before calling any MCP tool. Never assume a project endpoint or agent name.
- Check before creating. Always call `continuous_eval_get` before `continuous_eval_create` to determine whether to create or update. Present existing configuration to the user.
- Confirm evaluator selection. Present the evaluator list to the user before enabling. Distinguish quality evaluators (require `deploymentName`) from safety evaluators (do not).
- Prompt for next steps. After each operation, present options. Never assume the path forward (e.g., after enabling, offer to check status or adjust parameters).
- Keep context visible. Include the project endpoint, agent name, and environment in operation summaries.
- Use `continuous_eval_get` for IDs. The delete tool requires a `configId` — always retrieve it from the get response rather than asking the user to provide it.
- Surface the remediation path. When presenting continuous eval results that show score degradation, always offer to route into the observe skill for diagnosis and optimization. Monitoring without action is incomplete.
- Handle agent-not-found. If `agent_get` returns a not-found error, stop the continuous eval flow. Offer to route to the deploy skill to create the agent first, or ask the user to verify the agent name and environment. (See the error-handling sketch after this list.)
- Handle auth and endpoint errors. If `agent_get` or `continuous_eval_create` returns a permission or authentication error, verify the project endpoint, environment, and user access. Do not suggest creating the agent — the issue is access, not existence.
- Validate `deploymentName` before enabling. Do not assume `gpt-4o` exists. If quality evaluators are selected, verify a chat-capable deployment is available in the project. If none exists, stop and explain that quality evaluators cannot be enabled until a compatible deployment is provisioned.
- Handle invalid evaluator names. If `continuous_eval_create` returns an invalid evaluator name error, call `evaluator_catalog_get` to list available evaluators and present valid options. Do not retry with the same arguments.
- Handle unexpected empty config. If `continuous_eval_get` returns an empty list for an agent the user believes has continuous eval configured, verify the agent name and project endpoint match the intended environment in `.foundry/agent-metadata.yaml`. The configuration may exist under a different environment or resolved `agentName`.
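The error-handling rules above reduce to a small pre-flight branch. A minimal sketch, assuming a hypothetical `ToolError` exception with a `code` attribute; the actual MCP error surface may differ:

```python
# Sketch: pre-flight checks before any continuous-eval operation.

class ToolError(Exception):
    """Assumed MCP-client error carrying a short error code."""
    def __init__(self, code: str, message: str = ""):
        super().__init__(message or code)
        self.code = code

def preflight(call_tool, endpoint: str, agent: str) -> dict:
    try:
        return call_tool("agent_get", projectEndpoint=endpoint, agentName=agent)
    except ToolError as err:
        if err.code == "not_found":
            # Existence problem: route to the deploy skill, don't continue.
            raise RuntimeError(f"Agent '{agent}' not found; deploy it first "
                               "or verify the agent name and environment.")
        if err.code in ("unauthorized", "forbidden"):
            # Access problem, not existence: never suggest creating the agent.
            raise RuntimeError("Permission error; check the project endpoint, "
                               "environment, and your access.")
        raise
```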
Operations
Check Current State
Before enabling or modifying, check what's already configured:
```
Tool: continuous_eval_get
Arguments:
projectEndpoint: <project endpoint>
agentName: <agent name>
```

- Empty list → no continuous eval configured. Proceed to Enable or Update.
- Non-empty list → agent already has continuous eval. Present the configuration and ask what the user wants to change.

⚠️ Empty result is not proof of absence. If the user expects a config to exist but the list is empty, verify the project endpoint and agent name match the intended environment before concluding it was never set up.
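In code, the create-or-update decision is a branch on the returned list. A sketch using the same hypothetical `call_tool` helper as above:

```python
# Sketch: classify the current state from the continuous_eval_get response.

def check_current_state(call_tool, endpoint: str, agent: str) -> str:
    configs = call_tool("continuous_eval_get",
                        projectEndpoint=endpoint, agentName=agent)
    if not configs:
        # Empty is not proof of absence: confirm endpoint and agent name first.
        return "enable"
    if len(configs) > 1:
        # e.g. one standard and one business scenario; ask which to target.
        return "choose"
    return "update"
```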
Enable or Update
Replace Semantics: `continuous_eval_create` always creates a new evaluation group with the provided evaluators and points the evaluation rule at it. Always pass the complete desired configuration on every call — omitted evaluators are dropped, not preserved.
⚠️ Do not assume `gpt-4o` exists. Before setting `deploymentName`, verify a chat-capable deployment is available in the project. If none exists, quality evaluators cannot be enabled — only safety evaluators (which do not require a deployment) will work.
```
Tool: continuous_eval_create
Arguments:
projectEndpoint: <project endpoint>
agentName: <agent name>
evaluatorNames: ["groundedness", "coherence", "fluency"] # Illustrative — align with your batch eval evaluators
deploymentName: "gpt-4o" # Required for quality evaluators
enabled: true # Set false to disable without deleting
```

Evaluator selection guidance:
- Quality evaluators (require `deploymentName`): coherence, fluency, relevance, groundedness, intentresolution, taskadherence, toolcallaccuracy
- Safety evaluators (no `deploymentName` needed): violence, sexual, selfharm, hateunfairness, indirectattack, codevulnerability, protected_material
- Custom evaluators from the project's evaluator catalog are also supported by name.
Optional parameters by agent kind:
| Parameter | Applies To | Description | Default |
|---|---|---|---|
| `samplingRate` | Prompt | Percentage of responses to evaluate (1-100) | All responses |
| `maxHourlyRuns` | Prompt | Cap on evaluation runs per hour | No limit |
| `intervalHours` | Hosted | Hours between evaluation runs | 1 |
| `maxTraces` | Hosted | Max data points per evaluation run | 1000 |
| `scenario` | Prompt | Evaluation scenario: `standard` (quality and safety metrics, default) or `business` (business success metrics). An agent can have one of each simultaneously. | `standard` |
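Because of the replace semantics, a request to "add one evaluator" must be sent as the complete desired list. A sketch under the same assumptions as the earlier snippets; recovering current names from the linked eval group's testing criteria is an assumption about the `evaluation_get` response shape:

```python
# Sketch: add an evaluator without silently dropping the existing ones.

def add_evaluator(call_tool, endpoint: str, agent: str,
                  new_evaluator: str, deployment: str):
    configs = call_tool("continuous_eval_get",
                        projectEndpoint=endpoint, agentName=agent)
    current: list[str] = []
    if configs:
        group = call_tool("evaluation_get", projectEndpoint=endpoint,
                          evalId=configs[0]["evalId"])
        # Assumption: evaluator names appear in the group's testing criteria.
        current = [c["name"] for c in group["testingCriteria"]]

    desired = sorted(set(current) | {new_evaluator})
    return call_tool("continuous_eval_create",
                     projectEndpoint=endpoint, agentName=agent,
                     evaluatorNames=desired,     # complete list, not a delta
                     deploymentName=deployment,  # required for quality evaluators
                     enabled=True)
```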
Disable
To temporarily disable without changing configuration, pass the configuration currently in use along with `enabled: false`. Because `continuous_eval_create` has replace semantics, omitting parameters will change the configuration when re-enabled. The `continuous_eval_get` response does not include evaluator names directly — they are stored in the linked evaluation group — so retrieve them via `evaluation_get` first. If multiple configurations are returned in the `continuous_eval_get` response, present the list to the user and ask which to target.
```
# Step 1: Get the evalId, then retrieve current evaluators from the eval group
Tool: continuous_eval_get
Arguments:
projectEndpoint: <project endpoint>
agentName: <agent name>
# Note the evalId from the response

Tool: evaluation_get
Arguments:
projectEndpoint: <project endpoint>
evalId: <evalId from above>
# Note the evaluator names from the evaluation group's testing criteria

# Step 2: Disable with the same evaluators
Tool: continuous_eval_create
Arguments:
projectEndpoint: <project endpoint>
agentName: <agent name>
evaluatorNames: ["groundedness", "coherence", "fluency"] # Must match current config
deploymentName: "gpt-4o"
enabled: false
```
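The same flow as a sketch: fetch the current evaluators, then replay them with `enabled: false` so nothing is dropped when the user re-enables later. Helper and response shapes are the same assumptions as in the earlier sketches.

```python
# Sketch: disable continuous eval while preserving the evaluator set.

def disable_continuous_eval(call_tool, endpoint: str, agent: str,
                            deployment: str):
    configs = call_tool("continuous_eval_get",
                        projectEndpoint=endpoint, agentName=agent)
    if not configs:
        return None  # nothing to disable
    config = configs[0]  # with multiple configs, ask the user which to target

    group = call_tool("evaluation_get", projectEndpoint=endpoint,
                      evalId=config["evalId"])
    evaluators = [c["name"] for c in group["testingCriteria"]]  # assumed shape

    return call_tool("continuous_eval_create",
                     projectEndpoint=endpoint, agentName=agent,
                     evaluatorNames=evaluators,  # must match current config
                     deploymentName=deployment,
                     enabled=False)
```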
Delete
To permanently remove continuous evaluation configuration:
```
Tool: continuous_eval_delete
Arguments:
projectEndpoint: <project endpoint>
configId: <id from continuous_eval_get>
agentName: <agent name>
```

Always call `continuous_eval_get` first to retrieve the `id` field of the configuration to delete. If multiple configurations are returned, present the list to the user and ask which to target.
Acting on Results
Continuous evaluation generates ongoing scores — but monitoring is only useful when you act on what it reveals. This section covers how to consume evaluation results and the remediation loop when scores degrade.
Step 1: Read Evaluation Scores
The `continuous_eval_get` response includes an `evalId` that links to the evaluation group. Use this to retrieve actual run results:

```
Tool: continuous_eval_get
Arguments:
projectEndpoint: <project endpoint>
agentName: <agent name>
# Note the evalId from the response

Tool: evaluation_get
Arguments:
projectEndpoint: <project endpoint>
evalId: <evalId from continuous_eval_get>
isRequestForRuns: true
# Returns evaluation runs with per-evaluator scores
```

Review the run results for score trends. Each run contains scores for every configured evaluator. Look for the following (a detection sketch follows the list):
- Scores below threshold — any evaluator consistently scoring below your acceptable baseline
- Score degradation over time — scores that were previously healthy but are trending downward
- Safety flags — any non-zero safety evaluator scores that indicate harmful content
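A minimal first pass at detecting these patterns programmatically: a threshold check plus a recent-vs-earlier average comparison. It assumes each run exposes a `scores` mapping of evaluator name to numeric score (an assumption about the `evaluation_get` run shape) and treats higher as better, which fits quality evaluators; safety evaluators invert this.

```python
# Sketch: flag quality evaluators that are below threshold or trending down.
# The {"scores": {evaluator: score}} run shape is assumed, not documented.

def find_regressions(runs: list[dict], threshold: float = 3.0,
                     window: int = 5) -> dict[str, str]:
    flags: dict[str, str] = {}
    evaluators = {name for run in runs for name in run["scores"]}
    for name in evaluators:
        series = [r["scores"][name] for r in runs if name in r["scores"]]
        recent = series[-window:]
        if sum(recent) / len(recent) < threshold:
            flags[name] = "below threshold"
        elif len(series) > window:
            earlier = series[:-window]
            if sum(recent) / len(recent) < sum(earlier) / len(earlier) - 0.5:
                flags[name] = "trending down"
    return flags
```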
Step 2: Triage the Regression
- Identify the failing evaluators. From the evaluation runs, note which specific evaluators are scoring low (e.g., `groundedness` dropping from 4.2 to 2.8).
- Correlate with traces. Use the trace skill to search App Insights for the conversations that triggered low scores. Look for patterns: specific query types, tool-call failures, or grounding gaps.
- Compare to baseline. If batch eval results exist in `.foundry/results/`, compare continuous eval scores against the last known-good batch run to determine whether this is a new regression or a pre-existing gap (see the comparison sketch after this list).
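For the baseline comparison, a sketch that diffs current continuous-eval averages against the most recent batch result; the `.foundry/results/` JSON layout shown is an assumption for illustration:

```python
# Sketch: delta between continuous-eval scores and the last batch baseline.
# Assumes baseline files store {"scores": {evaluator: score}}.

import json
from pathlib import Path

def compare_to_baseline(current: dict[str, float],
                        results_dir: str = ".foundry/results") -> dict[str, float]:
    baselines = sorted(Path(results_dir).glob("*.json"))
    if not baselines:
        return {}  # no batch baseline; nothing to compare against
    baseline = json.loads(baselines[-1].read_text())["scores"]
    # Negative delta = regression relative to the known-good run.
    return {name: round(current[name] - baseline[name], 2)
            for name in current if name in baseline}
```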
Step 3: Remediate via the Observe Loop
Once you understand the failure pattern, use the observe skill to fix it:
| Symptom | Action |
|---|---|
| Quality scores dropping (`coherence`, `relevance`, `taskadherence`) | Run Step 3: Analyze to cluster failures, then Step 4: Optimize to improve the prompt |
| Safety evaluators flagging (`violence`, `indirectattack`) | Review flagged traces via trace skill, then update agent instructions or tool definitions to address the pattern |
| Grounding failures | Check whether the agent's data sources are still accessible and returning expected results; update knowledge index or tool configuration |
| Scores fluctuating after a deploy | Run Step 5: Compare between the current and previous agent version to isolate the regression |
Step 4: Verify the Fix
After deploying a fix through the observe loop:
- Re-run a batch eval via observe Step 2 against the same test cases to confirm the fix.
- Read continuous eval scores from the next evaluation cycle using `evaluation_get` with the `evalId` — verify scores have recovered.
- Adjust evaluators if needed. If the regression exposed a gap in evaluator coverage, use `continuous_eval_create` to update the configuration with additional or refined evaluators.
💡 Tip: The continuous eval → observe → deploy → continuous eval cycle is the core production quality loop. Continuous eval detects; observe diagnoses and fixes; continuous eval verifies.
Response Format
All tools return a unified `ContinuousEvalConfig` shape. The get tool returns a list; create returns a single object.
| Field | Description | Present For |
|---|---|---|
| `id` | Configuration identifier (needed for delete) | All |
| `displayName` | Human-readable name | All |
| `enabled` | Whether evaluation is active | All |
| `evalId` | Linked evaluation group containing evaluator definitions | All |
| `agentName` | Target agent name | All |
| `status` | Provisioning status | Hosted only |
| `scenario` | Evaluation scenario (`standard` or `business`) | Prompt only |
| `samplingRate` | Percentage of responses evaluated | Prompt only |
| `maxHourlyRuns` | Cap on runs per hour | Prompt only |
| `intervalHours` | Hours between scheduled runs | Hosted only |
| `maxTraces` | Max data points per run | Hosted only |
| `createdAt` | Creation timestamp | All |
| `createdBy` | Creator identity | All |
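For reference, the table maps naturally onto a dataclass. A sketch only; field types are inferred from the descriptions above, not from a published schema:

```python
# Sketch: ContinuousEvalConfig with types inferred from the field table.

from dataclasses import dataclass

@dataclass
class ContinuousEvalConfig:
    id: str                           # needed for continuous_eval_delete
    displayName: str
    enabled: bool
    evalId: str                       # linked evaluation group
    agentName: str
    createdAt: str
    createdBy: str
    status: str | None = None         # hosted only: provisioning status
    scenario: str | None = None       # prompt only: "standard" or "business"
    samplingRate: int | None = None   # prompt only
    maxHourlyRuns: int | None = None  # prompt only
    intervalHours: int | None = None  # hosted only
    maxTraces: int | None = None      # hosted only
```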
Related Skills
| User Intent | Skill |
|---|---|
| "Evaluate my agent" / "Run a batch eval" | observe skill |
| "Scores are dropping" / "Diagnose and fix quality regression" | observe skill (Steps 3–5) |
| "Analyze production traces" / "Find flagged conversations" | trace skill |
| "Deploy my agent" / "Redeploy after fix" | deploy skill |