# Continuous Evaluation

Enable, configure, disable, or remove continuous evaluation for a Foundry agent. Continuous evaluation automatically assesses agent responses on an ongoing basis using configured evaluators (e.g., groundedness, coherence, violence detection). This is typically the final step in the [observe loop](../observe.md) after deploying and batch-evaluating an agent — it keeps production quality visible without manual intervention.

## When to Use This Skill

USE FOR: enable continuous evaluation, disable continuous evaluation, configure continuous eval, set up monitoring evaluators, check continuous eval status, delete continuous eval, update evaluators, change sampling rate, change eval interval, production monitoring, ongoing agent quality.

DO NOT USE FOR: running a one-off batch evaluation (use [observe](../observe.md)), querying traces (use [trace](../../trace/trace.md)), creating evaluator definitions (use [observe](../observe.md) Step 1).

## Quick Reference

| Property | Value |
|----------|-------|
| MCP server | `azure` |
| Key MCP tools | `continuous_eval_create`, `continuous_eval_get`, `continuous_eval_delete`, `agent_get`, `evaluation_get` |
| Prerequisite | Agent must exist in the project |
| Local cache | `.foundry/agent-metadata.yaml` |

## Entry Points

| User Intent | Start At |
|-------------|----------|
| "Enable continuous eval" / "Set up monitoring evaluators" | [Before Starting](#before-starting--detect-current-state) → [Enable or Update](#enable-or-update) |
| "Is continuous eval running?" / "Check eval status" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) |
| "Change evaluators" / "Update sampling rate" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) → [Enable or Update](#enable-or-update) |
| "Pause evaluations" / "Disable continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Disable](#disable) |
| "Stop evaluating this agent" / "Delete continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Delete](#delete) |
| "Scores are dropping" / "Act on monitoring results" | [Before Starting](#before-starting--detect-current-state) → [Acting on Results](#acting-on-results) |

> ⚠️ **Important:** Always run [Before Starting](#before-starting--detect-current-state) to resolve the project endpoint and agent name before calling any MCP tools.

## Before Starting — Detect Current State

1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml` using the [Project Context Resolution](../../../SKILL.md#agent-project-context-resolution) workflow.
2. Extract `projectEndpoint` and `agentName` from the selected environment. If not available in metadata, use `ask_user` to collect them.
3. Use `agent_get` to verify the agent exists and note its kind (prompt or hosted).
4. Use `continuous_eval_get` to check for existing continuous evaluation configuration.
5. Jump to the appropriate entry point based on user intent.

## How It Works

The tool auto-detects the agent's kind and uses the appropriate backend:

- **Prompt agents** — evaluation runs are triggered automatically each time the agent produces a response. Parameters: `samplingRate` (percentage of responses to evaluate), `maxHourlyRuns`.
- **Hosted agents** — evaluation runs are triggered on an hourly schedule, pulling recent traces from App Insights. Parameters: `intervalHours` (hours between runs), `maxTraces` (max data points per run).

The user does not need to choose between these — the tool handles it based on agent kind.
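A hedged sketch of how the kind-specific arguments differ (names come from the **Optional parameters by agent kind** table in [Enable or Update](#enable-or-update); the values are illustrative, not recommendations):

```yaml
# Illustrative only. The remaining create arguments (evaluators, deployment)
# are the same for both kinds.

# Prompt agent: per-response evaluation
samplingRate: 25      # evaluate 25% of responses
maxHourlyRuns: 50     # cap evaluation runs per hour

# Hosted agent: scheduled evaluation over recent App Insights traces
intervalHours: 2      # run every 2 hours
maxTraces: 500        # cap data points per run
```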
## Behavioral Rules

1. **Always resolve context first.** Run [Before Starting](#before-starting--detect-current-state) before calling any MCP tool. Never assume a project endpoint or agent name.
2. **Check before creating.** Always call `continuous_eval_get` before `continuous_eval_create` to determine whether to create or update. Present existing configuration to the user.
3. **Confirm evaluator selection.** Present the evaluator list to the user before enabling. Distinguish quality evaluators (require `deploymentName`) from safety evaluators (do not).
4. **Prompt for next steps.** After each operation, present options. Never assume the path forward (e.g., after enabling, offer to check status or adjust parameters).
5. **Keep context visible.** Include the project endpoint, agent name, and environment in operation summaries.
6. **Use `continuous_eval_get` for IDs.** The `delete` tool requires a `configId` — always retrieve it from the `get` response rather than asking the user to provide it.
7. **Surface the remediation path.** When presenting continuous eval results that show score degradation, always offer to route into the [observe skill](../observe.md) for diagnosis and optimization. Monitoring without action is incomplete.
8. **Handle agent-not-found.** If `agent_get` returns a not-found error, stop the continuous eval flow. Offer to route to the [deploy skill](../../deploy/deploy.md) to create the agent first, or ask the user to verify the agent name and environment.
9. **Handle auth and endpoint errors.** If `agent_get` or `continuous_eval_create` returns a permission or authentication error, verify the project endpoint, environment, and user access. Do not suggest creating the agent — the issue is access, not existence.
10. **Validate `deploymentName` before enabling.** Do not assume `gpt-4o` exists. If quality evaluators are selected, verify a chat-capable deployment is available in the project. If none exists, stop and explain that quality evaluators cannot be enabled until a compatible deployment is provisioned.
11. **Handle invalid evaluator names.** If `continuous_eval_create` returns an invalid evaluator name error, call `evaluator_catalog_get` to list available evaluators and present valid options. Do not retry with the same arguments.
12. **Handle unexpected empty config.** If `continuous_eval_get` returns an empty list for an agent the user believes has continuous eval configured, verify the agent name and project endpoint match the intended environment in `.foundry/agent-metadata.yaml`. The configuration may exist under a different environment or resolved `agentName`.

## Operations

### Check Current State

Before enabling or modifying, check what's already configured:

```yaml
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
```

- Empty list → no continuous eval configured. Proceed to [Enable or Update](#enable-or-update).
- Non-empty list → agent already has continuous eval. Present the configuration and ask what the user wants to change.

> ⚠️ **Empty result is not proof of absence.** If the user expects a config to exist but the list is empty, verify the project endpoint and agent name match the intended environment before concluding it was never set up.
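When the list is non-empty, each entry is a `ContinuousEvalConfig` (see [Response Format](#response-format)). A hedged sketch of what one prompt-agent entry might contain; the values here are placeholders, not real output:

```yaml
- id: <configuration id>            # keep this for continuous_eval_delete
  displayName: <human-readable name>
  enabled: true
  evalId: <linked evaluation group>
  agentName: <agent name>
  scenario: standard
  samplingRate: 25
  maxHourlyRuns: 50
```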
### Enable or Update

**Replace Semantics**: `continuous_eval_create` always creates a new evaluation group with the provided evaluators and points the evaluation rule at it. Always pass the complete desired configuration on every call — omitted evaluators are dropped, not preserved.

> ⚠️ **Do not assume `gpt-4o` exists.** Before setting `deploymentName`, verify a chat-capable deployment is available in the project. If none exists, quality evaluators cannot be enabled — only safety evaluators (which do not require a deployment) will work.

```yaml
Tool: continuous_eval_create
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
  evaluatorNames: ["groundedness", "coherence", "fluency"] # Illustrative — align with your batch eval evaluators
  deploymentName: "gpt-4o" # Required for quality evaluators
  enabled: true # Set false to disable without deleting
```

**Evaluator selection guidance:**
- **Quality evaluators** (require `deploymentName`): coherence, fluency, relevance, groundedness, intent_resolution, task_adherence, tool_call_accuracy
- **Safety evaluators** (no `deploymentName` needed): violence, sexual, self_harm, hate_unfairness, indirect_attack, code_vulnerability, protected_material
- Custom evaluators from the project's evaluator catalog are also supported by name.

**Optional parameters by agent kind:**

| Parameter | Applies To | Description | Default |
|-----------|-----------|-------------|---------|
| `samplingRate` | Prompt | Percentage of responses to evaluate (1-100) | All responses |
| `maxHourlyRuns` | Prompt | Cap on evaluation runs per hour | No limit |
| `intervalHours` | Hosted | Hours between evaluation runs | 1 |
| `maxTraces` | Hosted | Max data points per evaluation run | 1000 |
| `scenario` | Prompt | Evaluation scenario: `standard` (quality and safety metrics, default) or `business` (business success metrics). An agent can have one of each simultaneously. | `standard` |

### Disable

To temporarily disable without changing configuration, pass the configuration currently in use along with `enabled: false`. Because `continuous_eval_create` has replace semantics, omitting parameters will change the configuration when re-enabled. The `continuous_eval_get` response does not include evaluator names directly — they are stored in the linked evaluation group — so retrieve them via `evaluation_get` first. If multiple configurations are returned in the `continuous_eval_get` response, present the list to the user and ask which to target.

```yaml
# Step 1: Get the evalId, then retrieve current evaluators from the eval group
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
# Note the evalId from the response
```

```yaml
Tool: evaluation_get
Arguments:
  projectEndpoint: <project endpoint>
  evalId: <evalId from above>
# Note the evaluator names from the evaluation group's testing criteria
```

```yaml
# Step 2: Disable with the same evaluators
Tool: continuous_eval_create
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
  evaluatorNames: ["groundedness", "coherence", "fluency"] # Must match current config
  deploymentName: "gpt-4o"
  enabled: false
```

### Delete

To permanently remove continuous evaluation configuration:

```yaml
Tool: continuous_eval_delete
Arguments:
  projectEndpoint: <project endpoint>
  configId: <id from continuous_eval_get>
  agentName: <agent name>
```

Always call `continuous_eval_get` first to retrieve the `id` field of the configuration to delete. If multiple configurations are returned, present the list to the user and ask which to target.

## Acting on Results

Continuous evaluation generates ongoing scores — but monitoring is only useful when you **act** on what it reveals. This section covers how to consume evaluation results and the remediation loop when scores degrade.

### Step 1: Read Evaluation Scores

The `continuous_eval_get` response includes an `evalId` that links to the evaluation group. Use this to retrieve actual run results:

```yaml
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <project endpoint>
  agentName: <agent name>
# Note the evalId from the response
```

```yaml
Tool: evaluation_get
Arguments:
  projectEndpoint: <project endpoint>
  evalId: <evalId from continuous_eval_get>
  isRequestForRuns: true
# Returns evaluation runs with per-evaluator scores
```

Review the run results for score trends. Each run contains scores for every configured evaluator. Look for:
- **Scores below threshold** — any evaluator consistently scoring below your acceptable baseline
- **Score degradation over time** — scores that were previously healthy but are trending downward
- **Safety flags** — any non-zero safety evaluator scores that indicate harmful content
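As a hedged illustration of a degrading trend (the run payload shape is not defined in this skill, so the field names below are placeholders; the per-evaluator scores are what matter, and the groundedness drop mirrors the 4.2 → 2.8 example in Step 2):

```yaml
# Illustrative only; not the actual evaluation run schema
runs:
  - completedAt: <day 1>
    scores: { groundedness: 4.2, coherence: 4.5, violence: 0 }
  - completedAt: <day 2>
    scores: { groundedness: 3.6, coherence: 4.4, violence: 0 }
  - completedAt: <day 3>
    scores: { groundedness: 2.8, coherence: 4.5, violence: 0 }
# groundedness trends down while coherence holds steady and no safety flags
# appear: this points to a grounding regression, not a general quality drop.
```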
### Step 2: Triage the Regression

1. **Identify the failing evaluators.** From the evaluation runs, note which specific evaluators are scoring low (e.g., `groundedness` dropping from 4.2 to 2.8).
2. **Correlate with traces.** Use the [trace skill](../../trace/trace.md) to search App Insights for the conversations that triggered low scores. Look for patterns: specific query types, tool-call failures, or grounding gaps.
3. **Compare to baseline.** If batch eval results exist in `.foundry/results/`, compare continuous eval scores against the last known-good batch run to determine whether this is a new regression or a pre-existing gap.

### Step 3: Remediate via the Observe Loop

Once you understand the failure pattern, use the [observe skill](../observe.md) to fix it:

| Symptom | Action |
|---------|--------|
| Quality scores dropping (coherence, relevance, task_adherence) | Run [Step 3: Analyze](analyze-results.md) to cluster failures, then [Step 4: Optimize](optimize-deploy.md) to improve the prompt |
| Safety evaluators flagging (violence, indirect_attack) | Review flagged traces via [trace skill](../../trace/trace.md), then update agent instructions or tool definitions to address the pattern |
| Grounding failures | Check whether the agent's data sources are still accessible and returning expected results; update knowledge index or tool configuration |
| Scores fluctuating after a deploy | Run [Step 5: Compare](compare-iterate.md) between the current and previous agent version to isolate the regression |

### Step 4: Verify the Fix

After deploying a fix through the observe loop:

1. **Re-run a batch eval** via [observe](../observe.md) Step 2 against the same test cases to confirm the fix.
2. **Read continuous eval scores** from the next evaluation cycle using `evaluation_get` with the `evalId` — verify scores have recovered.
3. **Adjust evaluators if needed.** If the regression exposed a gap in evaluator coverage, use `continuous_eval_create` to update the configuration with additional or refined evaluators.

> 💡 **Tip:** The continuous eval → observe → deploy → continuous eval cycle is the core production quality loop. Continuous eval detects; observe diagnoses and fixes; continuous eval verifies.

## Response Format

All tools return a unified `ContinuousEvalConfig` shape. The `get` tool returns a list; `create` returns a single object.

| Field | Description | Present For |
|-------|-------------|-------------|
| `id` | Configuration identifier (needed for delete) | All |
| `displayName` | Human-readable name | All |
| `enabled` | Whether evaluation is active | All |
| `evalId` | Linked evaluation group containing evaluator definitions | All |
| `agentName` | Target agent name | All |
| `status` | Provisioning status | Hosted only |
| `scenario` | Evaluation scenario (`standard` or `business`) | Prompt only |
| `samplingRate` | Percentage of responses evaluated | Prompt only |
| `maxHourlyRuns` | Cap on runs per hour | Prompt only |
| `intervalHours` | Hours between scheduled runs | Hosted only |
| `maxTraces` | Max data points per run | Hosted only |
| `createdAt` | Creation timestamp | All |
| `createdBy` | Creator identity | All |
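Complementing the prompt-agent sketch in [Check Current State](#check-current-state), a hedged example of a hosted-agent entry (placeholder values, fields taken from the table above):

```yaml
id: <configuration id>
displayName: <human-readable name>
enabled: true
evalId: <linked evaluation group>
agentName: <agent name>
status: <provisioning status>
intervalHours: 1
maxTraces: 1000
createdAt: <creation timestamp>
createdBy: <creator identity>
```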
## Related Skills

| User Intent | Skill |
|-------------|-------|
| "Evaluate my agent" / "Run a batch eval" | [observe skill](../observe.md) |
| "Scores are dropping" / "Diagnose and fix quality regression" | [observe skill](../observe.md) (Steps 3–5) |
| "Analyze production traces" / "Find flagged conversations" | [trace skill](../../trace/trace.md) |
| "Deploy my agent" / "Redeploy after fix" | [deploy skill](../../deploy/deploy.md) |