Eval Lineage — Full Traceability from Production to Deployment
Track the complete chain from production traces through dataset creation, evaluation runs, comparisons, and deployment decisions. This lineage supports "why was this deployed?" audit queries and compliance reporting.
Lineage Chain
Production Trace (App Insights)
│ conversationId, responseId
▼
Dataset Version (.foundry/datasets/*.jsonl, environment-scoped)
│ metadata.conversationId, metadata.harvestRule
▼
Evaluation Run (evaluation_agent_batch_eval_create)
│ evaluationId when creating, evalId when querying, evalRunId
▼
Comparison (evaluation_comparison_create)
│ insightId, baselineRunId, treatmentRunIds
▼
Deployment Decision (agent_update)
│ agentVersion
▼
Production Trace (cycle repeats)
Lineage Manifest
Track lineage in .foundry/datasets/manifest.json:
{
  "datasets": [
    {
      "name": "support-bot-prod-traces",
      "file": "support-bot-prod-traces-v3.jsonl",
      "version": "v3",
      "tag": "prod",
      "source": "trace-harvest",
      "harvestRule": "error+latency",
      "timeRange": "2025-02-01 to 2025-02-07",
      "exampleCount": 63,
      "createdAt": "2025-02-08T10:00:00Z",
      "evalRuns": [
        {
          "evalId": "eval-group-001",
          "runId": "run-abc-123",
          "agentVersion": "3",
          "date": "2025-02-08T12:00:00Z",
          "status": "completed"
        },
        {
          "evalId": "eval-group-001",
          "runId": "run-def-456",
          "agentVersion": "4",
          "date": "2025-02-10T09:00:00Z",
          "status": "completed"
        }
      ],
      "comparisons": [
        {
          "insightId": "insight-xyz-789",
          "baselineRunId": "run-abc-123",
          "treatmentRunIds": ["run-def-456"],
          "result": "v4 improved on 3/5 metrics",
          "date": "2025-02-10T10:00:00Z"
        }
      ],
      "deployments": [
        {
          "agentVersion": "4",
          "deployedAt": "2025-02-10T14:00:00Z",
          "reason": "v4 improved coherence +25%, relevance +10% vs v3"
        }
      ]
    }
  ]
}
Audit Queries
"Why was version X deployed?"
- Read .foundry/datasets/manifest.json
- Find entries where deployments[].agentVersion == X
- Show the comparison that justified the deployment
- Show the dataset and eval runs that informed the comparison
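The steps above can be sketched against the manifest layout shown earlier. This is a minimal illustration, not a definitive implementation: the function name `explain_deployment` is hypothetical, and it assumes that a comparison "justifies" a deployment when one of its `treatmentRunIds` belongs to an eval run of the deployed agent version. The manifest itself does not record that link explicitly.

```python
import json
from pathlib import Path

# Manifest location as given in this document.
MANIFEST_PATH = Path(".foundry/datasets/manifest.json")


def load_manifest(path=MANIFEST_PATH):
    """Read the lineage manifest from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def explain_deployment(manifest, agent_version):
    """Collect the deployments of agent_version plus the runs and comparisons behind them.

    Hypothetical helper. Assumption: a comparison supports the deployment
    when one of its treatment runs evaluated the deployed agent version.
    """
    findings = []
    for dataset in manifest.get("datasets", []):
        deployments = [
            d for d in dataset.get("deployments", [])
            if d.get("agentVersion") == agent_version
        ]
        if not deployments:
            continue
        # Eval runs of the deployed version on this dataset.
        runs = [
            r for r in dataset.get("evalRuns", [])
            if r.get("agentVersion") == agent_version
        ]
        run_ids = {r.get("runId") for r in runs}
        # Comparisons whose treatment side includes one of those runs.
        comparisons = [
            c for c in dataset.get("comparisons", [])
            if run_ids & set(c.get("treatmentRunIds", []))
        ]
        findings.append({
            "dataset": dataset.get("name"),
            "datasetVersion": dataset.get("version"),
            "deployments": deployments,
            "evalRuns": runs,
            "comparisons": comparisons,
        })
    return findings


# Trimmed copy of the example manifest, used only for illustration.
EXAMPLE = {
    "datasets": [{
        "name": "support-bot-prod-traces",
        "version": "v3",
        "evalRuns": [
            {"evalId": "eval-group-001", "runId": "run-abc-123", "agentVersion": "3"},
            {"evalId": "eval-group-001", "runId": "run-def-456", "agentVersion": "4"},
        ],
        "comparisons": [{
            "insightId": "insight-xyz-789",
            "baselineRunId": "run-abc-123",
            "treatmentRunIds": ["run-def-456"],
        }],
        "deployments": [{"agentVersion": "4", "deployedAt": "2025-02-10T14:00:00Z"}],
    }]
}

report = explain_deployment(EXAMPLE, "4")
```

In real use, `explain_deployment(load_manifest(), "4")` would read the manifest from disk. An empty list means no dataset entry recorded a deployment of that version.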
"What traces led to this dataset?"
- Read the dataset JSONL file
- Extract metadata.conversationId from each example
- Look up each conversation in App Insights using the trace skill
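The extraction step could look like the sketch below. It assumes each line of the dataset file is a JSON object with a `metadata.conversationId` field, as the lineage chain suggests. The function name `conversation_ids` is hypothetical, and the App Insights lookup itself is left to the trace skill.

```python
import json


def conversation_ids(jsonl_path):
    """Return the distinct metadata.conversationId values in first-seen order.

    Hypothetical helper. Examples without a conversationId are skipped
    rather than treated as errors.
    """
    seen = set()
    ordered = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            example = json.loads(line)
            conv_id = (example.get("metadata") or {}).get("conversationId")
            if conv_id and conv_id not in seen:
                seen.add(conv_id)
                ordered.append(conv_id)
    return ordered
```

Deduplicating is a design choice: several examples may come from the same conversation, and each conversation only needs to be looked up once.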
"What evaluation history does this agent have?"
- Use evaluation_get to list all evaluation groups
- For each group, list runs with isRequestForRuns=true
- Build the timeline from Eval Trending
- Show comparisons from evaluation_comparison_get
"Did this dataset version catch any regressions?"
- Find the dataset version in the manifest
- Check evalRuns for runs that used this dataset
- Check comparisons for any regression results
- Cross-reference with tag == "regression-<date>" entries
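A rough sketch of this check against the manifest follows. Two assumptions are made here and are not part of the manifest format: the `result` field is free text, so a regression is detected by a simple keyword match on "regress", and a regression tag is recognized by the `regression-` prefix. The function name `regression_evidence` is hypothetical.

```python
def regression_evidence(manifest, name, version):
    """Summarize regression signals recorded for one dataset version.

    Hypothetical helper. Returns None when the dataset version is not
    in the manifest.
    """
    for dataset in manifest.get("datasets", []):
        if dataset.get("name") != name or dataset.get("version") != version:
            continue
        # Heuristic: 'result' is free text, so match on a keyword.
        flagged = [
            c for c in dataset.get("comparisons", [])
            if "regress" in str(c.get("result", "")).lower()
        ]
        return {
            "runIds": [r.get("runId") for r in dataset.get("evalRuns", [])],
            "regressionComparisons": flagged,
            "taggedAsRegression": str(dataset.get("tag", "")).startswith("regression-"),
        }
    return None
```

If comparison results are later stored in a structured form (for example a per-metric delta), the keyword match should be replaced with a check on that structure.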
Maintaining Lineage
Update .foundry/datasets/manifest.json at each step:
| Event | Fields to Update |
|---|---|
| Dataset created | Add new entry with name, version, source, exampleCount |
| Evaluation run | Append to evalRuns[] with evalId, runId, agentVersion |
| Comparison | Append to comparisons[] with insightId, result |
| Deployment | Append to deployments[] with agentVersion, reason |
| Tag change | Update tag field |
💡 Tip: Store the evaluation group identifier as evalId in lineage/manifest records, even if the create call used the parameter name evaluationId.
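One way to apply the "Evaluation run" row and the tip together is sketched below. The helper names `normalize_eval_id` and `record_eval_run` are hypothetical, and the shape of the `run` dict is assumed to mirror the `evalRuns[]` entries in the example manifest.

```python
def normalize_eval_id(record):
    """Return a copy that stores the evaluation group identifier under 'evalId'."""
    normalized = dict(record)
    if "evaluationId" in normalized:
        group_id = normalized.pop("evaluationId")
        # Keep an existing evalId if both keys were supplied.
        normalized.setdefault("evalId", group_id)
    return normalized


def record_eval_run(manifest, dataset_name, dataset_version, run):
    """Append a run to the matching dataset entry's evalRuns[] list."""
    for dataset in manifest.get("datasets", []):
        if dataset.get("name") == dataset_name and dataset.get("version") == dataset_version:
            dataset.setdefault("evalRuns", []).append(normalize_eval_id(run))
            return dataset
    raise KeyError(f"no manifest entry for {dataset_name} {dataset_version}")
```

The function mutates the in-memory manifest only. Writing it back (for example with `json.dump(manifest, f, indent=2)`) is a separate step, so a failed update never leaves a half-written file.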
Next Steps
- View metric trends → Eval Trending
- Check for regressions → Eval Regression
- Harvest new traces → Trace-to-Dataset (start the next cycle)