Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/advanced-evaluation/references/evaluation-pipeline.md
# Evaluation Pipeline Diagram

Visual layout of a production evaluation pipeline.

```
┌───────────────────────────────────────────────────┐
│                Evaluation Pipeline                │
├───────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                          │
│  │   Criteria Loader   │ ◄── Rubrics, weights     │
│  └──────────┬──────────┘                          │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                          │
│  │   Primary Scorer    │ ◄── Direct or Pairwise   │
│  └──────────┬──────────┘                          │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                          │
│  │   Bias Mitigation   │ ◄── Position swap, etc.  │
│  └──────────┬──────────┘                          │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                          │
│  │ Confidence Scoring  │ ◄── Calibration          │
│  └──────────┬──────────┘                          │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└───────────────────────────────────────────────────┘
```

## Pipeline Stages

1. **Criteria Loader**: Loads rubrics and criterion weights from configuration
2. **Primary Scorer**: Applies direct scoring or pairwise comparison
3. **Bias Mitigation**: Runs position swaps, length normalization, and other debiasing
4. **Confidence Scoring**: Calibrates confidence based on position consistency and evidence strength
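
The four stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the function names, the `PairwiseResult` type, the judge signature, and the confidence values (1.0 when both orderings agree, 0.5 otherwise) are all hypothetical. Only position consistency feeds the confidence here; evidence strength and length normalization are left out.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge signature: (prompt, first, second) -> "first" or "second".
# In practice this would wrap an LLM call; here it is any plain callable.
PairwiseJudge = Callable[[str, str, str], str]


@dataclass
class PairwiseResult:
    winner: str        # "A", "B", or "tie"
    confidence: float  # illustrative values only, not a calibrated probability
    consistent: bool   # True when both orderings produced the same winner


def load_criteria(config: dict[str, float]) -> dict[str, float]:
    """Criteria Loader: normalize criterion weights so they sum to 1."""
    total = sum(config.values())
    if total <= 0:
        raise ValueError("criterion weights must sum to a positive value")
    return {name: weight / total for name, weight in config.items()}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Primary Scorer (direct mode): combine per-criterion scores by weight."""
    return sum(weights[name] * scores[name] for name in weights)


def pairwise_with_swap(judge: PairwiseJudge, prompt: str, a: str, b: str) -> PairwiseResult:
    """Primary Scorer (pairwise mode) with a position swap for bias mitigation.

    The judge runs twice with the response order reversed. Agreement across
    both orderings yields a confident winner; disagreement is reported as a
    low-confidence tie.
    """
    first_pass = judge(prompt, a, b)   # A shown in the first position
    second_pass = judge(prompt, b, a)  # B shown in the first position
    vote_1 = "A" if first_pass == "first" else "B"
    vote_2 = "B" if second_pass == "first" else "A"
    if vote_1 == vote_2:
        return PairwiseResult(winner=vote_1, confidence=1.0, consistent=True)
    return PairwiseResult(winner="tie", confidence=0.5, consistent=False)


# Toy judges for demonstration only.
def longer_wins(prompt: str, first: str, second: str) -> str:
    """Position-independent toy judge: prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


def always_first(prompt: str, first: str, second: str) -> str:
    """Fully position-biased toy judge: always picks the first slot."""
    return "first"


if __name__ == "__main__":
    weights = load_criteria({"accuracy": 3, "clarity": 1})
    print(weighted_score({"accuracy": 4, "clarity": 2}, weights))
    print(pairwise_with_swap(longer_wins, "q", "a long answer", "short"))
    print(pairwise_with_swap(always_first, "q", "x", "y"))
```

With the position-biased `always_first` judge, the two orderings disagree, so the sketch reports a tie with reduced confidence instead of silently rewarding whichever response happened to appear first.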