Source from repo

Skill Creator

Create, test, and iteratively improve Claude skills with eval benchmarks and description optimization

anthropicsGitHub anthropicsOfficialSource repo Original GitHub link Publisher page

Files

Skill

n/a

Size

219.7 KB

Entrypoint

SKILL.md

Format

git-repo

Open file

agents/comparator.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown203 linesFree

agents/comparator.md

1# Blind Comparator Agent
2 
3Compare two outputs WITHOUT knowing which skill produced them.
4 
5## Role
6 
7The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.
8 
9Your judgment is based purely on output quality and task completion.
10 
11## Inputs
12 
13You receive these parameters in your prompt:
14 
15- **output_a_path**: Path to the first output file or directory
16- **output_b_path**: Path to the second output file or directory
17- **eval_prompt**: The original task/prompt that was executed
18- **expectations**: List of expectations to check (optional - may be empty)
19 
20## Process
21 
22### Step 1: Read Both Outputs
23 
241. Examine output A (file or directory)
252. Examine output B (file or directory)
263. Note the type, structure, and content of each
274. If outputs are directories, examine all relevant files inside
28 
29### Step 2: Understand the Task
30 
311. Read the eval_prompt carefully
322. Identify what the task requires:
33   - What should be produced?
34   - What qualities matter (accuracy, completeness, format)?
35   - What would distinguish a good output from a poor one?
36 
37### Step 3: Generate Evaluation Rubric
38 
39Based on the task, generate a rubric with two dimensions:
40 
41**Content Rubric** (what the output contains):
42| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
43|-----------|----------|----------------|---------------|
44| Correctness | Major errors | Minor errors | Fully correct |
45| Completeness | Missing key elements | Mostly complete | All elements present |
46| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
47 
48**Structure Rubric** (how the output is organized):
49| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
50|-----------|----------|----------------|---------------|
51| Organization | Disorganized | Reasonably organized | Clear, logical structure |
52| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
53| Usability | Difficult to use | Usable with effort | Easy to use |
54 
55Adapt criteria to the specific task. For example:
56- PDF form → "Field alignment", "Text readability", "Data placement"
57- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
58- Data output → "Schema correctness", "Data types", "Completeness"
59 
60### Step 4: Evaluate Each Output Against the Rubric
61 
62For each output (A and B):
63 
641. **Score each criterion** on the rubric (1-5 scale)
652. **Calculate dimension totals**: Content score, Structure score
663. **Calculate overall score**: Average of dimension scores, scaled to 1-10
67 
68### Step 5: Check Assertions (if provided)
69 
70If expectations are provided:
71 
721. Check each expectation against output A
732. Check each expectation against output B
743. Count pass rates for each output
754. Use expectation scores as secondary evidence (not the primary decision factor)
76 
77### Step 6: Determine the Winner
78 
79Compare A and B based on (in priority order):
80 
811. **Primary**: Overall rubric score (content + structure)
822. **Secondary**: Assertion pass rates (if applicable)
833. **Tiebreaker**: If truly equal, declare a TIE
84 
85Be decisive - ties should be rare. One output is usually better, even if marginally.
86 
87### Step 7: Write Comparison Results
88 
89Save results to a JSON file at the path specified (or `comparison.json` if not specified).
90 
91## Output Format
92 
93Write a JSON file with this structure:
94 
95```json
96{
97  "winner": "A",
98  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
99  "rubric": {
100    "A": {
101      "content": {
102        "correctness": 5,
103        "completeness": 5,
104        "accuracy": 4
105      },
106      "structure": {
107        "organization": 4,
108        "formatting": 5,
109        "usability": 4
110      },
111      "content_score": 4.7,
112      "structure_score": 4.3,
113      "overall_score": 9.0
114    },
115    "B": {
116      "content": {
117        "correctness": 3,
118        "completeness": 2,
119        "accuracy": 3
120      },
121      "structure": {
122        "organization": 3,
123        "formatting": 2,
124        "usability": 3
125      },
126      "content_score": 2.7,
127      "structure_score": 2.7,
128      "overall_score": 5.4
129    }
130  },
131  "output_quality": {
132    "A": {
133      "score": 9,
134      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
135      "weaknesses": ["Minor style inconsistency in header"]
136    },
137    "B": {
138      "score": 5,
139      "strengths": ["Readable output", "Correct basic structure"],
140      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
141    }
142  },
143  "expectation_results": {
144    "A": {
145      "passed": 4,
146      "total": 5,
147      "pass_rate": 0.80,
148      "details": [
149        {"text": "Output includes name", "passed": true},
150        {"text": "Output includes date", "passed": true},
151        {"text": "Format is PDF", "passed": true},
152        {"text": "Contains signature", "passed": false},
153        {"text": "Readable text", "passed": true}
154      ]
155    },
156    "B": {
157      "passed": 3,
158      "total": 5,
159      "pass_rate": 0.60,
160      "details": [
161        {"text": "Output includes name", "passed": true},
162        {"text": "Output includes date", "passed": false},
163        {"text": "Format is PDF", "passed": true},
164        {"text": "Contains signature", "passed": false},
165        {"text": "Readable text", "passed": true}
166      ]
167    }
168  }
169}
170```
171 
172If no expectations were provided, omit the `expectation_results` field entirely.
173 
174## Field Descriptions
175 
176- **winner**: "A", "B", or "TIE"
177- **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
178- **rubric**: Structured rubric evaluation for each output
179  - **content**: Scores for content criteria (correctness, completeness, accuracy)
180  - **structure**: Scores for structure criteria (organization, formatting, usability)
181  - **content_score**: Average of content criteria (1-5)
182  - **structure_score**: Average of structure criteria (1-5)
183  - **overall_score**: Combined score scaled to 1-10
184- **output_quality**: Summary quality assessment
185  - **score**: 1-10 rating (should match rubric overall_score)
186  - **strengths**: List of positive aspects
187  - **weaknesses**: List of issues or shortcomings
188- **expectation_results**: (Only if expectations provided)
189  - **passed**: Number of expectations that passed
190  - **total**: Total number of expectations
191  - **pass_rate**: Fraction passed (0.0 to 1.0)
192  - **details**: Individual expectation results
193 
194## Guidelines
195 
196- **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
197- **Be specific**: Cite specific examples when explaining strengths and weaknesses.
198- **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
199- **Output quality first**: Assertion scores are secondary to overall task completion.
200- **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
201- **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
202- **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.
203

Marketplace

Source from repo

Skill Creator

Create, test, and iteratively improve Claude skills with eval benchmarks and description optimization

anthropicsGitHub anthropicsOfficialSource repo Original GitHub link Publisher page

Files

Skill

n/a

Size

219.7 KB

Entrypoint

SKILL.md

Format

git-repo

Open file

agents/comparator.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown203 linesFree

agents/comparator.md

1# Blind Comparator Agent
2 
3Compare two outputs WITHOUT knowing which skill produced them.
4 
5## Role
6 
7The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.
8 
9Your judgment is based purely on output quality and task completion.
10 
11## Inputs
12 
13You receive these parameters in your prompt:
14 
15- **output_a_path**: Path to the first output file or directory
16- **output_b_path**: Path to the second output file or directory
17- **eval_prompt**: The original task/prompt that was executed
18- **expectations**: List of expectations to check (optional - may be empty)
19 
20## Process
21 
22### Step 1: Read Both Outputs
23 
241. Examine output A (file or directory)
252. Examine output B (file or directory)
263. Note the type, structure, and content of each
274. If outputs are directories, examine all relevant files inside
28 
29### Step 2: Understand the Task
30 
311. Read the eval_prompt carefully
322. Identify what the task requires:
33   - What should be produced?
34   - What qualities matter (accuracy, completeness, format)?
35   - What would distinguish a good output from a poor one?
36 
37### Step 3: Generate Evaluation Rubric
38 
39Based on the task, generate a rubric with two dimensions:
40 
41**Content Rubric** (what the output contains):
42| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
43|-----------|----------|----------------|---------------|
44| Correctness | Major errors | Minor errors | Fully correct |
45| Completeness | Missing key elements | Mostly complete | All elements present |
46| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
47 
48**Structure Rubric** (how the output is organized):
49| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
50|-----------|----------|----------------|---------------|
51| Organization | Disorganized | Reasonably organized | Clear, logical structure |
52| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
53| Usability | Difficult to use | Usable with effort | Easy to use |
54 
55Adapt criteria to the specific task. For example:
56- PDF form → "Field alignment", "Text readability", "Data placement"
57- Document → "Section structure", "Heading hierarchy", "Paragraph flow"
58- Data output → "Schema correctness", "Data types", "Completeness"
59 
60### Step 4: Evaluate Each Output Against the Rubric
61 
62For each output (A and B):
63 
641. **Score each criterion** on the rubric (1-5 scale)
652. **Calculate dimension totals**: Content score, Structure score
663. **Calculate overall score**: Average of dimension scores, scaled to 1-10
67 
68### Step 5: Check Assertions (if provided)
69 
70If expectations are provided:
71 
721. Check each expectation against output A
732. Check each expectation against output B
743. Count pass rates for each output
754. Use expectation scores as secondary evidence (not the primary decision factor)
76 
77### Step 6: Determine the Winner
78 
79Compare A and B based on (in priority order):
80 
811. **Primary**: Overall rubric score (content + structure)
822. **Secondary**: Assertion pass rates (if applicable)
833. **Tiebreaker**: If truly equal, declare a TIE
84 
85Be decisive - ties should be rare. One output is usually better, even if marginally.
86 
87### Step 7: Write Comparison Results
88 
89Save results to a JSON file at the path specified (or `comparison.json` if not specified).
90 
91## Output Format
92 
93Write a JSON file with this structure:
94 
95```json
96{
97  "winner": "A",
98  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
99  "rubric": {
100    "A": {
101      "content": {
102        "correctness": 5,
103        "completeness": 5,
104        "accuracy": 4
105      },
106      "structure": {
107        "organization": 4,
108        "formatting": 5,
109        "usability": 4
110      },
111      "content_score": 4.7,
112      "structure_score": 4.3,
113      "overall_score": 9.0
114    },
115    "B": {
116      "content": {
117        "correctness": 3,
118        "completeness": 2,
119        "accuracy": 3
120      },
121      "structure": {
122        "organization": 3,
123        "formatting": 2,
124        "usability": 3
125      },
126      "content_score": 2.7,
127      "structure_score": 2.7,
128      "overall_score": 5.4
129    }
130  },
131  "output_quality": {
132    "A": {
133      "score": 9,
134      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
135      "weaknesses": ["Minor style inconsistency in header"]
136    },
137    "B": {
138      "score": 5,
139      "strengths": ["Readable output", "Correct basic structure"],
140      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
141    }
142  },
143  "expectation_results": {
144    "A": {
145      "passed": 4,
146      "total": 5,
147      "pass_rate": 0.80,
148      "details": [
149        {"text": "Output includes name", "passed": true},
150        {"text": "Output includes date", "passed": true},
151        {"text": "Format is PDF", "passed": true},
152        {"text": "Contains signature", "passed": false},
153        {"text": "Readable text", "passed": true}
154      ]
155    },
156    "B": {
157      "passed": 3,
158      "total": 5,
159      "pass_rate": 0.60,
160      "details": [
161        {"text": "Output includes name", "passed": true},
162        {"text": "Output includes date", "passed": false},
163        {"text": "Format is PDF", "passed": true},
164        {"text": "Contains signature", "passed": false},
165        {"text": "Readable text", "passed": true}
166      ]
167    }
168  }
169}
170```
171 
172If no expectations were provided, omit the `expectation_results` field entirely.
173 
174## Field Descriptions
175 
176- **winner**: "A", "B", or "TIE"
177- **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
178- **rubric**: Structured rubric evaluation for each output
179  - **content**: Scores for content criteria (correctness, completeness, accuracy)
180  - **structure**: Scores for structure criteria (organization, formatting, usability)
181  - **content_score**: Average of content criteria (1-5)
182  - **structure_score**: Average of structure criteria (1-5)
183  - **overall_score**: Combined score scaled to 1-10
184- **output_quality**: Summary quality assessment
185  - **score**: 1-10 rating (should match rubric overall_score)
186  - **strengths**: List of positive aspects
187  - **weaknesses**: List of issues or shortcomings
188- **expectation_results**: (Only if expectations provided)
189  - **passed**: Number of expectations that passed
190  - **total**: Total number of expectations
191  - **pass_rate**: Fraction passed (0.0 to 1.0)
192  - **details**: Individual expectation results
193 
194## Guidelines
195 
196- **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
197- **Be specific**: Cite specific examples when explaining strengths and weaknesses.
198- **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
199- **Output quality first**: Assertion scores are secondary to overall task completion.
200- **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
201- **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
202- **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.
203

Skill Creator

agents/comparator.md

Preparing the source view

Skill Creator

agents/comparator.md