A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/llm-as-judge-skills/tools/evaluation/direct-score.md
# Direct Score Tool

## Purpose

Evaluate a single LLM response against defined criteria using a scoring rubric.

## Tool Definition

```typescript
import { tool } from "ai";
import { z } from "zod";

export const directScore = tool({
  description: `Evaluate a response by scoring it against specific criteria.
Use this for objective evaluations where you need to assess quality
dimensions like accuracy, completeness, clarity, or task adherence.
Returns structured scores with justifications.`,

  parameters: z.object({
    response: z.string()
      .describe("The LLM response to evaluate"),

    prompt: z.string()
      .describe("The original prompt/instruction that generated the response"),

    context: z.string().optional()
      .describe("Additional context like retrieved documents or conversation history"),

    criteria: z.array(z.object({
      name: z.string().describe("Name of the criterion (e.g., 'Accuracy')"),
      description: z.string().describe("What this criterion measures"),
      weight: z.number().min(0).max(1).default(1)
        .describe("Relative importance; weights should sum to 1")
    })).min(1).describe("Evaluation criteria to score against"),

    rubric: z.object({
      scale: z.enum(["1-3", "1-5", "1-10"]).default("1-5"),
      levelDescriptions: z.record(z.string(), z.string()).optional()
        .describe("Optional descriptions for each score level")
    }).optional().describe("Scoring rubric configuration")
  }),

  execute: async (input) => {
    // Implementation delegates to the evaluator LLM
    return evaluateWithLLM(input);
  }
});
```

## Input Schema

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| response | string | Yes | The response being evaluated |
| prompt | string | Yes | Original prompt that generated the response |
| context | string | No | Additional context (RAG docs, history) |
| criteria | Criterion[] | Yes | List of evaluation criteria |
| rubric | Rubric | No | Scoring scale and level descriptions |

### Criterion Object

```typescript
{
  name: string;        // e.g., "Factual Accuracy"
  description: string; // e.g., "Response contains no factual errors"
  weight: number;      // 0-1, relative importance
}
```

### Rubric Object

```typescript
{
  scale: "1-3" | "1-5" | "1-10";
  levelDescriptions?: {
    "1": "Poor - Major issues",
    "2": "Below Average - Several issues",
    "3": "Average - Some issues",
    "4": "Good - Minor issues",
    "5": "Excellent - No issues"
  }
}
```

## Output Schema

```typescript
interface DirectScoreResult {
  success: boolean;

  scores: {
    criterion: string;
    score: number;
    maxScore: number;
    justification: string;
    examples: string[]; // Specific examples from the response
  }[];

  overallScore: number;
  weightedScore: number;

  summary: {
    strengths: string[];
    weaknesses: string[];
    suggestions: string[];
  };

  metadata: {
    evaluationTimeMs: number;
    criteriaCount: number;
    rubricScale: string;
  };
}
```

## Usage Example

```typescript
const result = await directScore.execute({
  response: "Machine learning is a subset of AI that enables systems to learn from data...",

  prompt: "Explain machine learning to a beginner",

  criteria: [
    {
      name: "Accuracy",
      description: "Technical correctness of explanations",
      weight: 0.4
    },
    {
      name: "Clarity",
      description: "Understandable for a beginner",
      weight: 0.3
    },
    {
      name: "Completeness",
      description: "Covers key concepts adequately",
      weight: 0.3
    }
  ],

  rubric: {
    scale: "1-5",
    levelDescriptions: {
      "1": "Completely fails criterion",
      "2": "Major deficiencies",
      "3": "Acceptable but improvable",
      "4": "Good with minor issues",
      "5": "Excellent, no issues"
    }
  }
});
```

## Implementation Notes

1. **Chain-of-Thought**: The implementation should use CoT prompting for more reliable scoring
2. **Calibration**: Include few-shot examples of scores at each level
3. **Justification First**: Ask for the justification before the score to reduce bias
4. **Length Normalization**: Consider response length when appropriate
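
## Score Aggregation

The output schema includes both an `overallScore` and a `weightedScore`, but does not spell out how they are computed. One reasonable interpretation, sketched below under assumptions (the helper name `aggregateScores` and its exact formulas are illustrative, not part of the tool's defined API): normalize each criterion score by its rubric maximum, take the plain mean for `overallScore`, and take the weight-normalized mean for `weightedScore`. Dividing by the total weight keeps the result stable even if the supplied weights do not sum to exactly 1.

```typescript
// Hypothetical aggregation helper: combines per-criterion scores into the
// overallScore and weightedScore fields of DirectScoreResult.
interface CriterionScore {
  criterion: string;
  score: number;    // raw score on the rubric scale
  maxScore: number; // e.g., 5 for a "1-5" rubric
  weight: number;   // relative importance, 0-1
}

export function aggregateScores(scores: CriterionScore[]): {
  overallScore: number;
  weightedScore: number;
} {
  // Normalize each score to 0-1 so different rubric scales are comparable.
  const normalized = scores.map((s) => s.score / s.maxScore);

  // overallScore: unweighted mean of the normalized scores.
  const overallScore =
    normalized.reduce((sum, n) => sum + n, 0) / scores.length;

  // weightedScore: weighted mean, normalized by total weight so it is
  // well-defined even when the weights do not sum to exactly 1.
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  const weightedScore =
    scores.reduce((sum, s, i) => sum + normalized[i] * s.weight, 0) /
    totalWeight;

  return { overallScore, weightedScore };
}
```

With the criteria from the usage example above and raw scores of 4, 5, and 2, the normalized scores are 0.8, 1.0, and 0.4, giving an unweighted mean of about 0.733 and a weighted mean of 0.74.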