A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/llm-as-judge-skills/tools/evaluation/direct-score.md
# Direct Score Tool

## Purpose

Evaluate a single LLM response against defined criteria using a scoring rubric.

## Tool Definition

```typescript
import { tool } from "ai";
import { z } from "zod";

export const directScore = tool({
  description: `Evaluate a response by scoring it against specific criteria.
Use this for objective evaluations where you need to assess quality
dimensions like accuracy, completeness, clarity, or task adherence.
Returns structured scores with justifications.`,

  parameters: z.object({
    response: z.string()
      .describe("The LLM response to evaluate"),

    prompt: z.string()
      .describe("The original prompt/instruction that generated the response"),

    context: z.string().optional()
      .describe("Additional context like retrieved documents or conversation history"),

    criteria: z.array(z.object({
      name: z.string().describe("Name of the criterion (e.g., 'Accuracy')"),
      description: z.string().describe("What this criterion measures"),
      weight: z.number().min(0).max(1).default(1)
        .describe("Relative importance; weights should sum to 1")
    })).min(1).describe("Evaluation criteria to score against"),

    rubric: z.object({
      scale: z.enum(["1-3", "1-5", "1-10"]).default("1-5"),
      levelDescriptions: z.record(z.string(), z.string()).optional()
        .describe("Optional descriptions for each score level")
    }).optional().describe("Scoring rubric configuration")
  }),

  execute: async (input) => {
    // Implementation delegates to the evaluator LLM
    return evaluateWithLLM(input);
  }
});
```

## Input Schema

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| response | string | Yes | The response being evaluated |
| prompt | string | Yes | Original prompt that generated the response |
| context | string | No | Additional context (RAG docs, history) |
| criteria | Criterion[] | Yes | List of evaluation criteria |
| rubric | Rubric | No | Scoring scale and level descriptions |

### Criterion Object

```typescript
{
  name: string;        // e.g., "Factual Accuracy"
  description: string; // e.g., "Response contains no factual errors"
  weight: number;      // 0-1, relative importance
}
```

### Rubric Object

```typescript
{
  scale: "1-3" | "1-5" | "1-10";
  levelDescriptions?: {
    "1": "Poor - Major issues",
    "2": "Below Average - Several issues",
    "3": "Average - Some issues",
    "4": "Good - Minor issues",
    "5": "Excellent - No issues"
  }
}
```

## Output Schema

```typescript
interface DirectScoreResult {
  success: boolean;

  scores: {
    criterion: string;
    score: number;
    maxScore: number;
    justification: string;
    examples: string[]; // Specific examples from the response
  }[];

  overallScore: number;
  weightedScore: number;

  summary: {
    strengths: string[];
    weaknesses: string[];
    suggestions: string[];
  };

  metadata: {
    evaluationTimeMs: number;
    criteriaCount: number;
    rubricScale: string;
  };
}
```

## Usage Example

```typescript
const result = await directScore.execute({
  response: "Machine learning is a subset of AI that enables systems to learn from data...",

  prompt: "Explain machine learning to a beginner",

  criteria: [
    {
      name: "Accuracy",
      description: "Technical correctness of explanations",
      weight: 0.4
    },
    {
      name: "Clarity",
      description: "Understandable for a beginner",
      weight: 0.3
    },
    {
      name: "Completeness",
      description: "Covers key concepts adequately",
      weight: 0.3
    }
  ],

  rubric: {
    scale: "1-5",
    levelDescriptions: {
      "1": "Completely fails criterion",
      "2": "Major deficiencies",
      "3": "Acceptable but improvable",
      "4": "Good with minor issues",
      "5": "Excellent, no issues"
    }
  }
});
```

## Implementation Notes

1. **Chain-of-Thought**: The implementation should use CoT prompting for more reliable scoring
2. **Calibration**: Include few-shot examples of scores at each level
3. **Justification First**: Ask for the justification before the score to reduce bias
4. **Length Normalization**: Consider response length when appropriate
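
## Score Aggregation

The output schema includes both an `overallScore` and a `weightedScore`, but does not spell out how they are computed. One reasonable interpretation, sketched below under assumptions (the helper name `aggregateScores` and its exact formulas are illustrative, not part of the tool's defined API): normalize each criterion score by its rubric maximum, take the plain mean for `overallScore`, and take the weight-normalized mean for `weightedScore`. Dividing by the total weight keeps the result stable even if the supplied weights do not sum to exactly 1.

```typescript
// Hypothetical aggregation helper: combines per-criterion scores into the
// overallScore and weightedScore fields of DirectScoreResult.
interface CriterionScore {
  criterion: string;
  score: number;    // raw score on the rubric scale
  maxScore: number; // e.g., 5 for a "1-5" rubric
  weight: number;   // relative importance, 0-1
}

export function aggregateScores(scores: CriterionScore[]): {
  overallScore: number;
  weightedScore: number;
} {
  // Normalize each score to 0-1 so different rubric scales are comparable.
  const normalized = scores.map((s) => s.score / s.maxScore);

  // overallScore: unweighted mean of the normalized scores.
  const overallScore =
    normalized.reduce((sum, n) => sum + n, 0) / scores.length;

  // weightedScore: weighted mean, normalized by total weight so it is
  // well-defined even when the weights do not sum to exactly 1.
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  const weightedScore =
    scores.reduce((sum, s, i) => sum + normalized[i] * s.weight, 0) /
    totalWeight;

  return { overallScore, weightedScore };
}
```

With the criteria from the usage example above and raw scores of 4, 5, and 2, the normalized scores are 0.8, 1.0, and 0.4, giving an unweighted mean of about 0.733 and a weighted mean of 0.74.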