# MCP Server Evaluation Guide

## Overview

This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.

---

## Quick Reference

### Evaluation Requirements
- Create 10 human-readable questions
- Questions must be READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE
- Each question requires multiple tool calls (potentially dozens)
- Answers must be single, verifiable values
- Answers must be STABLE (won't change over time)

### Output Format
```xml
<evaluation>
<qa_pair>
<question>Your question here</question>
<answer>Single verifiable answer</answer>
</qa_pair>
</evaluation>
```

---

## Purpose of Evaluations

The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP server to answer realistic and difficult questions.

## Evaluation Overview

Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
- Realistic
- Clear and concise
- Unambiguous
- Complex, requiring potentially dozens of tool calls or steps
- Answerable with a single, verifiable value that you identify in advance

## Question Guidelines

### Core Requirements

1. **Questions MUST be independent**
   - Each question should NOT depend on the answer to any other question
   - Should not assume prior write operations from processing another question

2. **Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use**
   - Should not instruct or require modifying state to arrive at the correct answer

3. **Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX**
   - Must require another LLM to use multiple (potentially dozens of) tools or steps to answer

### Complexity and Depth

4. **Questions must require deep exploration**
   - Consider multi-hop questions requiring multiple sub-questions and sequential tool calls
   - Each step should benefit from information found in previous steps

5. **Questions may require extensive paging**
   - May need paging through multiple pages of results
   - May require querying old data (1-2 years out-of-date) to find niche information
   - The questions must be DIFFICULT

6. **Questions must require deep understanding**
   - Rather than surface-level knowledge
   - May pose complex ideas as True/False questions requiring evidence
   - May use multiple-choice format where the LLM must search different hypotheses

7. **Questions must not be solvable with straightforward keyword search**
   - Do not include specific keywords from the target content
   - Use synonyms, related concepts, or paraphrases
   - Require multiple searches, analyzing multiple related items, extracting context, then deriving the answer

### Tool Testing

8. **Questions should stress-test tool return values**
   - May elicit tools returning large JSON objects or lists, overwhelming the LLM
   - Should require understanding multiple modalities of data:
     - IDs and names
     - Timestamps and datetimes (months, days, years, seconds)
     - File IDs, names, extensions, and mimetypes
     - URLs, GIDs, etc.
   - Should probe the tool's ability to return all useful forms of data

9. **Questions should MOSTLY reflect real human use cases**
   - The kinds of information retrieval tasks that HUMANS assisted by an LLM would care about

10. **Questions may require dozens of tool calls**
    - This challenges LLMs with limited context
    - Encourages MCP server tools to reduce information returned

11. **Include ambiguous questions**
    - May be ambiguous OR require difficult decisions on which tools to call
    - Force the LLM to potentially make mistakes or misinterpret
    - Ensure that despite AMBIGUITY, there is STILL A SINGLE VERIFIABLE ANSWER

### Stability

12. **Questions must be designed so the answer DOES NOT CHANGE**
    - Do not ask questions that rely on "current state", which is dynamic
    - For example, do not count:
      - Number of reactions to a post
      - Number of replies to a thread
      - Number of members in a channel

13. **DO NOT let the MCP server RESTRICT the kinds of questions you create**
    - Create challenging and complex questions
    - Some may not be solvable with the available MCP server tools
    - Questions may require specific output formats (datetime vs. epoch time, JSON vs. MARKDOWN)
    - Questions may require dozens of tool calls to complete

## Answer Guidelines

### Verification

1. **Answers must be VERIFIABLE via direct string comparison**
   - If the answer can be re-written in many formats, clearly specify the output format in the QUESTION
   - Examples: "Use YYYY/MM/DD.", "Respond True or False.", "Answer A, B, C, or D and nothing else."
   - Answer should be a single VERIFIABLE value such as:
     - User ID, user name, display name, first name, last name
     - Channel ID, channel name
     - Message ID, message string
     - URL, title
     - Numerical quantity
     - Timestamp, datetime
     - Boolean (for True/False questions)
     - Email address, phone number
     - File ID, file name, file extension
     - Multiple choice answer
   - Answers must not require special formatting or complex, structured output
   - Answer will be verified using DIRECT STRING COMPARISON
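Because grading is a direct string comparison, even trivial formatting drift can fail a correct answer. The sketch below shows roughly what such a check looks like; whether the actual harness in `scripts/evaluation.py` trims whitespace or ignores case is an assumption here, so specify the expected format in the question rather than relying on normalization:

```python
# Sketch of direct-string-comparison grading. The normalization shown
# (trim + case-fold) is an assumption, not the harness's documented behavior.
def is_correct(expected: str, actual: str) -> bool:
    return expected.strip().casefold() == actual.strip().casefold()

assert is_correct("Website Redesign", "  website redesign ")
# List-valued answers illustrate the risk: a reordered list fails outright.
assert not is_correct("repo1, repo2", "repo2, repo1")
```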
### Readability

2. **Answers should generally prefer HUMAN-READABLE formats**
   - Examples: names, first name, last name, datetime, file name, message string, URL, yes/no, true/false, a/b/c/d
   - Rather than opaque IDs (though IDs are acceptable)
   - The VAST MAJORITY of answers should be human-readable

### Stability

3. **Answers must be STABLE/STATIONARY**
   - Look at old content (e.g., conversations that have ended, projects that have launched, questions answered)
   - Create QUESTIONS based on "closed" concepts that will always return the same answer
   - Questions may ask to consider a fixed time window to insulate from non-stationary answers
   - Rely on context UNLIKELY to change
   - Example: if finding a paper name, be SPECIFIC enough so the answer is not confused with papers published later

4. **Answers must be CLEAR and UNAMBIGUOUS**
   - Questions must be designed so there is a single, clear answer
   - Answer can be derived from using the MCP server tools

### Diversity

5. **Answers must be DIVERSE**
   - Answer should be a single VERIFIABLE value in diverse modalities and formats
   - User concept: user ID, user name, display name, first name, last name, email address, phone number
   - Channel concept: channel ID, channel name, channel topic
   - Message concept: message ID, message string, timestamp, month, day, year

6. **Answers must NOT be complex structures**
   - Not a list of values
   - Not a complex object
   - Not a list of IDs or strings
   - Not natural language text
   - UNLESS the answer can be straightforwardly verified using DIRECT STRING COMPARISON
   - And can be realistically reproduced
   - It should be unlikely that an LLM would return the same list in any other order or format

## Evaluation Process

### Step 1: Documentation Inspection

Read the documentation of the target API to understand:
- Available endpoints and functionality
- If ambiguity exists, fetch additional information from the web
- Parallelize this step AS MUCH AS POSSIBLE
- Ensure each subagent is ONLY examining documentation from the file system or on the web

### Step 2: Tool Inspection

List the tools available in the MCP server:
- Inspect the MCP server directly
- Understand input/output schemas, docstrings, and descriptions
- WITHOUT calling the tools themselves at this stage
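One way to inspect tools without invoking them is the official `mcp` Python SDK's client. A minimal sketch (the server command and file name are placeholders):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def list_server_tools() -> None:
    # Launch the server over stdio, then list its tools without calling any.
    params = StdioServerParameters(command="python", args=["my_mcp_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.list_tools()
            for tool in result.tools:
                print(tool.name, "-", tool.description)
                print(tool.inputSchema)  # JSON Schema for the tool's inputs


asyncio.run(list_server_tools())
```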
### Step 3: Developing Understanding

Repeat steps 1 & 2 until you have a good understanding:
- Iterate multiple times
- Think about the kinds of tasks you want to create
- Refine your understanding
- At NO stage should you READ the code of the MCP server implementation itself
- Use your intuition and understanding to create reasonable, realistic, but VERY challenging tasks

### Step 4: Read-Only Content Inspection

After understanding the API and tools, USE the MCP server tools:
- Inspect content using READ-ONLY and NON-DESTRUCTIVE operations ONLY
- Goal: identify specific content (e.g., users, channels, messages, projects, tasks) for creating realistic questions
- Should NOT call any tools that modify state
- Will NOT read the code of the MCP server implementation itself
- Parallelize this step with individual sub-agents pursuing independent explorations
- Ensure each subagent is only performing READ-ONLY, NON-DESTRUCTIVE, and IDEMPOTENT operations
- BE CAREFUL: SOME TOOLS may return LOTS OF DATA, which would cause you to run out of CONTEXT
- Make INCREMENTAL, SMALL, AND TARGETED tool calls for exploration
- In all tool call requests, use the `limit` parameter to limit results (<10)
- Use pagination

### Step 5: Task Generation

After inspecting the content, create 10 human-readable questions:
- An LLM should be able to answer these with the MCP server
- Follow all question and answer guidelines above

## Output Format

Each QA pair consists of a question and an answer. The output should be an XML file with this structure:

```xml
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
<qa_pair>
<question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
<answer>7</answer>
</qa_pair>
<qa_pair>
<question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
<answer>data-pipeline</answer>
</qa_pair>
</evaluation>
```
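Before handing the file to the harness, it can be worth sanity-checking it mechanically. A minimal sketch using only the Python standard library (the file path is a placeholder):

```python
import xml.etree.ElementTree as ET

# Parse the evaluation file and verify the basic shape described above,
# including the 10-question requirement from the guidelines.
root = ET.parse("evaluation.xml").getroot()
pairs = [(qa.findtext("question"), qa.findtext("answer"))
         for qa in root.findall("qa_pair")]

assert root.tag == "evaluation"
assert len(pairs) == 10, f"expected 10 qa_pairs, found {len(pairs)}"
assert all(q and a for q, a in pairs), "empty question or answer"
```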
## Evaluation Examples

### Good Questions

**Example 1: Multi-hop question requiring deep exploration (GitHub MCP)**
```xml
<qa_pair>
<question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
<answer>Python</answer>
</qa_pair>
```

This question is good because:
- Requires multiple searches to find archived repositories
- Needs to identify which had the most forks before archival
- Requires examining repository details for the language
- Answer is a simple, verifiable value
- Based on historical (closed) data that won't change

**Example 2: Requires understanding context without keyword matching (Project Management MCP)**
```xml
<qa_pair>
<question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
<answer>Product Manager</answer>
</qa_pair>
```

This question is good because:
- Doesn't use the specific project name ("initiative focused on improving customer onboarding")
- Requires finding completed projects from a specific timeframe
- Needs to identify the project lead and their role
- Requires understanding context from retrospective documents
- Answer is human-readable and stable
- Based on completed work (won't change)

**Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)**
```xml
<qa_pair>
<question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
<answer>alex_eng</answer>
</qa_pair>
```

This question is good because:
- Requires filtering bugs by date, priority, and status
- Needs to group by assignee and calculate resolution rates
- Requires understanding timestamps to determine 48-hour windows
- Tests pagination (potentially many bugs to process)
- Answer is a single username
- Based on historical data from a specific time period
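To see concretely what Example 3 demands of the agent, here is the aggregation sketched in Python over hypothetical bug records (field names and values are illustrative only, not from any real tracker):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical critical bugs from January 2024, as the agent might
# assemble them from paginated tool calls.
bugs = [
    {"assignee": "alex_eng", "reported": datetime(2024, 1, 3, 9, 0),
     "resolved": datetime(2024, 1, 4, 15, 0)},
    {"assignee": "alex_eng", "reported": datetime(2024, 1, 10, 9, 0),
     "resolved": datetime(2024, 1, 15, 9, 0)},
    {"assignee": "sam_eng", "reported": datetime(2024, 1, 5, 9, 0),
     "resolved": datetime(2024, 1, 9, 9, 0)},
]

within_48h = defaultdict(int)
totals = defaultdict(int)
for bug in bugs:
    totals[bug["assignee"]] += 1
    if bug["resolved"] - bug["reported"] <= timedelta(hours=48):
        within_48h[bug["assignee"]] += 1

# Highest share of assigned bugs resolved within 48 hours.
best = max(totals, key=lambda a: within_48h[a] / totals[a])
print(best)  # -> alex_eng (1/2 beats 0/1)
```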
**Example 4: Requires synthesis across multiple data types (CRM MCP)**
```xml
<qa_pair>
<question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
<answer>Healthcare</answer>
</qa_pair>
```

This question is good because:
- Requires understanding subscription tier changes
- Needs to identify upgrade events in a specific timeframe
- Requires comparing contract values
- Must access account industry information
- Answer is simple and verifiable
- Based on completed historical transactions

### Poor Questions

**Example 1: Answer changes over time**
```xml
<qa_pair>
<question>How many open issues are currently assigned to the engineering team?</question>
<answer>47</answer>
</qa_pair>
```

This question is poor because:
- The answer will change as issues are created, closed, or reassigned
- Not based on stable/stationary data
- Relies on "current state", which is dynamic

**Example 2: Too easy with keyword search**
```xml
<qa_pair>
<question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
<answer>developer123</answer>
</qa_pair>
```

This question is poor because:
- Can be solved with a straightforward keyword search for the exact title
- Doesn't require deep exploration or understanding
- No synthesis or analysis needed

**Example 3: Ambiguous answer format**
```xml
<qa_pair>
<question>List all the repositories that have Python as their primary language.</question>
<answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
</qa_pair>
```

This question is poor because:
- Answer is a list that could be returned in any order
- Difficult to verify with direct string comparison
- LLM might format it differently (JSON array, comma-separated, newline-separated)
- Better to ask for a specific aggregate (count) or superlative (most stars)

## Verification Process

After creating evaluations:

1. **Examine the XML file** to understand the schema
2. **Load each task instruction** and, in parallel, identify the correct answer by attempting to solve the task YOURSELF using the MCP server and tools
3. **Flag any operations** that require WRITE or DESTRUCTIVE operations
4. **Accumulate all CORRECT answers** and replace any incorrect answers in the document
5. **Remove any `<qa_pair>`** that requires WRITE or DESTRUCTIVE operations

Remember to parallelize solving tasks to avoid running out of context, then accumulate all answers and make changes to the file at the end.

## Tips for Creating Quality Evaluations

1. **Think Hard and Plan Ahead** before generating tasks
2. **Parallelize Where Opportunity Arises** to speed up the process and manage context
3. **Focus on Realistic Use Cases** that humans would actually want to accomplish
4. **Create Challenging Questions** that test the limits of the MCP server's capabilities
5. **Ensure Stability** by using historical data and closed concepts
6. **Verify Answers** by solving the questions yourself using the MCP server tools
7. **Iterate and Refine** based on what you learn during the process

---

# Running Evaluations

After creating your evaluation file, you can use the provided evaluation harness to test your MCP server.

## Setup

1. **Install Dependencies**

```bash
pip install -r scripts/requirements.txt
```

Or install manually:
```bash
pip install anthropic mcp
```
2. **Set API Key**

```bash
export ANTHROPIC_API_KEY=your_api_key_here
```

## Evaluation File Format

Evaluation files use XML format with `<qa_pair>` elements:

```xml
<evaluation>
<qa_pair>
<question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
<answer>Website Redesign</answer>
</qa_pair>
<qa_pair>
<question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
<answer>sarah_dev</answer>
</qa_pair>
</evaluation>
```

## Running Evaluations

The evaluation script (`scripts/evaluation.py`) supports three transport types:

**Important:**
- **stdio transport**: The evaluation script automatically launches and manages the MCP server process for you. Do not run the server manually.
- **sse/http transports**: You must start the MCP server separately before running the evaluation. The script connects to the already-running server at the specified URL.

### 1. Local STDIO Server

For locally-run MCP servers (the script launches the server automatically):

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_mcp_server.py \
  evaluation.xml
```

With environment variables:
```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_mcp_server.py \
  -e API_KEY=abc123 \
  -e DEBUG=true \
  evaluation.xml
```
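If you want to dry-run the harness before your real server is ready, a stub server works with the stdio invocation above. A minimal sketch using FastMCP from the Python MCP SDK (the tool name and its data are placeholders):

```python
# my_mcp_server.py - placeholder stdio server for exercising the harness.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-server")


@mcp.tool()
def list_projects(limit: int = 10) -> list[str]:
    """Return up to `limit` project names (static placeholder data)."""
    projects = ["Website Redesign", "Data Pipeline", "Mobile App"]
    return projects[:limit]


if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```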
### 2. Server-Sent Events (SSE)

For SSE-based MCP servers (you must start the server first):

```bash
python scripts/evaluation.py \
  -t sse \
  -u https://example.com/mcp \
  -H "Authorization: Bearer token123" \
  -H "X-Custom-Header: value" \
  evaluation.xml
```

### 3. HTTP (Streamable HTTP)

For HTTP-based MCP servers (you must start the server first):

```bash
python scripts/evaluation.py \
  -t http \
  -u https://example.com/mcp \
  -H "Authorization: Bearer token123" \
  evaluation.xml
```

## Command-Line Options

```
usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
                     [-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
                     [-H HEADERS [HEADERS ...]] [-o OUTPUT]
                     eval_file

positional arguments:
  eval_file        Path to evaluation XML file

optional arguments:
  -h, --help       Show help message
  -t, --transport  Transport type: stdio, sse, or http (default: stdio)
  -m, --model      Claude model to use (default: claude-3-7-sonnet-20250219)
  -o, --output     Output file for report (default: print to stdout)

stdio options:
  -c, --command    Command to run MCP server (e.g., python, node)
  -a, --args       Arguments for the command (e.g., server.py)
  -e, --env        Environment variables in KEY=VALUE format

sse/http options:
  -u, --url        MCP server URL
  -H, --header     HTTP headers in 'Key: Value' format
```

## Output

The evaluation script generates a detailed report including:

- **Summary Statistics**:
  - Accuracy (correct/total)
  - Average task duration
  - Average tool calls per task
  - Total tool calls

- **Per-Task Results**:
  - Prompt and expected response
  - Actual response from the agent
  - Whether the answer was correct (✅/❌)
  - Duration and tool call details
  - Agent's summary of its approach
  - Agent's feedback on the tools

### Save Report to File

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a my_server.py \
  -o evaluation_report.md \
  evaluation.xml
```

## Complete Example Workflow

Here's a complete example of creating and running an evaluation:

1. **Create your evaluation file** (`my_evaluation.xml`):

```xml
<evaluation>
<qa_pair>
<question>Find the user who created the most issues in January 2024. What is their username?</question>
<answer>alice_developer</answer>
</qa_pair>
<qa_pair>
<question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
<answer>backend-api</answer>
</qa_pair>
<qa_pair>
<question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
<answer>127</answer>
</qa_pair>
</evaluation>
```

2. **Install dependencies**:

```bash
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=your_api_key
```

3. **Run evaluation**:

```bash
python scripts/evaluation.py \
  -t stdio \
  -c python \
  -a github_mcp_server.py \
  -e GITHUB_TOKEN=ghp_xxx \
  -o github_eval_report.md \
  my_evaluation.xml
```
4. **Review the report** in `github_eval_report.md` to:
   - See which questions passed/failed
   - Read the agent's feedback on your tools
   - Identify areas for improvement
   - Iterate on your MCP server design

## Troubleshooting

### Connection Errors

If you get connection errors:
- **STDIO**: Verify the command and arguments are correct
- **SSE/HTTP**: Check the URL is accessible and headers are correct
- Ensure any required API keys are set in environment variables or headers

### Low Accuracy

If many evaluations fail:
- Review the agent's feedback for each task
- Check if tool descriptions are clear and comprehensive
- Verify input parameters are well-documented
- Consider whether tools return too much or too little data
- Ensure error messages are actionable

### Timeout Issues

If tasks are timing out:
- Use a more capable model (e.g., `claude-3-7-sonnet-20250219`)
- Check if tools are returning too much data
- Verify pagination is working correctly
- Consider simplifying complex questions