references/schemas.md
# JSON Schemas

This document defines the JSON schemas used by skill-creator.

---

## evals.json

Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's example prompt",
      "expected_output": "Description of expected result",
      "files": ["evals/files/sample1.pdf"],
      "expectations": [
        "The output includes X",
        "The skill used script Y"
      ]
    }
  ]
}
```

**Fields:**
- `skill_name`: Name matching the skill's frontmatter
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute
- `evals[].expected_output`: Human-readable description of success
- `evals[].files`: Optional list of input file paths (relative to skill root)
- `evals[].expectations`: List of verifiable statements

---

## history.json

Tracks version progression in Improve mode. Located at the workspace root.

```json
{
  "started_at": "2026-01-15T10:30:00Z",
  "skill_name": "pdf",
  "current_best": "v2",
  "iterations": [
    {
      "version": "v0",
      "parent": null,
      "expectation_pass_rate": 0.65,
      "grading_result": "baseline",
      "is_current_best": false
    },
    {
      "version": "v1",
      "parent": "v0",
      "expectation_pass_rate": 0.75,
      "grading_result": "won",
      "is_current_best": false
    },
    {
      "version": "v2",
      "parent": "v1",
      "expectation_pass_rate": 0.85,
      "grading_result": "won",
      "is_current_best": true
    }
  ]
}
```

**Fields:**
- `started_at`: ISO timestamp of when improvement started
- `skill_name`: Name of the skill being improved
- `current_best`: Version identifier of the best performer
- `iterations[].version`: Version identifier (v0, v1, ...)
- `iterations[].parent`: Parent version this was derived from
- `iterations[].expectation_pass_rate`: Pass rate from grading
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
- `iterations[].is_current_best`: Whether this is the current best version

---

## grading.json

Output from the grader agent. Located at `<run-dir>/grading.json`.

```json
{
  "expectations": [
    {
      "text": "The output includes the name 'John Smith'",
      "passed": true,
      "evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
    },
    {
      "text": "The spreadsheet has a SUM formula in cell B10",
      "passed": false,
      "evidence": "No spreadsheet was created. The output was a text file."
    }
  ],
  "summary": {
    "passed": 1,
    "failed": 1,
    "total": 2,
    "pass_rate": 0.5
  },
  "execution_metrics": {
    "tool_calls": {
      "Read": 5,
      "Write": 2,
      "Bash": 8
    },
    "total_tool_calls": 15,
    "total_steps": 6,
    "errors_encountered": 0,
    "output_chars": 12450,
    "transcript_chars": 3200
  },
  "timing": {
    "executor_duration_seconds": 165.0,
    "grader_duration_seconds": 26.0,
    "total_duration_seconds": 191.0
  },
  "claims": [
    {
      "claim": "The form has 12 fillable fields",
      "type": "factual",
      "verified": true,
      "evidence": "Counted 12 fields in field_info.json"
    }
  ],
  "user_notes_summary": {
    "uncertainties": ["Used 2023 data, may be stale"],
    "needs_review": [],
    "workarounds": ["Fell back to text overlay for non-fillable fields"]
  },
  "eval_feedback": {
    "suggestions": [
      {
        "assertion": "The output includes the name 'John Smith'",
        "reason": "A hallucinated document that mentions the name would also pass"
      }
    ],
    "overall": "Assertions check presence but not correctness."
  }
}
```

**Fields:**
- `expectations[]`: Graded expectations with evidence
- `summary`: Aggregate pass/fail counts
- `execution_metrics`: Tool usage and output size (from the executor's metrics.json)
- `timing`: Wall-clock timing (from timing.json)
- `claims`: Extracted and verified claims from the output
- `user_notes_summary`: Issues flagged by the executor
- `eval_feedback`: (optional) Improvement suggestions for the evals; only present when the grader identifies issues worth raising

---

## metrics.json

Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.

```json
{
  "tool_calls": {
    "Read": 5,
    "Write": 2,
    "Bash": 8,
    "Edit": 1,
    "Glob": 2,
    "Grep": 0
  },
  "total_tool_calls": 18,
  "total_steps": 6,
  "files_created": ["filled_form.pdf", "field_values.json"],
  "errors_encountered": 0,
  "output_chars": 12450,
  "transcript_chars": 3200
}
```

**Fields:**
- `tool_calls`: Count per tool type
- `total_tool_calls`: Sum of all tool calls
- `total_steps`: Number of major execution steps
- `files_created`: List of output files created
- `errors_encountered`: Number of errors during execution
- `output_chars`: Total character count of output files
- `transcript_chars`: Character count of the transcript

---

## timing.json

Wall-clock timing for a run. Located at `<run-dir>/timing.json`.

**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately; they are not persisted anywhere else and cannot be recovered after the fact.

```json
{
  "total_tokens": 84852,
  "duration_ms": 191000,
  "total_duration_seconds": 191.0,
  "executor_start": "2026-01-15T10:30:00Z",
  "executor_end": "2026-01-15T10:32:45Z",
  "executor_duration_seconds": 165.0,
  "grader_start": "2026-01-15T10:32:46Z",
  "grader_end": "2026-01-15T10:33:12Z",
  "grader_duration_seconds": 26.0
}
```

---

## benchmark.json

Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.

```json
{
  "metadata": {
    "skill_name": "pdf",
    "skill_path": "/path/to/pdf",
    "executor_model": "claude-sonnet-4-20250514",
    "analyzer_model": "most-capable-model",
    "timestamp": "2026-01-15T10:30:00Z",
    "evals_run": [1, 2, 3],
    "runs_per_configuration": 3
  },

  "runs": [
    {
      "eval_id": 1,
      "eval_name": "Ocean",
      "configuration": "with_skill",
      "run_number": 1,
      "result": {
        "pass_rate": 0.85,
        "passed": 6,
        "failed": 1,
        "total": 7,
        "time_seconds": 42.5,
        "tokens": 3800,
        "tool_calls": 18,
        "errors": 0
      },
      "expectations": [
        {"text": "...", "passed": true, "evidence": "..."}
      ],
      "notes": [
        "Used 2023 data, may be stale",
        "Fell back to text overlay for non-fillable fields"
      ]
    }
  ],

  "run_summary": {
    "with_skill": {
      "pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
      "time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
      "tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
    },
    "without_skill": {
      "pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
      "time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
      "tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
    },
    "delta": {
      "pass_rate": "+0.50",
      "time_seconds": "+13.0",
      "tokens": "+1700"
    }
  },

  "notes": [
    "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
    "Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
    "Without-skill runs consistently fail on table extraction expectations",
    "Skill adds 13s average execution time but improves pass rate by 50%"
  ]
}
```

**Fields:**
- `metadata`: Information about the benchmark run
  - `skill_name`: Name of the skill
  - `timestamp`: When the benchmark was run
  - `evals_run`: List of eval names or IDs
  - `runs_per_configuration`: Number of runs per config (e.g. 3)
- `runs[]`: Individual run results
  - `eval_id`: Numeric eval identifier
  - `eval_name`: Human-readable eval name (used as a section header in the viewer)
  - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
  - `run_number`: Integer run number (1, 2, 3, ...)
  - `result`: Nested object with `pass_rate`, `passed`, `failed`, `total`, `time_seconds`, `tokens`, `tool_calls`, `errors`
- `run_summary`: Statistical aggregates per configuration
  - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean`, `stddev`, `min`, and `max` fields
  - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"` (with_skill minus without_skill)
- `notes`: Freeform observations from the analyzer

**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.

---

## comparison.json

Output from the blind comparator. Located at `<grading-dir>/comparison-N.json`.

```json
{
  "winner": "A",
  "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
  "rubric": {
    "A": {
      "content": {
        "correctness": 5,
        "completeness": 5,
        "accuracy": 4
      },
      "structure": {
        "organization": 4,
        "formatting": 5,
        "usability": 4
      },
      "content_score": 4.7,
      "structure_score": 4.3,
      "overall_score": 9.0
    },
    "B": {
      "content": {
        "correctness": 3,
        "completeness": 2,
        "accuracy": 3
      },
      "structure": {
        "organization": 3,
        "formatting": 2,
        "usability": 3
      },
      "content_score": 2.7,
      "structure_score": 2.7,
      "overall_score": 5.4
    }
  },
  "output_quality": {
    "A": {
      "score": 9,
      "strengths": ["Complete solution", "Well-formatted", "All fields present"],
      "weaknesses": ["Minor style inconsistency in header"]
    },
    "B": {
      "score": 5,
      "strengths": ["Readable output", "Correct basic structure"],
      "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
    }
  },
  "expectation_results": {
    "A": {
      "passed": 4,
      "total": 5,
      "pass_rate": 0.80,
      "details": [
        {"text": "Output includes name", "passed": true}
      ]
    },
    "B": {
      "passed": 3,
      "total": 5,
      "pass_rate": 0.60,
      "details": [
        {"text": "Output includes name", "passed": true}
      ]
    }
  }
}
```

---

## analysis.json

Output from the post-hoc analyzer. Located at `<grading-dir>/analysis.json`.

```json
{
  "comparison_summary": {
    "winner": "A",
    "winner_skill": "path/to/winner/skill",
    "loser_skill": "path/to/loser/skill",
    "comparator_reasoning": "Brief summary of why comparator chose winner"
  },
  "winner_strengths": [
    "Clear step-by-step instructions for handling multi-page documents",
    "Included validation script that caught formatting errors"
  ],
  "loser_weaknesses": [
    "Vague instruction 'process the document appropriately' led to inconsistent behavior",
    "No script for validation, agent had to improvise"
  ],
  "instruction_following": {
    "winner": {
      "score": 9,
      "issues": ["Minor: skipped optional logging step"]
    },
    "loser": {
      "score": 6,
      "issues": [
        "Did not use the skill's formatting template",
        "Invented own approach instead of following step 3"
      ]
    }
  },
  "improvement_suggestions": [
    {
      "priority": "high",
      "category": "instructions",
      "suggestion": "Replace 'process the document appropriately' with explicit steps",
      "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
    }
  ],
  "transcript_insights": {
    "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
    "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
  }
}
```
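The history.json example implies an invariant worth checking before trusting a workspace: exactly one iteration carries `is_current_best: true`, and its version matches the top-level `current_best` pointer. A minimal Python sketch; the `current_best_version` helper is illustrative, not part of skill-creator's tooling:

```python
def current_best_version(history):
    """Cross-check history.json's current_best pointer against the
    is_current_best flags. Raises if zero or several iterations are
    flagged, or if the flagged version disagrees with current_best."""
    flagged = [it["version"] for it in history["iterations"]
               if it["is_current_best"]]
    if len(flagged) != 1 or flagged[0] != history["current_best"]:
        raise ValueError(f"current_best={history['current_best']!r} "
                         f"but flagged versions are {flagged!r}")
    return flagged[0]

history = {
    "current_best": "v2",
    "iterations": [
        {"version": "v0", "is_current_best": False},
        {"version": "v1", "is_current_best": False},
        {"version": "v2", "is_current_best": True},
    ],
}
print(current_best_version(history))  # -> v2
```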
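The `summary` block of grading.json is fully derivable from `expectations[]`, with `pass_rate` rounded to two decimals as in the examples. A sketch, assuming that rounding convention; the `summarize` helper is hypothetical:

```python
def summarize(expectations):
    """Derive grading.json's summary block from graded expectations."""
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # pass_rate is shown rounded to two decimals in the schema examples
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

expectations = [
    {"text": "The output includes the name 'John Smith'", "passed": True},
    {"text": "The spreadsheet has a SUM formula in cell B10", "passed": False},
]
print(summarize(expectations))
# -> {'passed': 1, 'failed': 1, 'total': 2, 'pass_rate': 0.5}
```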
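The per-configuration statistics in benchmark.json's `run_summary` can be computed directly from `runs[]`. A sketch assuming the sample standard deviation (the schema does not say whether `stddev` is sample or population); the `aggregate` helper is illustrative:

```python
import statistics

def aggregate(runs, configuration, metric):
    """Build one {mean, stddev, min, max} entry of run_summary for the
    given configuration ("with_skill" or "without_skill") and metric."""
    values = [r["result"][metric]
              for r in runs if r["configuration"] == configuration]
    return {
        "mean": statistics.mean(values),
        # sample stddev; taken as 0.0 when only a single run exists
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

runs = [
    {"configuration": "with_skill", "result": {"pass_rate": 0.80}},
    {"configuration": "with_skill", "result": {"pass_rate": 0.90}},
    {"configuration": "without_skill", "result": {"pass_rate": 0.35}},
]
summary = aggregate(runs, "with_skill", "pass_rate")
```

Repeating this per metric and per configuration, and formatting the with/without differences as signed strings, reproduces the `run_summary` shape shown above.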
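Following the **Important** note under benchmark.json, a small pre-flight check on hand-written run entries can catch the two mistakes it names: a `config` key instead of `configuration`, and `pass_rate` hoisted to the top level instead of nested under `result`. The `check_run` helper below is a sketch, not part of the viewer:

```python
def check_run(run):
    """Pre-flight check for one hand-written benchmark.json run entry.
    Returns a list of problems; an empty list means the field names the
    viewer depends on are present."""
    problems = []
    if "configuration" not in run:
        problems.append("missing 'configuration' (a 'config' key is not read)")
    elif run["configuration"] not in ("with_skill", "without_skill"):
        problems.append("configuration must be 'with_skill' or 'without_skill'")
    if "pass_rate" not in run.get("result", {}):
        problems.append("pass_rate must be nested under 'result'")
    return problems

# Both mistakes from the Important note at once:
print(check_run({"config": "with_skill", "pass_rate": 0.85}))
```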