A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/project-development/SKILL.md
---
name: project-development
description: This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
---

# Project Development Methodology

This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.

## When to Activate

Activate this skill when:
- Starting a new project that might benefit from LLM processing
- Evaluating whether a task is well-suited for agents versus traditional code
- Designing the architecture for an LLM-powered application
- Planning a batch processing pipeline with structured outputs
- Choosing between single-agent and multi-agent approaches
- Estimating costs and timelines for LLM-heavy projects

## Core Concepts

### Task-Model Fit Recognition

Evaluate task-model fit before writing any code, because building automation on a fundamentally mismatched task wastes days of effort.
Run every proposed task through these two tables to decide whether to proceed or stop.

**Proceed when the task has these characteristics:**

| Characteristic | Rationale |
|----------------|-----------|
| Synthesis across sources | LLMs combine information from multiple inputs better than rule-based alternatives |
| Subjective judgment with rubrics | Grading, evaluation, and classification with criteria map naturally to language reasoning |
| Natural language output | When the goal is human-readable text, LLMs deliver it natively |
| Error tolerance | Individual failures do not break the overall system, so LLM non-determinism is acceptable |
| Batch processing | No conversational state required between items, which keeps context clean |
| Domain knowledge in training | The model already has relevant context, reducing prompt engineering overhead |

**Stop when the task has these characteristics:**

| Characteristic | Rationale |
|----------------|-----------|
| Precise computation | Math, counting, and exact algorithms are unreliable in language models |
| Real-time requirements | LLM latency is too high for sub-second responses |
| Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
| Proprietary data dependence | The model lacks necessary context and cannot acquire it from prompts alone |
| Sequential dependencies | Each step depends heavily on the previous result, compounding errors |
| Deterministic output requirements | Same input must produce identical output, which LLMs cannot guarantee |

### The Manual Prototype Step

Always validate task-model fit with a manual test before investing in automation.
Copy one representative input into the model interface, evaluate the output quality, and use the result to answer these questions:

- Does the model have the knowledge required for this task?
- Can the model produce output in the format needed?
- What level of quality should be expected at scale?
- Are there obvious failure modes to address?

Do this because a failed manual prototype predicts a failed automated system, while a successful one provides both a quality baseline and a prompt-design template. The test takes minutes and prevents hours of wasted development.

### Pipeline Architecture

Structure LLM projects as staged pipelines, because separating deterministic and non-deterministic stages enables fast iteration and cost control. Design each stage to be:

- **Discrete**: Clear boundaries between stages so each can be debugged independently
- **Idempotent**: Re-running produces the same result, preventing duplicate work
- **Cacheable**: Intermediate results persist to disk, avoiding expensive re-computation
- **Independent**: Each stage can run separately, enabling selective re-execution

**Use this canonical pipeline structure:**

```
acquire -> prepare -> process -> parse -> render
```

1. **Acquire**: Fetch raw data from sources (APIs, files, databases)
2. **Prepare**: Transform data into prompt format
3. **Process**: Execute LLM calls (the expensive, non-deterministic step)
4. **Parse**: Extract structured data from LLM outputs
5. **Render**: Generate final outputs (reports, files, visualizations)

Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive.
Maintain this separation because it allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.

### File System as State Machine

Use the file system to track pipeline state rather than databases or in-memory structures, because file existence provides natural idempotency and human-readable debugging.

```
data/{id}/
  raw.json      # acquire stage complete
  prompt.md     # prepare stage complete
  response.md   # process stage complete
  parsed.json   # parse stage complete
```

Check whether an item needs processing by testing whether its output file exists. Re-run a stage by deleting its output file and downstream files. Debug by reading the intermediate files directly. This pattern works because each directory is independent, enabling simple parallelization and trivial caching.

### Structured Output Design

Design prompts for structured, parseable outputs, because prompt design directly determines parsing reliability. Include these elements in every structured prompt:

1. **Section markers**: Explicit headers or prefixes that parsers can match on
2. **Format examples**: Show exactly what output should look like
3. **Rationale disclosure**: State "I will be parsing this programmatically" so the model prioritizes format compliance
4. **Constrained values**: Enumerated options, score ranges, and fixed formats

Build parsers that handle LLM output variations gracefully, because LLMs do not follow instructions perfectly.
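For instance, a tolerant section parser might look like the following sketch (the markdown-header marker format and the section names in the docstring are assumptions for illustration, not a required convention):

```python
import re

def parse_sections(text: str, expected: list[str]) -> tuple[dict, list]:
    """Pull '## <name>' sections out of an LLM response, tolerating
    case differences, extra whitespace, and stray colons around the
    markers.  A missing section yields an empty default and is
    recorded for review instead of raising an exception."""
    parsed, missing = {}, []
    for name in expected:
        # Matches '## Summary', '##summary:', ' #  SUMMARY ', etc.,
        # capturing everything up to the next header or end of text.
        pattern = rf"^\s*#+\s*{re.escape(name)}\s*:?\s*$(.*?)(?=^\s*#+\s|\Z)"
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE | re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
        else:
            parsed[name] = ""       # sensible default instead of a crash
            missing.append(name)    # log these for manual review
    return parsed, missing
```

Returning the list of missing sections keeps each failure visible without stopping the batch.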
Use regex patterns flexible enough for minor formatting variations, provide sensible defaults when sections are missing, and log parsing failures for review rather than crashing.

### Agent-Assisted Development

Use agent-capable models to accelerate development through rapid iteration: describe the project goal and constraints, let the agent generate an initial implementation, test and iterate on specific failures, then refine prompts and architecture based on results.

Adopt these practices because they keep agent output focused and high-quality:
- Provide clear, specific requirements upfront to reduce revision cycles
- Break large projects into discrete components so each can be validated independently
- Test each component before moving to the next to catch failures early
- Keep the agent focused on one task at a time to prevent context degradation

### Cost and Scale Estimation

Estimate LLM processing costs before starting, because token costs compound quickly at scale and late discovery of budget overruns forces costly rework. Use this formula:

```
Total cost = (items x tokens_per_item x price_per_token) + API overhead
```

For batch processing, estimate input tokens per item (prompt + context), estimate output tokens per item (typical response length), multiply by the item count, and add a 20-30% buffer for retries and failures.

Track actual costs during development. If costs significantly exceed estimates, reduce context length through truncation, use smaller models for simpler items, cache and reuse partial results, or add parallel processing to reduce wall-clock time.

## Detailed Topics

### Choosing Single vs Multi-Agent Architecture

Default to single-agent pipelines for batch processing with independent items, because they are simpler to manage, cheaper to run, and easier to debug.
Escalate to multi-agent architectures only when one of these conditions holds:

- Parallel exploration of different aspects is required
- The task exceeds single context window capacity
- Specialized sub-agents demonstrably improve quality on benchmarks

Choose multi-agent for context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks, which prevents context degradation on long-running tasks.

See the `multi-agent-patterns` skill for detailed architecture guidance.

### Architectural Reduction

Start with minimal architecture and add complexity only when production evidence proves it necessary, because over-engineered scaffolding often constrains rather than enables model performance.

Vercel's d0 agent achieved a 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.

**Reduce when:**
- The data layer is well-documented and consistently structured
- The model has sufficient reasoning capability
- Specialized tools are constraining rather than enabling
- More time is spent maintaining scaffolding than improving outcomes

**Add complexity when:**
- The underlying data is messy, inconsistent, or poorly documented
- The domain requires specialized knowledge the model lacks
- Safety constraints require limiting agent capabilities
- Operations are truly complex and benefit from structured workflows

See the `tool-design` skill for detailed tool architecture guidance.

### Iteration and Refactoring

Plan for multiple architectural iterations from the start, because production agent systems at scale always require refactoring. Manus has refactored its agent framework five times since launch.
The Bitter Lesson suggests that structures added to work around current model limitations become constraints as models improve.

Build for change by following these practices:
- Keep architecture simple and unopinionated so refactoring is cheap
- Test across model generations to verify the harness is not limiting performance
- Design systems that benefit from model improvements rather than locking in limitations

## Practical Guidance

### Project Planning Template

Follow this template in order, because each step validates assumptions before the next step invests effort.

1. **Task Analysis**
   - Define the input and desired output explicitly
   - Classify the task: synthesis, generation, classification, or analysis
   - Set an acceptable error rate based on business impact
   - Estimate the value per successful completion to justify costs

2. **Manual Validation**
   - Test one representative example with the target model
   - Evaluate output quality and format against requirements
   - Identify failure modes that need parser hardening or prompt revision
   - Estimate tokens per item for cost projection

3. **Architecture Selection**
   - Choose single pipeline vs multi-agent based on the criteria above
   - Identify required tools and data sources
   - Design the storage and caching strategy using file-system state
   - Plan the parallelization approach for the process stage

4. **Cost Estimation**
   - Calculate items x tokens x price with a 20-30% buffer
   - Estimate development time for each pipeline stage
   - Identify infrastructure requirements (API keys, storage, compute)
   - Project ongoing operational costs for production runs

5.
   **Development Plan**
   - Implement stage-by-stage, testing each stage before proceeding
   - Define a testing strategy per stage with expected outputs
   - Set iteration milestones tied to quality metrics
   - Plan the deployment approach with rollback capability

## Examples

**Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)**

Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.

Architecture:
- 5-stage pipeline: fetch -> prompt -> analyze -> parse -> render
- File system state: data/{date}/{item_id}/ with stage output files
- Structured output: 6 sections with explicit format requirements
- Parallel execution: 15 workers for LLM calls

Results: $58 total cost, ~1 hour execution, static HTML output.

**Example 2: Architectural Reduction (Vercel d0)**

Task: Text-to-SQL agent for internal analytics.

Before: 17 specialized tools, 80% success rate, 274s average execution.

After: 2 tools (bash + SQL), 100% success rate, 77s average execution.

Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.

See [Case Studies](./references/case-studies.md) for detailed analysis.

## Guidelines

1. Validate task-model fit with manual prototyping before building automation
2. Structure pipelines as discrete, idempotent, cacheable stages
3. Use the file system for state management and debugging
4. Design prompts for structured, parseable outputs with explicit format examples
5. Start with minimal architecture; add complexity only when proven necessary
6. Estimate costs early and track them throughout development
7. Build robust parsers that handle LLM output variations
8. Expect and plan for multiple architectural iterations
9. Test whether scaffolding helps or constrains model performance
10. Use agent-assisted development for rapid iteration on implementation

## Gotchas

1.
   **Skipping manual validation**: Building automation before verifying the model can do the task wastes significant time when the approach is fundamentally flawed. Always run one representative example through the model interface first.
2. **Monolithic pipelines**: Combining all stages into one script makes debugging and iteration difficult. Separate stages with persistent intermediate outputs so each can be re-run independently.
3. **Over-constraining the model**: Adding guardrails, pre-filtering, and validation logic that the model could handle on its own reduces performance. Test whether scaffolding helps or hurts before keeping it.
4. **Ignoring costs until production**: Token costs compound quickly at scale. Estimate and track from the beginning to avoid budget surprises that force architectural rework.
5. **Perfect parsing requirements**: Expecting LLMs to follow format instructions perfectly leads to brittle systems. Build robust parsers that handle variations and log failures for review.
6. **Premature optimization**: Adding caching, parallelization, and optimization before the basic pipeline works correctly wastes effort on code that may be discarded during iteration.
7. **Model version lock-in**: Building pipelines that only work with one specific model version creates fragile systems. Test across model generations and abstract the LLM call layer so models can be swapped without rewriting pipeline logic.
8. **Evaluation-less deployment**: Shipping agent pipelines without measuring output quality means regressions go undetected.
Define quality metrics during development and run evaluation checks before and after every model or prompt change.

## Integration

This skill connects to:
- context-fundamentals - Understanding context constraints for prompt design
- tool-design - Designing tools for agent systems within pipelines
- multi-agent-patterns - When to use multi-agent versus single pipelines
- evaluation - Evaluating pipeline outputs and agent performance
- context-compression - Managing context when pipelines exceed limits

## References

Internal references:
- [Case Studies](./references/case-studies.md) - Read when: evaluating architecture tradeoffs or reviewing real-world pipeline implementations (Karpathy HN Capsule, Vercel d0, Manus patterns)
- [Pipeline Patterns](./references/pipeline-patterns.md) - Read when: designing a new pipeline stage layout, choosing caching strategies, or debugging stage boundaries

Related skills in this collection:
- tool-design - Tool architecture and reduction patterns
- multi-agent-patterns - When to use multi-agent architectures
- evaluation - Output evaluation frameworks

External resources:
- Karpathy's HN Time Capsule project: https://github.com/karpathy/hn-time-capsule
- Vercel d0 architectural reduction: https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools
- Manus context engineering: Peak Ji's blog on context engineering lessons
- Anthropic multi-agent research: How we built our multi-agent research system

---

## Skill Metadata

**Created**: 2025-12-25
**Last Updated**: 2026-03-17
**Author**: Agent Skills for Context Engineering Contributors
**Version**: 1.1.0