A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
skills/project-development/SKILL.md
---
name: project-development
description: This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches.
---

# Project Development Methodology

This skill covers the principles for identifying tasks suited to LLM processing, designing effective project architectures, and iterating rapidly using agent-assisted development. The methodology applies whether building a batch processing pipeline, a multi-agent research system, or an interactive agent application.

## When to Activate

Activate this skill when:
- Starting a new project that might benefit from LLM processing
- Evaluating whether a task is well-suited for agents versus traditional code
- Designing the architecture for an LLM-powered application
- Planning a batch processing pipeline with structured outputs
- Choosing between single-agent and multi-agent approaches
- Estimating costs and timelines for LLM-heavy projects

## Core Concepts

### Task-Model Fit Recognition

Evaluate task-model fit before writing any code, because building automation on a fundamentally mismatched task wastes days of effort.
Run every proposed task through these two tables to decide whether to proceed or stop.

**Proceed when the task has these characteristics:**

| Characteristic | Rationale |
|----------------|-----------|
| Synthesis across sources | LLMs combine information from multiple inputs better than rule-based alternatives |
| Subjective judgment with rubrics | Grading, evaluation, and classification with criteria map naturally to language reasoning |
| Natural language output | When the goal is human-readable text, LLMs deliver it natively |
| Error tolerance | Individual failures do not break the overall system, so LLM non-determinism is acceptable |
| Batch processing | No conversational state required between items, which keeps context clean |
| Domain knowledge in training | The model already has relevant context, reducing prompt engineering overhead |

**Stop when the task has these characteristics:**

| Characteristic | Rationale |
|----------------|-----------|
| Precise computation | Math, counting, and exact algorithms are unreliable in language models |
| Real-time requirements | LLM latency is too high for sub-second responses |
| Perfect accuracy requirements | Hallucination risk makes 100% accuracy impossible |
| Proprietary data dependence | The model lacks necessary context and cannot acquire it from prompts alone |
| Sequential dependencies | Each step depends heavily on the previous result, compounding errors |
| Deterministic output requirements | Same input must produce identical output, which LLMs cannot guarantee |

### The Manual Prototype Step

Always validate task-model fit with a manual test before investing in automation.
Copy one representative input into the model interface, evaluate the output quality, and use the result to answer these questions:

- Does the model have the knowledge required for this task?
- Can the model produce output in the format needed?
- What level of quality should be expected at scale?
- Are there obvious failure modes to address?

Do this because a failed manual prototype predicts a failed automated system, while a successful one provides both a quality baseline and a prompt-design template. The test takes minutes and prevents hours of wasted development.

### Pipeline Architecture

Structure LLM projects as staged pipelines, because separating deterministic and non-deterministic stages enables fast iteration and cost control. Design each stage to be:

- **Discrete**: Clear boundaries between stages so each can be debugged independently
- **Idempotent**: Re-running produces the same result, preventing duplicate work
- **Cacheable**: Intermediate results persist to disk, avoiding expensive re-computation
- **Independent**: Each stage can run separately, enabling selective re-execution

**Use this canonical pipeline structure:**

```
acquire -> prepare -> process -> parse -> render
```

1. **Acquire**: Fetch raw data from sources (APIs, files, databases)
2. **Prepare**: Transform data into prompt format
3. **Process**: Execute LLM calls (the expensive, non-deterministic step)
4. **Parse**: Extract structured data from LLM outputs
5. **Render**: Generate final outputs (reports, files, visualizations)

Stages 1, 2, 4, and 5 are deterministic. Stage 3 is non-deterministic and expensive.
Maintain this separation because it allows re-running the expensive LLM stage only when necessary, while iterating quickly on parsing and rendering.

### File System as State Machine

Use the file system to track pipeline state rather than databases or in-memory structures, because file existence provides natural idempotency and human-readable debugging.

```
data/{id}/
  raw.json      # acquire stage complete
  prompt.md     # prepare stage complete
  response.md   # process stage complete
  parsed.json   # parse stage complete
```

Check whether an item needs processing by testing whether its output file exists. Re-run a stage by deleting its output file and downstream files. Debug by reading the intermediate files directly. This pattern works because each directory is independent, enabling simple parallelization and trivial caching.

### Structured Output Design

Design prompts for structured, parseable outputs, because prompt design directly determines parsing reliability. Include these elements in every structured prompt:

1. **Section markers**: Explicit headers or prefixes that parsers can match on
2. **Format examples**: Show exactly what output should look like
3. **Rationale disclosure**: State "I will be parsing this programmatically" so the model prioritizes format compliance
4. **Constrained values**: Enumerated options, score ranges, and fixed formats

Build parsers that handle LLM output variations gracefully, because LLMs do not follow instructions perfectly.
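For instance, a tolerant section parser might look like the following sketch (the markdown-header marker format and the section names in the docstring are assumptions for illustration, not a required convention):

```python
import re

def parse_sections(text: str, expected: list[str]) -> tuple[dict, list]:
    """Pull '## <name>' sections out of an LLM response, tolerating
    case differences, extra whitespace, and stray colons around the
    markers.  A missing section yields an empty default and is
    recorded for review instead of raising an exception."""
    parsed, missing = {}, []
    for name in expected:
        # Matches '## Summary', '##summary:', ' #  SUMMARY ', etc.,
        # capturing everything up to the next header or end of text.
        pattern = rf"^\s*#+\s*{re.escape(name)}\s*:?\s*$(.*?)(?=^\s*#+\s|\Z)"
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE | re.DOTALL)
        if match:
            parsed[name] = match.group(1).strip()
        else:
            parsed[name] = ""       # sensible default instead of a crash
            missing.append(name)    # log these for manual review
    return parsed, missing
```

Returning the list of missing sections keeps each failure visible without stopping the batch.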
Use regex patterns flexible enough for minor formatting variations, provide sensible defaults when sections are missing, and log parsing failures for review rather than crashing.

### Agent-Assisted Development

Use agent-capable models to accelerate development through rapid iteration: describe the project goal and constraints, let the agent generate an initial implementation, test and iterate on specific failures, then refine prompts and architecture based on results.

Adopt these practices because they keep agent output focused and high-quality:
- Provide clear, specific requirements upfront to reduce revision cycles
- Break large projects into discrete components so each can be validated independently
- Test each component before moving to the next to catch failures early
- Keep the agent focused on one task at a time to prevent context degradation

### Cost and Scale Estimation

Estimate LLM processing costs before starting, because token costs compound quickly at scale and late discovery of budget overruns forces costly rework. Use this formula:

```
Total cost = (items x tokens_per_item x price_per_token) + API overhead
```

For batch processing, estimate input tokens per item (prompt + context), estimate output tokens per item (typical response length), multiply by the item count, and add a 20-30% buffer for retries and failures.

Track actual costs during development. If costs significantly exceed estimates, reduce context length through truncation, use smaller models for simpler items, cache and reuse partial results, or add parallel processing to reduce wall-clock time.

## Detailed Topics

### Choosing Single vs Multi-Agent Architecture

Default to single-agent pipelines for batch processing with independent items, because they are simpler to manage, cheaper to run, and easier to debug.
Escalate to multi-agent architectures only when one of these conditions holds:

- Parallel exploration of different aspects is required
- The task exceeds single context window capacity
- Specialized sub-agents demonstrably improve quality on benchmarks

Choose multi-agent for context isolation, not role anthropomorphization. Sub-agents get fresh context windows for focused subtasks, which prevents context degradation on long-running tasks.

See the `multi-agent-patterns` skill for detailed architecture guidance.

### Architectural Reduction

Start with minimal architecture and add complexity only when production evidence proves it necessary, because over-engineered scaffolding often constrains rather than enables model performance.

Vercel's d0 agent achieved a 100% success rate (up from 80%) by reducing from 17 specialized tools to 2 primitives: bash command execution and SQL. The file system agent pattern uses standard Unix utilities (grep, cat, find, ls) instead of custom exploration tools.

**Reduce when:**
- The data layer is well-documented and consistently structured
- The model has sufficient reasoning capability
- Specialized tools are constraining rather than enabling
- More time is spent maintaining scaffolding than improving outcomes

**Add complexity when:**
- The underlying data is messy, inconsistent, or poorly documented
- The domain requires specialized knowledge the model lacks
- Safety constraints require limiting agent capabilities
- Operations are truly complex and benefit from structured workflows

See the `tool-design` skill for detailed tool architecture guidance.

### Iteration and Refactoring

Plan for multiple architectural iterations from the start, because production agent systems at scale always require refactoring. Manus has refactored its agent framework five times since launch.
The Bitter Lesson suggests that structures added to work around current model limitations become constraints as models improve.

Build for change by following these practices:
- Keep architecture simple and unopinionated so refactoring is cheap
- Test across model generations to verify the harness is not limiting performance
- Design systems that benefit from model improvements rather than locking in limitations

## Practical Guidance

### Project Planning Template

Follow this template in order, because each step validates assumptions before the next step invests effort.

1. **Task Analysis**
   - Define the input and desired output explicitly
   - Classify the task: synthesis, generation, classification, or analysis
   - Set an acceptable error rate based on business impact
   - Estimate the value per successful completion to justify costs

2. **Manual Validation**
   - Test one representative example with the target model
   - Evaluate output quality and format against requirements
   - Identify failure modes that need parser hardening or prompt revision
   - Estimate tokens per item for cost projection

3. **Architecture Selection**
   - Choose single pipeline vs multi-agent based on the criteria above
   - Identify required tools and data sources
   - Design the storage and caching strategy using file-system state
   - Plan the parallelization approach for the process stage

4. **Cost Estimation**
   - Calculate items x tokens x price with a 20-30% buffer
   - Estimate development time for each pipeline stage
   - Identify infrastructure requirements (API keys, storage, compute)
   - Project ongoing operational costs for production runs

5.
   **Development Plan**
   - Implement stage-by-stage, testing each stage before proceeding
   - Define a testing strategy per stage with expected outputs
   - Set iteration milestones tied to quality metrics
   - Plan the deployment approach with rollback capability

## Examples

**Example 1: Batch Analysis Pipeline (Karpathy's HN Time Capsule)**

Task: Analyze 930 HN discussions from 10 years ago with hindsight grading.

Architecture:
- 5-stage pipeline: fetch -> prompt -> analyze -> parse -> render
- File system state: data/{date}/{item_id}/ with stage output files
- Structured output: 6 sections with explicit format requirements
- Parallel execution: 15 workers for LLM calls

Results: $58 total cost, ~1 hour execution, static HTML output.

**Example 2: Architectural Reduction (Vercel d0)**

Task: Text-to-SQL agent for internal analytics.

Before: 17 specialized tools, 80% success rate, 274s average execution.

After: 2 tools (bash + SQL), 100% success rate, 77s average execution.

Key insight: The semantic layer was already good documentation. Claude just needed access to read files directly.

See [Case Studies](./references/case-studies.md) for detailed analysis.

## Guidelines

1. Validate task-model fit with manual prototyping before building automation
2. Structure pipelines as discrete, idempotent, cacheable stages
3. Use the file system for state management and debugging
4. Design prompts for structured, parseable outputs with explicit format examples
5. Start with minimal architecture; add complexity only when proven necessary
6. Estimate costs early and track them throughout development
7. Build robust parsers that handle LLM output variations
8. Expect and plan for multiple architectural iterations
9. Test whether scaffolding helps or constrains model performance
10. Use agent-assisted development for rapid iteration on implementation

## Gotchas

1.
   **Skipping manual validation**: Building automation before verifying the model can do the task wastes significant time when the approach is fundamentally flawed. Always run one representative example through the model interface first.
2. **Monolithic pipelines**: Combining all stages into one script makes debugging and iteration difficult. Separate stages with persistent intermediate outputs so each can be re-run independently.
3. **Over-constraining the model**: Adding guardrails, pre-filtering, and validation logic that the model could handle on its own reduces performance. Test whether scaffolding helps or hurts before keeping it.
4. **Ignoring costs until production**: Token costs compound quickly at scale. Estimate and track from the beginning to avoid budget surprises that force architectural rework.
5. **Perfect parsing requirements**: Expecting LLMs to follow format instructions perfectly leads to brittle systems. Build robust parsers that handle variations and log failures for review.
6. **Premature optimization**: Adding caching, parallelization, and optimization before the basic pipeline works correctly wastes effort on code that may be discarded during iteration.
7. **Model version lock-in**: Building pipelines that only work with one specific model version creates fragile systems. Test across model generations and abstract the LLM call layer so models can be swapped without rewriting pipeline logic.
8. **Evaluation-less deployment**: Shipping agent pipelines without measuring output quality means regressions go undetected.
Define quality metrics during development and run evaluation checks before and after every model or prompt change.

## Integration

This skill connects to:
- context-fundamentals - Understanding context constraints for prompt design
- tool-design - Designing tools for agent systems within pipelines
- multi-agent-patterns - When to use multi-agent versus single pipelines
- evaluation - Evaluating pipeline outputs and agent performance
- context-compression - Managing context when pipelines exceed limits

## References

Internal references:
- [Case Studies](./references/case-studies.md) - Read when: evaluating architecture tradeoffs or reviewing real-world pipeline implementations (Karpathy HN Capsule, Vercel d0, Manus patterns)
- [Pipeline Patterns](./references/pipeline-patterns.md) - Read when: designing a new pipeline stage layout, choosing caching strategies, or debugging stage boundaries

Related skills in this collection:
- tool-design - Tool architecture and reduction patterns
- multi-agent-patterns - When to use multi-agent architectures
- evaluation - Output evaluation frameworks

External resources:
- Karpathy's HN Time Capsule project: https://github.com/karpathy/hn-time-capsule
- Vercel d0 architectural reduction: https://vercel.com/blog/we-removed-80-percent-of-our-agents-tools
- Manus context engineering: Peak Ji's blog on context engineering lessons
- Anthropic multi-agent research: How we built our multi-agent research system

---

## Skill Metadata

**Created**: 2025-12-25
**Last Updated**: 2026-03-17
**Author**: Agent Skills for Context Engineering Contributors
**Version**: 1.1.0