skills/project-development/references/pipeline-patterns.md
# Pipeline Patterns for LLM Projects

This reference provides detailed patterns for structuring LLM processing pipelines. These patterns apply to batch processing, data analysis, content generation, and similar workloads.

## The Canonical Pipeline

```
acquire → prepare → process → parse → render
```

### Stage Characteristics

| Stage | Deterministic | Cost | Parallelizable | Idempotent |
|-------|---------------|------|----------------|------------|
| Acquire | Yes | Low | Yes | Yes |
| Prepare | Yes | Low | Yes | Yes |
| Process | No | High | Yes | Yes (with caching) |
| Parse | Yes | Low | Yes | Yes |
| Render | Yes | Low | Partially | Yes |

The key insight: only the Process stage involves LLM calls. All other stages are deterministic transformations that can be debugged, tested, and iterated independently.

## File System State Management

### Directory Structure Pattern

```
project/
├── data/
│   └── {batch_id}/
│       └── {item_id}/
│           ├── raw.json     # Acquire output
│           ├── prompt.md    # Prepare output
│           ├── response.md  # Process output
│           └── parsed.json  # Parse output
├── output/
│   └── {batch_id}/
│       └── index.html       # Render output
└── config/
    └── prompts/
        └── template.md      # Prompt templates
```

### State Checking Pattern

```python
from pathlib import Path

def needs_processing(item_dir: Path, stage: str) -> bool:
    """Check if an item needs processing for a given stage."""
    stage_outputs = {
        "acquire": ["raw.json"],
        "prepare": ["prompt.md"],
        "process": ["response.md"],
        "parse": ["parsed.json"],
    }

    for output_file in stage_outputs[stage]:
        if not (item_dir / output_file).exists():
            return True
    return False
```

### Clean/Retry Pattern

```python
from pathlib import Path

def clean_from_stage(item_dir: Path, stage: str):
    """Remove outputs from stage and all downstream stages."""
    stage_order = ["acquire", "prepare", "process", "parse", "render"]
    stage_outputs = {
        "acquire": ["raw.json"],
        "prepare": ["prompt.md"],
        "process": ["response.md"],
        "parse": ["parsed.json"],
    }

    start_idx = stage_order.index(stage)
    for s in stage_order[start_idx:]:
        for output_file in stage_outputs.get(s, []):
            filepath = item_dir / output_file
            if filepath.exists():
                filepath.unlink()
```

## Parallel Execution Patterns

### ThreadPoolExecutor for LLM Calls

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(items: list, max_workers: int = 10):
    """Process items in parallel with progress tracking."""
    results = []

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_item, item): item for item in items}

        for future in as_completed(futures):
            item = futures[future]
            try:
                result = future.result()
                results.append((item, result, None))
            except Exception as e:
                results.append((item, None, str(e)))

    return results
```

### Batch Size Considerations

- **Small batches (1-10)**: Sequential processing is fine; the overhead of parallelization is not worth it
- **Medium batches (10-100)**: Parallelize with 5-15 workers depending on API rate limits
- **Large batches (100+)**: Consider chunking with checkpoints; implement resume capability

### Rate Limiting

```python
import threading
import time
from functools import wraps

def rate_limited(calls_per_second: float):
    """Decorator to rate limit function calls (thread-safe)."""
    min_interval = 1.0 / calls_per_second
    lock = threading.Lock()
    last_call = [0.0]

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Serialize slot reservation so parallel workers share one budget
            with lock:
                elapsed = time.time() - last_call[0]
                if elapsed < min_interval:
                    time.sleep(min_interval - elapsed)
                last_call[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

## Structured Output Patterns

### Prompt Template Structure

```markdown
[INSTRUCTION BLOCK]
Analyze the following content and provide your response in exactly this format.

[FORMAT SPECIFICATION]
## Section 1: Summary
[Your summary here - 2-3 sentences]

## Section 2: Analysis
- Point 1
- Point 2
- Point 3

## Section 3: Score
Rating: [1-10]
Confidence: [low/medium/high]

[FORMAT ENFORCEMENT]
Follow this format exactly because I will be parsing it programmatically.

---

[CONTENT BLOCK]
# Title: {title}

## Content
{content}

## Additional Context
{context}
```

### Parsing Patterns

**Section Extraction**

```python
import re

def extract_section(text: str, section_name: str) -> str | None:
    """Extract content between section headers."""
    # Match section header with optional markdown formatting;
    # stop at the next header or at end of text
    pattern = rf'(?:^|\n)(?:#+ *)?{re.escape(section_name)}[:\s]*\n(.*?)(?=\n#+ |\Z)'
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

**Structured Field Extraction**

```python
def extract_field(text: str, field_name: str) -> str | None:
    """Extract value after field label."""
    # Handle: "Field: value" or "Field - value" or "**Field**: value"
    pattern = rf'(?:\*\*)?{re.escape(field_name)}(?:\*\*)?[\s:\-]+([^\n]+)'
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1).strip() if match else None
```

**List Extraction**

```python
def extract_list_items(text: str, section_name: str) -> list[str]:
    """Extract bullet points from a section."""
    section = extract_section(text, section_name)
    if not section:
        return []

    # Match lines starting with -, *, or a number followed by . or )
    items = re.findall(r'^\s*(?:[-*]|\d+[.)])\s+(.+)$', section, re.MULTILINE)
    return [item.strip() for item in items]
```

**Score Extraction with Validation**

```python
def extract_score(text: str, field_name: str, min_val: int, max_val: int) -> int | None:
    """Extract and validate numeric score."""
    raw = extract_field(text, field_name)
    if not raw:
        return None

    # Extract first number from the value
    match = re.search(r'\d+', raw)
    if not match:
        return None

    score = int(match.group())
    return max(min_val, min(max_val, score))  # Clamp to valid range
```

### Graceful Degradation

```python
from dataclasses import dataclass, field

@dataclass
class ParseResult:
    summary: str = ""
    score: int | None = None
    items: list[str] = field(default_factory=list)
    parse_errors: list[str] = field(default_factory=list)

def parse_response(text: str) -> ParseResult:
    """Parse LLM response with graceful error handling."""
    result = ParseResult()

    # Try each field; record errors but continue
    try:
        summary = extract_section(text, "Summary")
        if summary is None:
            result.parse_errors.append("Summary section missing")
        result.summary = summary or ""
    except Exception as e:
        result.parse_errors.append(f"Summary extraction failed: {e}")

    try:
        result.score = extract_score(text, "Rating", 1, 10)
    except Exception as e:
        result.parse_errors.append(f"Score extraction failed: {e}")

    try:
        result.items = extract_list_items(text, "Analysis")
    except Exception as e:
        result.parse_errors.append(f"Items extraction failed: {e}")

    return result
```

## Error Handling Patterns

### Retry with Exponential Backoff

```python
import time
from functools import wraps

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Retry decorator with exponential backoff."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt < max_retries - 1:
                        delay = base_delay * (2 ** attempt)
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator
```

### Error Logging Pattern

```python
import json
from datetime import datetime
from pathlib import Path

def log_error(item_dir: Path, stage: str, error: str, context: dict | None = None):
    """Log error to file for later analysis."""
    error_file = item_dir / "errors.jsonl"

    error_record = {
        "timestamp": datetime.now().isoformat(),
        "stage": stage,
        "error": error,
        "context": context or {},
    }

    with open(error_file, "a") as f:
        f.write(json.dumps(error_record) + "\n")
```

### Partial Success Handling

```python
def process_batch_with_partial_success(items: list) -> tuple[list, list]:
    """Process batch, separating successes from failures."""
    successes = []
    failures = []

    for item in items:
        try:
            result = process_item(item)
            successes.append((item, result))
        except Exception as e:
            failures.append((item, str(e)))
            log_error(item.directory, "process", str(e))

    # Report summary
    print(f"Processed {len(items)} items: {len(successes)} succeeded, {len(failures)} failed")

    return successes, failures
```

## Cost Estimation Patterns

### Token Counting

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens for cost estimation."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")

    return len(encoding.encode(text))

def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate cost in dollars."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_mtok
    output_cost = (output_tokens / 1_000_000) * output_price_per_mtok
    return input_cost + output_cost
```

### Batch Cost Estimation

```python
def estimate_batch_cost(
    items: list,
    prompt_template: str,
    avg_output_tokens: int = 1000,
    model_pricing: dict | None = None,
) -> dict:
    """Estimate total cost for a batch."""
    model_pricing = model_pricing or {
        "input_price_per_mtok": 3.00,    # Example: GPT-4 Turbo input
        "output_price_per_mtok": 15.00,  # Example: GPT-4 Turbo output
    }

    total_input_tokens = 0
    for item in items:
        prompt = format_prompt(prompt_template, item)
        total_input_tokens += count_tokens(prompt)

    total_output_tokens = len(items) * avg_output_tokens

    estimated_cost = estimate_cost(
        total_input_tokens,
        total_output_tokens,
        **model_pricing,
    )

    return {
        "item_count": len(items),
        "total_input_tokens": total_input_tokens,
        "total_output_tokens": total_output_tokens,
        "estimated_cost_usd": estimated_cost,
        "avg_input_tokens_per_item": total_input_tokens / len(items),
        "cost_per_item_usd": estimated_cost / len(items),
    }
```

## CLI Pattern

### Standard CLI Structure

```python
import argparse
from datetime import date

def main():
    parser = argparse.ArgumentParser(description="LLM Processing Pipeline")

    parser.add_argument(
        "stage",
        choices=["acquire", "prepare", "process", "parse", "render", "all", "clean"],
        help="Pipeline stage to run",
    )
    parser.add_argument(
        "--batch-id",
        default=None,
        help="Batch identifier (default: today's date)",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of items (for testing)",
    )
    parser.add_argument(
        "--workers",
        type=int,
        default=10,
        help="Number of parallel workers for processing",
    )
    parser.add_argument(
        "--model",
        default="gpt-4-turbo",
        help="Model to use for processing",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Estimate costs without processing",
    )
    parser.add_argument(
        "--clean-stage",
        choices=["acquire", "prepare", "process", "parse"],
        help="For clean: only clean this stage and downstream",
    )

    args = parser.parse_args()

    batch_id = args.batch_id or date.today().isoformat()

    if args.stage == "clean":
        stage_clean(batch_id, args.clean_stage)
    elif args.dry_run:
        estimate_costs(batch_id, args.limit)
    else:
        run_pipeline(batch_id, args.stage, args.limit, args.workers, args.model)

if __name__ == "__main__":
    main()
```

## Rendering Patterns

### Static HTML Output

```python
import html
import json
from pathlib import Path

def render_html(data: list[dict], output_path: Path, template: str):
    """Render data to static HTML file."""
    # Escape data for JavaScript embedding
    data_json = json.dumps([
        {k: html.escape(str(v)) if isinstance(v, str) else v
         for k, v in item.items()}
        for item in data
    ])

    html_content = template.replace("{{DATA_JSON}}", data_json)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        f.write(html_content)
```

### Incremental Output

```python
from pathlib import Path

def render_incremental(items: list, output_dir: Path):
    """Render each item as it completes, plus index."""
    output_dir.mkdir(parents=True, exist_ok=True)

    # Render individual item pages
    for item in items:
        item_html = render_item(item)
        item_path = output_dir / f"{item.id}.html"
        with open(item_path, "w") as f:
            f.write(item_html)

    # Render index linking to all items
    index_html = render_index(items)
    with open(output_dir / "index.html", "w") as f:
        f.write(index_html)
```

## Checkpoint and Resume Pattern

For long-running pipelines:

```python
import json
from pathlib import Path

class PipelineCheckpoint:
    def __init__(self, checkpoint_file: Path):
        self.checkpoint_file = checkpoint_file
        self.state = self._load()

    def _load(self) -> dict:
        if self.checkpoint_file.exists():
            with open(self.checkpoint_file) as f:
                return json.load(f)
        return {"completed": [], "failed": [], "last_item": None}

    def save(self):
        with open(self.checkpoint_file, "w") as f:
            json.dump(self.state, f, indent=2)

    def mark_complete(self, item_id: str):
        self.state["completed"].append(item_id)
        self.state["last_item"] = item_id
        self.save()

    def mark_failed(self, item_id: str, error: str):
        self.state["failed"].append({"id": item_id, "error": error})
        self.save()

    def get_remaining(self, all_items: list[str]) -> list[str]:
        completed = set(self.state["completed"])
        return [item for item in all_items if item not in completed]
```

## Testing Patterns

### Stage Unit Tests

```python
def test_prepare_stage():
    """Test prompt generation independently."""
    test_item = {"id": "test", "content": "Sample content"}
    prompt = prepare_prompt(test_item)

    assert "Sample content" in prompt
    assert "## Section 1" in prompt  # Format markers present

def test_parse_stage():
    """Test parsing with known good output."""
    test_response = """
## Summary
This is a test summary.

## Score
Rating: 7
"""

    result = parse_response(test_response)
    assert result.summary == "This is a test summary."
    assert result.score == 7

def test_parse_stage_malformed():
    """Test parsing handles malformed output."""
    test_response = "Some random text without sections"

    result = parse_response(test_response)
    assert result.summary == ""
    assert result.score is None
    assert len(result.parse_errors) > 0
```

### Integration Test Pattern

```python
import shutil
from pathlib import Path

def test_pipeline_end_to_end():
    """Test full pipeline with single item."""
    test_dir = Path("test_data")
    test_item = create_test_item()

    try:
        # Run each stage
        acquire_result = stage_acquire(test_dir, [test_item])
        assert (test_dir / test_item.id / "raw.json").exists()

        prepare_result = stage_prepare(test_dir)
        assert (test_dir / test_item.id / "prompt.md").exists()

        # Skip process stage in unit tests (costs money)
        # Create mock response instead
        mock_response(test_dir / test_item.id)

        parse_result = stage_parse(test_dir)
        assert (test_dir / test_item.id / "parsed.json").exists()

    finally:
        # Cleanup
        shutil.rmtree(test_dir, ignore_errors=True)
```
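The integration test above calls a `mock_response` helper without defining it. A minimal sketch, assuming the `response.md` filename from the directory layout earlier in this reference; the response body is an arbitrary placeholder that follows the prompt template's format specification:

```python
from pathlib import Path

def mock_response(item_dir: Path) -> None:
    """Write a canned response.md so parse and render stages run without LLM calls."""
    item_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder body matching the [FORMAT SPECIFICATION] sections
    (item_dir / "response.md").write_text(
        "## Summary\n"
        "This is a mock summary.\n"
        "\n"
        "## Analysis\n"
        "- Mock point 1\n"
        "- Mock point 2\n"
        "\n"
        "## Score\n"
        "Rating: 5\n"
        "Confidence: medium\n"
    )
```

Keeping the mock in the same format the parser expects means the parse-stage assertions exercise the real extraction code, not a special test path.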