A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/book-sft-pipeline/SKILL.md
---
name: book-sft-pipeline
description: This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
version: 2.0.0
---

# Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

## When to Activate

Activate this skill when:
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models (8B or less) on limited data

## Core Concepts

### The Three Pillars of Book SFT

**1. Intelligent Segmentation**
Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

**2. Diverse Instruction Generation**
Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

**3. Style Over Content**
The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots.
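Each pillar can be enforced mechanically. For pillar 1, a minimal validity check — a sketch only; the helper name and the punctuation set are illustrative assumptions:

```python
def is_valid_chunk(chunk: str, min_words: int = 150, max_words: int = 400) -> bool:
    """Pillar 1 invariants: word count within range, and the chunk ends at a
    natural boundary (sentence-final punctuation), never mid-sentence."""
    n_words = len(chunk.split())
    if not (min_words <= n_words <= max_words):
        return False
    # Accept common sentence-final characters, including a closing quote.
    return chunk.rstrip().endswith((".", "!", "?", '"', "\u201d"))
```

Chunks that fail this check should be re-segmented at an earlier paragraph boundary, not trimmed mid-sentence.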
Synthetic instructions describe what happens without quoting the text.

## Pipeline Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR AGENT                        │
│   Coordinates pipeline phases, manages state, handles failures   │
└───────────────────────┬──────────────────────────────────────────┘
                        │
       ┌────────────────┼────────────────┬────────────────┐
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │ INSTRUCTION  │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │  Chunks →    │ │   Pairs →    │
│              │ │ 150-400 words│ │  Prompts     │ │   JSONL      │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                                 │
                ┌────────────────┴────────────────┐
                ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │   TRAINING   │                  │  VALIDATION  │
        │    AGENT     │                  │    AGENT     │
        │   LoRA on    │                  │ AI detector  │
        │    Tinker    │                  │ Originality  │
        └──────────────┘                  └──────────────┘
```

## Phase 1: Text Extraction

### Critical Rules

1. **Always source ePub over PDF** - OCR errors become learned patterns
2. **Use paragraph-level extraction** - Extract from `<p>` tags to preserve breaks
3. **Remove front/back matter** - Copyright and TOC pollute the dataset

```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```

## Phase 2: Intelligent Segmentation

### Smaller Chunks + Overlap

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650).

```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0

    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words

    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```

### Expected Results

For an 86,000-word book:
- Old method (250-650 words): ~150 chunks
- New method (150-400 + overlap): ~300 chunks
- With 2 variants per chunk: 600+ training examples

## Phase 3: Diverse Instruction Generation

### The Key Insight

Using a single prompt template causes memorization.
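One way to catch this failure mode early is to audit a built dataset for prompt variety — a rough sketch, assuming the chat-message JSONL format from Phase 4; the five-word "shape" heuristic is an illustrative choice:

```python
import json

def count_prompt_shapes(jsonl_path: str, prefix_words: int = 5) -> int:
    """Count distinct user-prompt openings in a chat-format JSONL dataset.
    A low count relative to the number of templates signals memorization risk."""
    shapes = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            messages = json.loads(line)["messages"]
            user = next(m["content"] for m in messages if m["role"] == "user")
            # Two prompts share a "shape" if their first few words match.
            shapes.add(" ".join(user.split()[:prefix_words]))
    return len(shapes)
```

If this returns a small number relative to the 15+ templates recommended above, regenerate the dataset with more variety before training.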
Diverse templates teach the underlying style.

```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```

### Instruction Generation

```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```

## Phase 4: Dataset Construction

### Message Format

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```

### Multiple Variants Per Chunk

```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```

## Phase 5: LoRA Training on Tinker

### Configuration

```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                     # 352MB adapter
    "learning_rate": 5e-4,               # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```

### Why Base Model?

Use **base** (pretrained) models, not instruction-tuned versions:
- Base models are more malleable for new styles
- Instruct models have patterns that resist overwriting
- Style is a low-level pattern that base models capture better

### Training Loop

```python
import tinker
from tinker import types

service_client = tinker.ServiceClient()  # connect to the Tinker service
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```

## Phase 6: Validation

### Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```

If the model applies style markers to modern scenarios, it learned **style**, not **content**.

### Originality Verification

```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```

### AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

## Known Issues and Solutions

### Character Name Leakage

**Symptom**: Model uses original character names in new scenarios.
**Cause**: Limited name diversity from one book.
**Solution**: Train on multiple books or add synthetic examples.

### Model Parrots Exact Phrases

**Symptom**: Outputs contain exact sentences from training data.
**Cause**: Too few prompt variations or too many epochs.
**Solution**: Use 15+ templates, limit to 3 epochs.

### Fragmented Outputs

**Symptom**: Sentences feel incomplete.
**Cause**: Poor segmentation breaking mid-thought.
**Solution**: Always break at paragraph boundaries.

## Guidelines

1. **Always source ePub over PDF** - OCR errors become learned patterns
2. **Never break mid-sentence** - Boundaries must be grammatically complete
3. **Use diverse prompts** - 15+ templates, 5+ system prompts
4. **Use base models** - Not instruct versions
5. **Use smaller chunks** - 150-400 words for more examples
6. **Reserve test set** - 50 examples minimum
7. **Test on modern scenarios** - Proves style transfer vs memorization
8. **Verify originality** - Grep training data for output phrases

## Expected Results

| Metric | Value |
|--------|-------|
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% perfect |

## Cost Estimate

| Component | Cost |
|-----------|------|
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| **Total** | **~$2.00** |

## Integration with Context Engineering Skills

This example applies several skills from the Agent Skills for Context Engineering collection:

### project-development

The pipeline follows the staged, idempotent architecture pattern:
- **Acquire**: Extract text from ePub
- **Prepare**: Segment into training chunks
- **Process**: Generate synthetic instructions
- **Parse**: Build message format
- **Render**: Output Tinker-compatible JSONL
- **Train**: LoRA fine-tuning
- **Validate**: Modern scenario testing

Each phase is resumable and produces intermediate artifacts for debugging.

### context-compression

Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity.
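To put rough numbers on that trade-off, here is a back-of-the-envelope estimator for dataset size — a sketch only; the function and its parameters are illustrative, and it ignores the paragraph overlap from Phase 2:

```python
def estimate_examples(total_words: int, avg_chunk_words: int, variants: int = 2) -> int:
    """Rough dataset size: chunks from one book times prompt variants per chunk.
    Overlapping segmentation (Phase 2) can raise the chunk count further."""
    return (total_words // avg_chunk_words) * variants
```

For the 86,000-word example, `estimate_examples(86_000, avg_chunk_words=275)` gives 624, consistent with the 600+ training examples reported in Phase 2.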
Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.

The two-tier strategy mirrors context compression evaluation:
- Tier 1: Fast, deterministic compression
- Tier 2: LLM-assisted for edge cases

### multi-agent-patterns

The pipeline uses the **supervisor/orchestrator** pattern:
- Orchestrator coordinates phases and manages state
- Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
- Each agent receives only the information needed for its task

This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.

### evaluation

Validation follows the **end-state evaluation** pattern:
- Functional testing: Does output match expected style markers?
- Originality verification: Is content genuinely generated?
- External validation: AI detector scores

The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.

### context-fundamentals

Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping.
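That memorization risk can be quantified after training with a simple verbatim-overlap measure — a sketch; the 8-word window is an illustrative choice:

```python
def ngram_overlap(output: str, training_text: str, n: int = 8) -> float:
    """Fraction of n-word sequences in `output` that appear verbatim in
    `training_text`. High values suggest recall, not style transfer."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    out = ngrams(output)
    if not out:
        return 0.0  # output too short to contain any n-gram
    return len(out & ngrams(training_text)) / len(out)
```

A score near 0.0 against the training text is consistent with genuine style transfer; combine this with the grep check from Phase 6.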
Diverse templates force attention across the style patterns themselves.

## References

Internal references:
- [Segmentation Strategies](./references/segmentation-strategies.md) - Text chunking patterns
- [Tinker Format Specification](./references/tinker-format.md) - Datum structure
- [Tinker API Documentation](./references/tinker.txt) - Full API reference

Related skills from Agent Skills for Context Engineering:
- project-development - Pipeline architecture patterns
- context-compression - Compression strategies
- multi-agent-patterns - Agent coordination
- evaluation - Evaluation frameworks
- context-fundamentals - Attention and information density

External resources:
- [Research Paper](https://arxiv.org/pdf/2510.13939) - Chakrabarty et al. 2025
- [Dataset on Hugging Face](https://huggingface.co/datasets/MuratcanKoylan/gertrude-stein-style-sft)
- [Gertrude Stein Case Study](./examples/gertrude-stein/) - Complete working example

---

## Skill Metadata

**Created**: 2025-12-26
**Last Updated**: 2025-12-28
**Author**: Muratcan Koylan
**Version**: 2.0.0
**Standalone**: Yes (separate from main context-engineering collection)