A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
examples/book-sft-pipeline/SKILL.md
---
name: book-sft-pipeline
description: This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
version: 2.0.0
---

# Book SFT Pipeline

A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.

## When to Activate

Activate this skill when:
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models (8B or less) on limited data

## Core Concepts

### The Three Pillars of Book SFT

**1. Intelligent Segmentation**
Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

**2. Diverse Instruction Generation**
Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

**3. Style Over Content**
The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots.
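Each pillar can be enforced mechanically. For pillar 1, a minimal validity check — a sketch only; the helper name and the punctuation set are illustrative assumptions:

```python
def is_valid_chunk(chunk: str, min_words: int = 150, max_words: int = 400) -> bool:
    """Pillar 1 invariants: word count within range, and the chunk ends at a
    natural boundary (sentence-final punctuation), never mid-sentence."""
    n_words = len(chunk.split())
    if not (min_words <= n_words <= max_words):
        return False
    # Accept common sentence-final characters, including a closing quote.
    return chunk.rstrip().endswith((".", "!", "?", '"', "\u201d"))
```

Chunks that fail this check should be re-segmented at an earlier paragraph boundary, not trimmed mid-sentence.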
Synthetic instructions describe what happens without quoting the text.

## Pipeline Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                        ORCHESTRATOR AGENT                        │
│   Coordinates pipeline phases, manages state, handles failures   │
└───────────────────────┬──────────────────────────────────────────┘
                        │
       ┌────────────────┼────────────────┬────────────────┐
       ▼                ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │ INSTRUCTION  │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │  Chunks →    │ │   Pairs →    │
│              │ │ 150-400 words│ │  Prompts     │ │   JSONL      │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                                 │
                ┌────────────────┴────────────────┐
                ▼                                 ▼
        ┌──────────────┐                  ┌──────────────┐
        │   TRAINING   │                  │  VALIDATION  │
        │    AGENT     │                  │    AGENT     │
        │   LoRA on    │                  │ AI detector  │
        │    Tinker    │                  │ Originality  │
        └──────────────┘                  └──────────────┘
```

## Phase 1: Text Extraction

### Critical Rules

1. **Always source ePub over PDF** - OCR errors become learned patterns
2. **Use paragraph-level extraction** - Extract from `<p>` tags to preserve breaks
3. **Remove front/back matter** - Copyright and TOC pollute the dataset

```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```

## Phase 2: Intelligent Segmentation

### Smaller Chunks + Overlap

Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650).

```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0

    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words

    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```

### Expected Results

For an 86,000-word book:
- Old method (250-650 words): ~150 chunks
- New method (150-400 + overlap): ~300 chunks
- With 2 variants per chunk: 600+ training examples

## Phase 3: Diverse Instruction Generation

### The Key Insight

Using a single prompt template causes memorization.
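One way to catch this failure mode early is to audit a built dataset for prompt variety — a rough sketch, assuming the chat-message JSONL format from Phase 4; the five-word "shape" heuristic is an illustrative choice:

```python
import json

def count_prompt_shapes(jsonl_path: str, prefix_words: int = 5) -> int:
    """Count distinct user-prompt openings in a chat-format JSONL dataset.
    A low count relative to the number of templates signals memorization risk."""
    shapes = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            messages = json.loads(line)["messages"]
            user = next(m["content"] for m in messages if m["role"] == "user")
            # Two prompts share a "shape" if their first few words match.
            shapes.add(" ".join(user.split()[:prefix_words]))
    return len(shapes)
```

If this returns a small number relative to the 15+ templates recommended above, regenerate the dataset with more variety before training.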
Diverse templates teach the underlying style.

```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```

### Instruction Generation

```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```

## Phase 4: Dataset Construction

### Message Format

```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```

### Multiple Variants Per Chunk

```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```

## Phase 5: LoRA Training on Tinker

### Configuration

```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                     # 352MB adapter
    "learning_rate": 5e-4,               # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```

### Why Base Model?

Use **base** (pretrained) models, not instruction-tuned versions:
- Base models are more malleable for new styles
- Instruct models have patterns that resist overwriting
- Style is a low-level pattern that base models capture better

### Training Loop

```python
import tinker
from tinker import types

service_client = tinker.ServiceClient()  # connect to the Tinker service
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```

## Phase 6: Validation

### Modern Scenario Test

Test with scenarios that couldn't exist in the original book:

```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```

If the model applies style markers to modern scenarios, it learned **style**, not **content**.

### Originality Verification

```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```

### AI Detector Testing

Test outputs with GPTZero, Pangram, or ZeroGPT.

## Known Issues and Solutions

### Character Name Leakage

**Symptom**: Model uses original character names in new scenarios.
**Cause**: Limited name diversity from one book.
**Solution**: Train on multiple books or add synthetic examples.

### Model Parrots Exact Phrases

**Symptom**: Outputs contain exact sentences from training data.
**Cause**: Too few prompt variations or too many epochs.
**Solution**: Use 15+ templates, limit to 3 epochs.

### Fragmented Outputs

**Symptom**: Sentences feel incomplete.
**Cause**: Poor segmentation breaking mid-thought.
**Solution**: Always break at paragraph boundaries.

## Guidelines

1. **Always source ePub over PDF** - OCR errors become learned patterns
2. **Never break mid-sentence** - Boundaries must be grammatically complete
3. **Use diverse prompts** - 15+ templates, 5+ system prompts
4. **Use base models** - Not instruct versions
5. **Use smaller chunks** - 150-400 words for more examples
6. **Reserve test set** - 50 examples minimum
7. **Test on modern scenarios** - Proves style transfer vs memorization
8. **Verify originality** - Grep training data for output phrases

## Expected Results

| Metric | Value |
|--------|-------|
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% perfect |

## Cost Estimate

| Component | Cost |
|-----------|------|
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| **Total** | **~$2.00** |

## Integration with Context Engineering Skills

This example applies several skills from the Agent Skills for Context Engineering collection:

### project-development

The pipeline follows the staged, idempotent architecture pattern:
- **Acquire**: Extract text from ePub
- **Prepare**: Segment into training chunks
- **Process**: Generate synthetic instructions
- **Parse**: Build message format
- **Render**: Output Tinker-compatible JSONL
- **Train**: LoRA fine-tuning
- **Validate**: Modern scenario testing

Each phase is resumable and produces intermediate artifacts for debugging.

### context-compression

Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity.
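To put rough numbers on that trade-off, here is a back-of-the-envelope estimator for dataset size — a sketch only; the function and its parameters are illustrative, and it ignores the paragraph overlap from Phase 2:

```python
def estimate_examples(total_words: int, avg_chunk_words: int, variants: int = 2) -> int:
    """Rough dataset size: chunks from one book times prompt variants per chunk.
    Overlapping segmentation (Phase 2) can raise the chunk count further."""
    return (total_words // avg_chunk_words) * variants
```

For the 86,000-word example, `estimate_examples(86_000, avg_chunk_words=275)` gives 624, consistent with the 600+ training examples reported in Phase 2.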
Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.

The two-tier strategy mirrors context compression evaluation:
- Tier 1: Fast, deterministic compression
- Tier 2: LLM-assisted for edge cases

### multi-agent-patterns

The pipeline uses the **supervisor/orchestrator** pattern:
- Orchestrator coordinates phases and manages state
- Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
- Each agent receives only the information needed for its task

This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.

### evaluation

Validation follows the **end-state evaluation** pattern:
- Functional testing: Does output match expected style markers?
- Originality verification: Is content genuinely generated?
- External validation: AI detector scores

The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.

### context-fundamentals

Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping.
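That memorization risk can be quantified after training with a simple verbatim-overlap measure — a sketch; the 8-word window is an illustrative choice:

```python
def ngram_overlap(output: str, training_text: str, n: int = 8) -> float:
    """Fraction of n-word sequences in `output` that appear verbatim in
    `training_text`. High values suggest recall, not style transfer."""
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    out = ngrams(output)
    if not out:
        return 0.0  # output too short to contain any n-gram
    return len(out & ngrams(training_text)) / len(out)
```

A score near 0.0 against the training text is consistent with genuine style transfer; combine this with the grep check from Phase 6.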
Diverse templates force attention across the style patterns themselves.

## References

Internal references:
- [Segmentation Strategies](./references/segmentation-strategies.md) - Text chunking patterns
- [Tinker Format Specification](./references/tinker-format.md) - Datum structure
- [Tinker API Documentation](./references/tinker.txt) - Full API reference

Related skills from Agent Skills for Context Engineering:
- project-development - Pipeline architecture patterns
- context-compression - Compression strategies
- multi-agent-patterns - Agent coordination
- evaluation - Evaluation frameworks
- context-fundamentals - Attention and information density

External resources:
- [Research Paper](https://arxiv.org/pdf/2510.13939) - Chakrabarty et al. 2025
- [Dataset on Hugging Face](https://huggingface.co/datasets/MuratcanKoylan/gertrude-stein-style-sft)
- [Gertrude Stein Case Study](./examples/gertrude-stein/) - Complete working example

---

## Skill Metadata

**Created**: 2025-12-26
**Last Updated**: 2025-12-28
**Author**: Muratcan Koylan
**Version**: 2.0.0
**Standalone**: Yes (separate from main context-engineering collection)