Loading source
Pulling the file list, source metadata, and syntax-aware rendering for this listing.
Source from repo
A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems.
Files
Skill
Size
Entrypoint
Format
Open file
Syntax-highlighted preview of this file as included in the skill package.
examples/book-sft-pipeline/references/segmentation-strategies.md
# Segmentation Strategies

Advanced patterns for splitting books into training chunks while preserving narrative coherence.

## The Segmentation Problem

Books present unique challenges for training data creation:

1. **Variable paragraph length**: Some authors write single paragraphs spanning 1000+ words
2. **Dialogue-heavy sections**: Short exchanges that individually are too small
3. **Scene boundaries**: Natural break points that don't align with word counts
4. **Stylistic variations**: Authors shift voice between narrative, dialogue, and exposition

Poor segmentation teaches the model to produce:
- Incomplete thoughts
- Abrupt endings
- Incoherent transitions
- Fragmented style

## Two-Tier Strategy

### Tier 1: Paragraph-Based Accumulation

The default approach for well-structured text:

```python
class Tier1Segmenter:
    def __init__(self, min_words: int = 250, max_words: int = 650):
        self.min_words = min_words
        self.max_words = max_words

    def segment(self, text: str) -> list[Chunk]:
        paragraphs = self._split_paragraphs(text)
        chunks = []
        current = ChunkBuilder()

        for para in paragraphs:
            word_count = len(para.split())

            # Check if single paragraph exceeds max
            if word_count > self.max_words:
                # Finalize current chunk if exists
                if current.word_count > 0:
                    chunks.append(current.build())
                    current = ChunkBuilder()

                # Mark for Tier 2 processing
                chunks.append(Chunk(
                    text=para,
                    requires_tier2=True,
                    word_count=word_count
                ))
                continue

            # Would this paragraph overflow current chunk?
            if current.word_count + word_count > self.max_words:
                if current.word_count >= self.min_words:
                    chunks.append(current.build())
                    current = ChunkBuilder()

            current.add(para)

        # Don't forget the last chunk
        if current.word_count > 0:
            chunks.append(current.build())

        return chunks

    def _split_paragraphs(self, text: str) -> list[str]:
        # Split on double newlines, preserve single newlines within
        paragraphs = text.split('\n\n')
        return [p.strip() for p in paragraphs if p.strip()]
```

### Tier 2: LLM-Assisted Segmentation

For oversized paragraphs that cannot be split at paragraph boundaries:

```python
class Tier2Segmenter:
    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.prompt_template = self._load_prompt()

    async def segment(self, oversized_chunk: Chunk) -> list[Chunk]:
        """Split an oversized paragraph using LLM."""

        response = await self._call_llm(
            self.prompt_template.format(text=oversized_chunk.text)
        )

        segments = self._parse_segments(response)

        # Validate zero-deletion
        original_words = len(oversized_chunk.text.split())
        segmented_words = sum(len(s.split()) for s in segments)

        if abs(original_words - segmented_words) > 5:  # Allow tiny variance
            raise SegmentationError(
                f"Word count mismatch: {original_words} -> {segmented_words}"
            )

        return [
            Chunk(text=s, requires_tier2=False, word_count=len(s.split()))
            for s in segments
        ]

    def _load_prompt(self) -> str:
        return """Segment this text into excerpts of minimum 300-350 words.

Requirements:
- Each excerpt must be grammatically complete from start
- Each excerpt must not feel abruptly cut off
- Zero deletion - maintain original word count exactly
- Break at grammatically natural places:
  * After complete dialogue exchanges
  * At scene transitions
  * After complete thoughts or descriptions
  * Where a paragraph break would naturally occur
- Avoid breaking into too many small excerpts
- Start directly with the excerpts
- Separate excerpts with ===SEGMENT===

Text to segment:
{text}
"""

    def _parse_segments(self, response: str) -> list[str]:
        segments = response.split("===SEGMENT===")
        return [s.strip() for s in segments if s.strip()]
```

## Scene-Aware Segmentation

For higher-quality results, detect scene boundaries:

```python
class SceneAwareSegmenter:
    """Prefer breaking at scene boundaries when within word limits."""

    SCENE_MARKERS = [
        r'\n\n\* \* \*\n\n',   # Asterisk dividers
        r'\n\n---\n\n',        # Dash dividers
        r'\n\n###\n\n',        # Hash dividers
        r'\n\nCHAPTER \d+',    # Chapter headings
        r'\n\n[A-Z]{3,}\n\n',  # All-caps scene breaks
    ]

    def find_scene_breaks(self, text: str) -> list[int]:
        """Find character positions of scene breaks."""
        breaks = []
        for pattern in self.SCENE_MARKERS:
            for match in re.finditer(pattern, text):
                breaks.append(match.start())
        return sorted(set(breaks))

    def segment_with_scenes(self, text: str) -> list[Chunk]:
        scene_breaks = self.find_scene_breaks(text)

        # If scene breaks exist, prefer them over arbitrary paragraph breaks
        if scene_breaks:
            return self._segment_at_scenes(text, scene_breaks)
        else:
            return Tier1Segmenter().segment(text)
```

## Dialogue Handling

Dialogue-heavy sections require special handling:

```python
class DialogueAwareSegmenter:
    """Group dialogue exchanges to maintain conversation coherence."""

    def is_dialogue_paragraph(self, para: str) -> bool:
        """Check if paragraph is primarily dialogue."""
        # Count dialogue markers
        quote_count = para.count('"') + para.count("'")
        word_count = len(para.split())

        # If more than 20% of words are in quotes, it's dialogue-heavy
        return quote_count > word_count * 0.2

    def segment(self, text: str) -> list[Chunk]:
        paragraphs = text.split('\n\n')
        chunks = []
        current = ChunkBuilder()
        in_dialogue_block = False

        for para in paragraphs:
            is_dialogue = self.is_dialogue_paragraph(para)

            # Don't break in the middle of a dialogue exchange
            if is_dialogue:
                in_dialogue_block = True
                current.add(para)
            else:
                if in_dialogue_block:
                    # End of dialogue block - good break point
                    in_dialogue_block = False
                    if current.word_count >= 250:
                        chunks.append(current.build())
                        current = ChunkBuilder()

                current.add(para)

            # Check if we've exceeded max
            if current.word_count > 650:
                chunks.append(current.build())
                current = ChunkBuilder()

        if current.word_count > 0:
            chunks.append(current.build())

        return chunks
```

## Validation Pipeline

Every segmentation result should pass validation:

```python
class SegmentationValidator:
    def validate(self, chunks: list[Chunk]) -> ValidationResult:
        errors = []
        warnings = []

        for i, chunk in enumerate(chunks):
            # Check word count bounds
            if chunk.word_count < 200:
                warnings.append(f"Chunk {i}: Only {chunk.word_count} words")
            if chunk.word_count > 700:
                errors.append(f"Chunk {i}: {chunk.word_count} words exceeds max")

            # Check sentence completeness
            if not self._ends_with_terminal(chunk.text):
                errors.append(f"Chunk {i}: Ends mid-sentence")

            if not self._starts_grammatically(chunk.text):
                errors.append(f"Chunk {i}: Starts mid-sentence")

            # Check for orphaned dialogue
            if chunk.text.count('"') % 2 != 0:
                warnings.append(f"Chunk {i}: Unbalanced quotes")

        return ValidationResult(
            valid=len(errors) == 0,
            errors=errors,
            warnings=warnings
        )

    def _ends_with_terminal(self, text: str) -> bool:
        text = text.strip()
        return text[-1] in '.!?"\'—'

    def _starts_grammatically(self, text: str) -> bool:
        text = text.strip()
        # Should start with capital or quote
        return text[0].isupper() or text[0] in '"\'—'
```

## Performance Considerations

| Strategy | Speed | Quality | Use Case |
|----------|-------|---------|----------|
| Tier 1 only | Fast | Moderate | Well-structured prose |
| Tier 1 + Tier 2 | Moderate | High | Mixed paragraph lengths |
| Scene-aware | Fast | High | Novels with clear scene breaks |
| Dialogue-aware | Moderate | High | Dialogue-heavy fiction |

## Edge Cases

**1. Stream-of-consciousness writing**
- Single "paragraphs" spanning pages
- Solution: Force Tier 2 with explicit sentence boundary detection

**2. Poetry or verse**
- Line breaks are semantic, not formatting
- Solution: Treat each stanza as atomic unit

**3. Non-fiction with lists/bullets**
- Bullet points break paragraph detection
- Solution: Pre-process to convert bullets to prose

**4. Multiple narrators**
- Voice shifts within chapters
- Solution: Detect narrator markers and prefer breaking there

## Integration with Pipeline

```python
class SegmentationAgent:
    def __init__(self, config: SegmentationConfig):
        self.tier1 = Tier1Segmenter(
            min_words=config.min_words,
            max_words=config.max_words
        )
        self.tier2 = Tier2Segmenter(model=config.tier2_model)
        self.validator = SegmentationValidator()

    async def segment(self, text: str) -> list[Chunk]:
        # Phase 1: Tier 1 segmentation
        chunks = self.tier1.segment(text)

        # Phase 2: Process oversized chunks with Tier 2
        final_chunks = []
        for chunk in chunks:
            if chunk.requires_tier2:
                sub_chunks = await self.tier2.segment(chunk)
                final_chunks.extend(sub_chunks)
            else:
                final_chunks.append(chunk)

        # Phase 3: Validate
        result = self.validator.validate(final_chunks)
        if not result.valid:
            raise SegmentationError(result.errors)

        if result.warnings:
            logger.warning(f"Segmentation warnings: {result.warnings}")

        return final_chunks
```