Source from repo

DOCX creation, editing, and analysis

Create, read, edit, and manipulate Word (.docx) documents with formatting, tables, and tracked changes

Publisher: anthropics (GitHub, official)

Files

Skill: n/a
Size: 1.1 MB
Entrypoint: SKILL.md
Format: git-repo

---
name: docx
description: "Use this skill whenever the user wants to create, read, edit, or manipulate Word documents (.docx files). Triggers include: any mention of 'Word doc', 'word document', '.docx', or requests to produce professional documents with formatting like tables of contents, headings, page numbers, or letterheads. Also use when extracting or reorganizing content from .docx files, inserting or replacing images in documents, performing find-and-replace in Word files, working with tracked changes or comments, or converting content into a polished Word document. If the user asks for a 'report', 'memo', 'letter', 'template', or similar deliverable as a Word or .docx file, use this skill. Do NOT use for PDFs, spreadsheets, Google Docs, or general coding tasks unrelated to document generation."
license: Proprietary. LICENSE.txt has complete terms
---

# DOCX creation, editing, and analysis

## Overview

A .docx file is a ZIP archive containing XML files.

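Because the container is an ordinary ZIP archive, its first four bytes are the ZIP local-file-header signature `PK\x03\x04`. The sketch below checks only for that signature; `looksLikeZip` is a hypothetical helper name, not part of this skill's scripts, and it does not validate the archive contents.

```javascript
// Hypothetical helper (not part of the skill's scripts).
// A ZIP archive with at least one entry starts with the bytes 50 4B 03 04 ("PK\x03\x04").
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_SIGNATURE);
}

// Usage with a file on disk (the path is an example):
// const fs = require('fs');
// looksLikeZip(fs.readFileSync('document.docx'));
```

A legacy binary `.doc` file fails this check, which is a quick hint that it needs converting first.
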
## Quick Reference

| Task | Approach |
|------|----------|
| Read/analyze content | `pandoc` or unpack for raw XML |
| Create new document | Use `docx-js` - see Creating New Documents below |
| Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below |

### Converting .doc to .docx

Legacy `.doc` files must be converted before editing:

```bash
python scripts/office/soffice.py --headless --convert-to docx document.doc
```

### Reading Content

```bash
# Text extraction with tracked changes
pandoc --track-changes=all document.docx -o output.md

# Raw XML access
python scripts/office/unpack.py document.docx unpacked/
```

### Converting to Images

```bash
python scripts/office/soffice.py --headless --convert-to pdf document.docx
pdftoppm -jpeg -r 150 document.pdf page
```

### Accepting Tracked Changes

To produce a clean document with all tracked changes accepted (requires LibreOffice):

```bash
python scripts/accept_changes.py input.docx output.docx
```

---

## Creating New Documents

Generate .docx files with JavaScript, then validate. Install: `npm install -g docx`

### Setup
```javascript
const fs = require('fs'); // needed for writeFileSync below
const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun,
        Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink,
        InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab,
        PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader,
        TabStopType, TabStopPosition, Column, SectionType,
        TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType,
        VerticalAlign, PageNumber, PageBreak } = require('docx');

const doc = new Document({ sections: [{ children: [/* content */] }] });
Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer));
```

### Validation
After creating the file, validate it. If validation fails, unpack, fix the XML, and repack.
```bash
python scripts/office/validate.py doc.docx
```

### Page Size

```javascript
// CRITICAL: docx-js defaults to A4, not US Letter
// Always set page size explicitly for consistent results
sections: [{
  properties: {
    page: {
      size: {
        width: 12240,   // 8.5 inches in DXA
        height: 15840   // 11 inches in DXA
      },
      margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins
    }
  },
  children: [/* content */]
}]
```

**Common page sizes (DXA units, 1440 DXA = 1 inch):**

| Paper | Width | Height | Content Width (1" margins) |
|-------|-------|--------|---------------------------|
| US Letter | 12,240 | 15,840 | 9,360 |
| A4 (default) | 11,906 | 16,838 | 9,026 |

**Landscape orientation:** docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap:
```javascript
size: {
  width: 12240,   // Pass SHORT edge as width
  height: 15840,  // Pass LONG edge as height
  orientation: PageOrientation.LANDSCAPE  // docx-js swaps them in the XML
},
// Content width = 15840 - left margin - right margin (uses the long edge)
```

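The figures in the page-size table and the landscape note follow from plain arithmetic on 1440 DXA per inch. A minimal sketch of that arithmetic is below; `inchesToDxa` and `contentWidthDxa` are hypothetical helper names introduced only for illustration and are not part of docx-js.

```javascript
// Unit conversion sketch: 1440 DXA = 1 inch (as stated in the table heading above).
const DXA_PER_INCH = 1440;

function inchesToDxa(inches) {
  return Math.round(inches * DXA_PER_INCH);
}

// Hypothetical helper: usable width between the left and right margins.
// In landscape the long edge becomes the rendered page width.
function contentWidthDxa({ width, height, landscape = false }, marginLeft, marginRight) {
  const renderedWidth = landscape ? Math.max(width, height) : width;
  return renderedWidth - marginLeft - marginRight;
}

const letter = { width: inchesToDxa(8.5), height: inchesToDxa(11) };
const a4 = { width: 11906, height: 16838 };
const margin = inchesToDxa(1);

console.log(letter.width, letter.height);                                      // 12240 15840
console.log(contentWidthDxa(letter, margin, margin));                          // 9360
console.log(contentWidthDxa(a4, margin, margin));                              // 9026
console.log(contentWidthDxa({ ...letter, landscape: true }, margin, margin));  // 12960
```
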
### Styles (Override Built-in Headings)

Use Arial as the default font (universally supported). Keep titles black for readability.

```javascript
const doc = new Document({
  styles: {
    default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default
    paragraphStyles: [
      // IMPORTANT: Use exact IDs to override built-in styles
      { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 32, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC
      { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
        run: { size: 28, bold: true, font: "Arial" },
        paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }),
    ]
  }]
});
```

### Lists (NEVER use unicode bullets)

```javascript
// ❌ WRONG - never manually insert bullet characters
new Paragraph({ children: [new TextRun("• Item")] })  // BAD
new Paragraph({ children: [new TextRun("\u2022 Item")] })  // BAD

// ✅ CORRECT - use numbering config with LevelFormat.BULLET
const doc = new Document({
  numbering: {
    config: [
      { reference: "bullets",
        levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
      { reference: "numbers",
        levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
          style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] },
    ]
  },
  sections: [{
    children: [
      new Paragraph({ numbering: { reference: "bullets", level: 0 },
        children: [new TextRun("Bullet item")] }),
      new Paragraph({ numbering: { reference: "numbers", level: 0 },
        children: [new TextRun("Numbered item")] }),
    ]
  }]
});

// ⚠️ Each reference creates INDEPENDENT numbering
// Same reference = continues (1,2,3 then 4,5,6)
// Different reference = restarts (1,2,3 then 1,2,3)
```

### Tables

**CRITICAL: Tables need dual widths** - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms.

```javascript
// CRITICAL: Always set table width for consistent rendering
// CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };

new Table({
  width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs)
  columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch)
  rows: [
    new TableRow({
      children: [
        new TableCell({
          borders,
          width: { size: 4680, type: WidthType.DXA }, // Also set on each cell
          shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID
          margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width)
          children: [new Paragraph({ children: [new TextRun("Cell")] })]
        })
      ]
    })
  ]
})
```

**Table width calculation:**

Always use `WidthType.DXA`; `WidthType.PERCENTAGE` breaks in Google Docs.

```javascript
// Table width = sum of columnWidths = content width
// US Letter with 1" margins: 12240 - 2880 = 9360 DXA
width: { size: 9360, type: WidthType.DXA },
columnWidths: [7000, 2360]  // Must sum to table width
```

**Width rules:**
- **Always use `WidthType.DXA`** - never `WidthType.PERCENTAGE` (incompatible with Google Docs)
- Table width must equal the sum of `columnWidths`
- Each cell `width` must match the corresponding entry in `columnWidths`
- Cell `margins` are internal padding - they reduce the content area rather than adding to the cell width
- For full-width tables: use content width (page width minus left and right margins)

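Since the column widths have to add up to the table width exactly, integer rounding is the usual way to end up one DXA short. The sketch below splits a width by ratio and hands the rounding remainder to the last column; `splitColumns` is a hypothetical helper, not a docx-js function.

```javascript
// Hypothetical helper: split a table width (DXA) into integer column widths by ratio.
function splitColumns(totalDxa, ratios) {
  const ratioSum = ratios.reduce((a, b) => a + b, 0);
  const widths = ratios.map(r => Math.floor((totalDxa * r) / ratioSum));
  // Give the rounding remainder to the last column so the sum is exact.
  const used = widths.reduce((a, b) => a + b, 0);
  widths[widths.length - 1] += totalDxa - used;
  return widths;
}

console.log(splitColumns(9360, [1, 1]).join(' + '));    // 4680 + 4680
console.log(splitColumns(9360, [1, 1, 1]).join(' + ')); // 3120 + 3120 + 3120
console.log(splitColumns(9026, [2, 1]).join(' + '));    // 6017 + 3009
```

The returned array can be passed as `columnWidths`, with each entry reused as the `width.size` of the cells in that column.
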
### Images

```javascript
// CRITICAL: type parameter is REQUIRED
new Paragraph({
  children: [new ImageRun({
    type: "png", // Required: png, jpg, jpeg, gif, bmp, svg
    data: fs.readFileSync("image.png"),
    transformation: { width: 200, height: 150 },
    altText: { title: "Title", description: "Desc", name: "Name" } // All three required
  })]
})
```

### Page Breaks

```javascript
// CRITICAL: PageBreak must be inside a Paragraph
new Paragraph({ children: [new PageBreak()] })

// Or use pageBreakBefore
new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] })
```

### Hyperlinks

```javascript
// External link
new Paragraph({
  children: [new ExternalHyperlink({
    children: [new TextRun({ text: "Click here", style: "Hyperlink" })],
    link: "https://example.com",
  })]
})

// Internal link (bookmark + reference)
// 1. Create bookmark at destination
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [
  new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }),
]})
// 2. Link to it
new Paragraph({ children: [new InternalHyperlink({
  children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })],
  anchor: "chapter1",
})]})
```

### Footnotes

```javascript
const doc = new Document({
  footnotes: {
    1: { children: [new Paragraph("Source: Annual Report 2024")] },
    2: { children: [new Paragraph("See appendix for methodology")] },
  },
  sections: [{
    children: [new Paragraph({
      children: [
        new TextRun("Revenue grew 15%"),
        new FootnoteReferenceRun(1),
        new TextRun(" using adjusted metrics"),
        new FootnoteReferenceRun(2),
      ],
    })]
  }]
});
```

### Tab Stops

```javascript
// Right-align text on same line (e.g., date opposite a title)
new Paragraph({
  children: [
    new TextRun("Company Name"),
    new TextRun("\tJanuary 2025"),
  ],
  tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
})

// Dot leader (e.g., TOC-style)
new Paragraph({
  children: [
    new TextRun("Introduction"),
    new TextRun({ children: [
      new PositionalTab({
        alignment: PositionalTabAlignment.RIGHT,
        relativeTo: PositionalTabRelativeTo.MARGIN,
        leader: PositionalTabLeader.DOT,
      }),
      "3",
    ]}),
  ],
})
```

### Multi-Column Layouts

```javascript
// Equal-width columns
sections: [{
  properties: {
    column: {
      count: 2,          // number of columns
      space: 720,        // gap between columns in DXA (720 = 0.5 inch)
      equalWidth: true,
      separate: true,    // vertical line between columns
    },
  },
  children: [/* content flows naturally across columns */]
}]

// Custom-width columns (equalWidth must be false)
sections: [{
  properties: {
    column: {
      equalWidth: false,
      children: [
        new Column({ width: 5400, space: 720 }),
        new Column({ width: 3240 }),
      ],
    },
  },
  children: [/* content */]
}]
```

Force a column break with a new section using `type: SectionType.NEXT_COLUMN`.

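In the custom-width example, the two column widths plus the gap add up to 9,360 DXA, which is the US Letter content width with 1-inch margins. Assuming that is the intended constraint (the source does not state it as a rule), a quick check could look like the sketch below. `columnsFillWidth` is a hypothetical helper and takes plain objects rather than `Column` instances.

```javascript
// Hypothetical check: do the column widths plus the gaps after them fill the content width?
// Each entry mirrors the options passed to `new Column(...)` in the example above.
function columnsFillWidth(columns, contentWidthDxa) {
  const total = columns.reduce((sum, c) => sum + c.width + (c.space || 0), 0);
  return total === contentWidthDxa;
}

console.log(columnsFillWidth([{ width: 5400, space: 720 }, { width: 3240 }], 9360)); // true
console.log(columnsFillWidth([{ width: 5400, space: 720 }, { width: 3000 }], 9360)); // false
```
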
### Table of Contents

```javascript
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
```

### Headers/Footers

```javascript
sections: [{
  properties: {
    page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
  },
  headers: {
    default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
  },
  footers: {
    default: new Footer({ children: [new Paragraph({
      children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
    })] })
  },
  children: [/* content */]
}]
```

### Critical Rules for docx-js

- **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE`
- **Never use `\n`** - use separate Paragraph elements
- **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config
- **PageBreak must be in Paragraph** - standalone creates invalid XML
- **ImageRun requires `type`** - always specify png/jpg/etc
- **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs)
- **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match
- **Table width = sum of columnWidths** - for DXA, ensure they add up exactly
- **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding
- **Use `ShadingType.CLEAR`** - never SOLID for table shading
- **Never use tables as dividers/rules** - cells have minimum height and render as empty boxes (including in headers/footers); use `border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }` on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables
- **TOC requires HeadingLevel only** - no custom styles on heading paragraphs
- **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc.
- **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.)

---

## Editing Existing Documents

**Follow all 3 steps in order.**

### Step 1: Unpack
```bash
python scripts/office/unpack.py document.docx unpacked/
```
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`&#x201C;` etc.) so they survive editing. Use `--merge-runs false` to skip run merging.

### Step 2: Edit XML

Edit files in `unpacked/word/`. See XML Reference below for patterns.

**Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests a different name.

**Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.

**CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
```xml
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
```
| Entity | Character |
|--------|-----------|
| `&#x2018;` | ‘ (left single) |
| `&#x2019;` | ’ (right single / apostrophe) |
| `&#x201C;` | “ (left double) |
| `&#x201D;` | ” (right double) |

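The mapping in the table is mechanical, so it can be written down as a small lookup. The sketch below is only meant to make the mapping concrete, for instance when preparing pre-escaped text ahead of time; `toXmlText` is a hypothetical helper, and the edit to the XML itself is still made with the Edit tool as described above.

```javascript
// Hypothetical helper: escape XML metacharacters and encode curly quotes
// as the numeric entities listed in the table above.
const ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '\u2018': '&#x2018;', // left single
  '\u2019': '&#x2019;', // right single / apostrophe
  '\u201C': '&#x201C;', // left double
  '\u201D': '&#x201D;', // right double
};

function toXmlText(text) {
  return text.replace(/[&<>\u2018\u2019\u201C\u201D]/g, ch => ENTITIES[ch]);
}

console.log(toXmlText('Here\u2019s a quote: \u201CHello\u201D'));
// Here&#x2019;s a quote: &#x201C;Hello&#x201D;
```
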
**Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML):
```bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0  # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"  # custom author name
```
Then add markers to document.xml (see Comments in XML Reference).

### Step 3: Pack
```bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
```
Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip.

**Auto-repair will fix:**
- `durableId` >= 0x7FFFFFFF (regenerates valid ID)
- Missing `xml:space="preserve"` on `<w:t>` with whitespace

**Auto-repair won't fix:**
- Malformed XML, invalid element nesting, missing relationships, schema violations

### Common Pitfalls

- **Replace entire `<w:r>` elements**: When adding tracked changes, replace the whole `<w:r>...</w:r>` block with `<w:del>...<w:ins>...` as siblings. Don't inject tracked change tags inside a run.
- **Preserve `<w:rPr>` formatting**: Copy the original run's `<w:rPr>` block into your tracked change runs to maintain bold, font size, etc.

---

## XML Reference

### Schema Compliance

- **Element order in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`, `<w:rPr>` last
- **Whitespace**: Add `xml:space="preserve"` to `<w:t>` with leading/trailing spaces
- **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`)

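Two of the rules above are easy to express as checks. The sketch below is illustrative only: `isValidRsid` and `textElement` are hypothetical names, and `textElement` assumes its argument is already XML-escaped.

```javascript
// Hypothetical check: an RSID is exactly eight hexadecimal digits.
function isValidRsid(value) {
  return /^[0-9A-Fa-f]{8}$/.test(value);
}

// Hypothetical helper: emit <w:t>, adding xml:space="preserve" only when the
// (already escaped) text has leading or trailing whitespace.
function textElement(escapedText) {
  const needsPreserve = /^\s|\s$/.test(escapedText);
  return needsPreserve
    ? `<w:t xml:space="preserve">${escapedText}</w:t>`
    : `<w:t>${escapedText}</w:t>`;
}

console.log(isValidRsid('00AB1234')); // true
console.log(isValidRsid('AB1234'));   // false
console.log(textElement(' days.'));   // <w:t xml:space="preserve"> days.</w:t>
console.log(textElement('60'));       // <w:t>60</w:t>
```
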
### Tracked Changes

**Insertion:**
```xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>
```

**Deletion:**
```xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
```

**Inside `<w:del>`**: Use `<w:delText>` instead of `<w:t>`, and `<w:delInstrText>` instead of `<w:instrText>`.

**Minimal edits** - only mark what changes:
```xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t>The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t> days.</w:t></w:r>
```

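One way to think about "only mark what changes" is as a word-level comparison: keep the shared prefix and suffix as ordinary runs and wrap only the differing middle in `<w:del>` and `<w:ins>`. The sketch below shows that reasoning for a single contiguous change. `minimalEdit` is a hypothetical helper that splits on whitespace; it is meant to clarify the idea, not to replace editing the XML by hand.

```javascript
// Hypothetical helper: find the smallest word-level span that differs
// between two versions of a run's text (single contiguous change only).
function minimalEdit(before, after) {
  // Split on whitespace but keep the whitespace tokens so the text can be rejoined.
  const a = before.split(/(\s+)/);
  const b = after.split(/(\s+)/);

  let start = 0;
  while (start < a.length && start < b.length && a[start] === b[start]) start++;

  let endA = a.length;
  let endB = b.length;
  while (endA > start && endB > start && a[endA - 1] === b[endB - 1]) {
    endA--;
    endB--;
  }

  return {
    prefix: a.slice(0, start).join(''),       // unchanged run before the edit
    deleted: a.slice(start, endA).join(''),   // goes inside <w:del>/<w:delText>
    inserted: b.slice(start, endB).join(''),  // goes inside <w:ins>/<w:t>
    suffix: a.slice(endA).join(''),           // unchanged run after the edit
  };
}

const edit = minimalEdit('The term is 30 days.', 'The term is 60 days.');
console.log(`${edit.prefix}[-${edit.deleted}-]{+${edit.inserted}+}${edit.suffix}`);
// The term is [-30-]{+60+} days.
```
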
**Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:
```xml
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
```
Without the `<w:del/>` in `<w:pPr><w:rPr>`, accepting changes leaves an empty paragraph/list item.

**Rejecting another author's insertion** - nest deletion inside their insertion:
```xml
<w:ins w:author="Jane" w:id="5">
  <w:del w:author="Claude" w:id="10">
    <w:r><w:delText>their inserted text</w:delText></w:r>
  </w:del>
</w:ins>
```

**Restoring another author's deletion** - add insertion after (don't modify their deletion):
```xml
<w:del w:author="Jane" w:id="5">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
  <w:r><w:t>deleted text</w:t></w:r>
</w:ins>
```

### Comments

After running `comment.py` (see Step 2), add markers to document.xml. For replies, use `--parent` flag and nest markers inside the parent's.

**CRITICAL: `<w:commentRangeStart>` and `<w:commentRangeEnd>` are siblings of `<w:r>`, never inside `<w:r>`.**

```xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>

<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
  <w:commentRangeStart w:id="1"/>
  <w:r><w:t>text</w:t></w:r>
  <w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
```

### Images

1. Add image file to `word/media/`
2. Add relationship to `word/_rels/document.xml.rels`:
```xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
```
3. Add content type to `[Content_Types].xml`:
```xml
<Default Extension="png" ContentType="image/png"/>
```
4. Reference in document.xml:
```xml
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
```

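The `cx` and `cy` values in `<wp:extent>` are EMUs, at 914400 per inch. A small conversion sketch follows; the helper names are hypothetical, and the pixel conversion assumes 96 pixels per inch, which the source does not specify.

```javascript
// EMU conversion sketch: 914400 EMU = 1 inch (from the comment in the XML above).
const EMU_PER_INCH = 914400;

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

// Assumption: 96 DPI unless the caller says otherwise (9525 EMU per pixel at 96 DPI).
function pixelsToEmu(pixels, dpi = 96) {
  return Math.round((pixels * EMU_PER_INCH) / dpi);
}

console.log(inchesToEmu(1));   // 914400
console.log(pixelsToEmu(200)); // 1905000
console.log(pixelsToEmu(150)); // 1428750
```
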
---

## Dependencies

- **pandoc**: Text extraction
- **docx**: `npm install -g docx` (new documents)
- **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
- **Poppler**: `pdftoppm` for images

291### Tab Stops
292 
293```javascript
294// Right-align text on same line (e.g., date opposite a title)
295new Paragraph({
296  children: [
297    new TextRun("Company Name"),
298    new TextRun("\tJanuary 2025"),
299  ],
300  tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }],
301})
302 
303// Dot leader (e.g., TOC-style)
304new Paragraph({
305  children: [
306    new TextRun("Introduction"),
307    new TextRun({ children: [
308      new PositionalTab({
309        alignment: PositionalTabAlignment.RIGHT,
310        relativeTo: PositionalTabRelativeTo.MARGIN,
311        leader: PositionalTabLeader.DOT,
312      }),
313      "3",
314    ]}),
315  ],
316})
317```

### Multi-Column Layouts

```javascript
// Equal-width columns
sections: [{
  properties: {
    column: {
      count: 2,          // number of columns
      space: 720,        // gap between columns in DXA (720 = 0.5 inch)
      equalWidth: true,
      separate: true,    // vertical line between columns
    },
  },
  children: [/* content flows naturally across columns */]
}]

// Custom-width columns (equalWidth must be false)
sections: [{
  properties: {
    column: {
      equalWidth: false,
      children: [
        new Column({ width: 5400, space: 720 }),
        new Column({ width: 3240 }),
      ],
    },
  },
  children: [/* content */]
}]
```

Force a column break with a new section using `type: SectionType.NEXT_COLUMN`.

### Table of Contents

```javascript
// CRITICAL: Headings must use HeadingLevel ONLY - no custom styles
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" })
```

### Headers/Footers

```javascript
sections: [{
  properties: {
    page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch
  },
  headers: {
    default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] })
  },
  footers: {
    default: new Footer({ children: [new Paragraph({
      children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })]
    })] })
  },
  children: [/* content */]
}]
```
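Page margins like these, together with the page size, determine the content width used for full-width tables. The sketch below assumes US Letter and follows the landscape rule listed under Critical Rules (always pass the short edge as `width`). `pageGeometry` is a hypothetical helper, not a docx-js function.

```javascript
// Hypothetical helper (not part of docx-js). Sizes are in DXA (1440 = 1 inch).
const US_LETTER = { shortEdge: 12240, longEdge: 15840 };

function pageGeometry({ shortEdge, longEdge }, margin, landscape) {
  return {
    // Always portrait order: per this document's landscape rule, docx-js
    // swaps width and height itself when the orientation is landscape.
    size: { width: shortEdge, height: longEdge },
    // The horizontal edge of the rendered page, minus left and right margins.
    contentWidth: (landscape ? longEdge : shortEdge) - 2 * margin,
  };
}

console.log(pageGeometry(US_LETTER, 1440, false).contentWidth); // 9360
console.log(pageGeometry(US_LETTER, 1440, true).contentWidth);  // 12960
```

In the landscape case the `size` object is unchanged; only the width available to tables grows.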

### Critical Rules for docx-js

- **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents
- **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE`
- **Never use `\n`** - use separate Paragraph elements
- **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config
- **PageBreak must be in Paragraph** - standalone creates invalid XML
- **ImageRun requires `type`** - always specify png/jpg/etc
- **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs)
- **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match
- **Table width = sum of columnWidths** - for DXA, ensure they add up exactly
- **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding
- **Use `ShadingType.CLEAR`** - never SOLID for table shading
- **Never use tables as dividers/rules** - cells have minimum height and render as empty boxes (including in headers/footers); use `border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }` on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables
- **TOC requires HeadingLevel only** - no custom styles on heading paragraphs
- **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc.
- **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.)

---

## Editing Existing Documents

**Follow all 3 steps in order.**

### Step 1: Unpack
```bash
python scripts/office/unpack.py document.docx unpacked/
```
Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`&#x201C;` etc.) so they survive editing. Use `--merge-runs false` to skip run merging.

### Step 2: Edit XML

Edit files in `unpacked/word/`. See XML Reference below for patterns.

**Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests use of a different name.

**Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced.

**CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes:
```xml
<!-- Use these entities for professional typography -->
<w:t>Here&#x2019;s a quote: &#x201C;Hello&#x201D;</w:t>
```
| Entity | Character |
|--------|-----------|
| `&#x2018;` | ‘ (left single) |
| `&#x2019;` | ’ (right single / apostrophe) |
| `&#x201C;` | “ (left double) |
| `&#x201D;` | ” (right double) |

**Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML):
```bash
python scripts/comment.py unpacked/ 0 "Comment text with &amp; and &#x2019;"
python scripts/comment.py unpacked/ 1 "Reply text" --parent 0  # reply to comment 0
python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author"  # custom author name
```
Then add markers to document.xml (see Comments in XML Reference).
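Because the comment text must be pre-escaped, it can help to see what that escaping involves. A minimal sketch, assuming the input is plain, not-yet-escaped text; `escapeCommentText` is a hypothetical helper and is not one of the skill's scripts.

```javascript
// Hypothetical helper: pre-escape plain text for use as the comment text.
// Escapes XML metacharacters and maps curly quotes to the entities
// listed in the table above. Straight quotes are left untouched.
function escapeCommentText(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/\u2018/g, "&#x2018;")
    .replace(/\u2019/g, "&#x2019;")
    .replace(/\u201C/g, "&#x201C;")
    .replace(/\u201D/g, "&#x201D;");
}

const escaped = escapeCommentText("R&D\u2019s \u201Cdraft\u201D <v2>");
console.log(escaped);
// R&amp;D&#x2019;s &#x201C;draft&#x201D; &lt;v2&gt;
```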

### Step 3: Pack
```bash
python scripts/office/pack.py unpacked/ output.docx --original document.docx
```
Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip.

**Auto-repair will fix:**
- `durableId` >= 0x7FFFFFFF (regenerates valid ID)
- Missing `xml:space="preserve"` on `<w:t>` with whitespace

**Auto-repair won't fix:**
- Malformed XML, invalid element nesting, missing relationships, schema violations

### Common Pitfalls

- **Replace entire `<w:r>` elements**: When adding tracked changes, replace the whole `<w:r>...</w:r>` block with `<w:del>...<w:ins>...` as siblings. Don't inject tracked change tags inside a run.
- **Preserve `<w:rPr>` formatting**: Copy the original run's `<w:rPr>` block into your tracked change runs to maintain bold, font size, etc.

---

## XML Reference

### Schema Compliance

- **Element order in `<w:pPr>`**: `<w:pStyle>`, `<w:numPr>`, `<w:spacing>`, `<w:ind>`, `<w:jc>`, `<w:rPr>` last
- **Whitespace**: Add `xml:space="preserve"` to `<w:t>` with leading/trailing spaces
- **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`)

### Tracked Changes

**Insertion:**
```xml
<w:ins w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:t>inserted text</w:t></w:r>
</w:ins>
```

**Deletion:**
```xml
<w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
```

**Inside `<w:del>`**: Use `<w:delText>` instead of `<w:t>`, and `<w:delInstrText>` instead of `<w:instrText>`.

**Minimal edits** - only mark what changes:
```xml
<!-- Change "30 days" to "60 days" -->
<w:r><w:t xml:space="preserve">The term is </w:t></w:r>
<w:del w:id="1" w:author="Claude" w:date="...">
  <w:r><w:delText>30</w:delText></w:r>
</w:del>
<w:ins w:id="2" w:author="Claude" w:date="...">
  <w:r><w:t>60</w:t></w:r>
</w:ins>
<w:r><w:t xml:space="preserve"> days.</w:t></w:r>
```
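The minimal-edit pattern above can be generated mechanically. The sketch below builds the four sibling elements as a string, copies the original run's `<w:rPr>` into every run (see Common Pitfalls), and adds `xml:space="preserve"` where the Whitespace rule under Schema Compliance requires it. `trackedReplace` is a hypothetical helper, not one of the skill's scripts, and it assumes every text argument is already XML-escaped. It only illustrates the structure; the edit itself is still applied with the Edit tool.

```javascript
// Hypothetical helper: build the sibling elements for a tracked replacement.
// Assumes all text arguments are already XML-escaped and that `rPr` is the
// original run's <w:rPr>...</w:rPr> block, or "" if the run has none.
function trackedReplace({ before, oldText, newText, after, delId, insId, date, author = "Claude", rPr = "" }) {
  // Add xml:space="preserve" when the text has leading or trailing whitespace.
  const text = (tag, value) =>
    /^\s|\s$/.test(value)
      ? `<${tag} xml:space="preserve">${value}</${tag}>`
      : `<${tag}>${value}</${tag}>`;
  const run = (tag, value) => `<w:r>${rPr}${text(tag, value)}</w:r>`;
  const attrs = (id) => `w:id="${id}" w:author="${author}" w:date="${date}"`;
  return [
    before ? run("w:t", before) : "",
    `<w:del ${attrs(delId)}>${run("w:delText", oldText)}</w:del>`,
    `<w:ins ${attrs(insId)}>${run("w:t", newText)}</w:ins>`,
    after ? run("w:t", after) : "",
  ].filter(Boolean).join("\n");
}

const xml = trackedReplace({
  before: "The term is ", oldText: "30", newText: "60", after: " days.",
  delId: 1, insId: 2, date: "2025-01-01T00:00:00Z",
});
console.log(xml);
```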

**Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `<w:del/>` inside `<w:pPr><w:rPr>`:
```xml
<w:p>
  <w:pPr>
    <w:numPr>...</w:numPr>  <!-- list numbering if present -->
    <w:rPr>
      <w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z"/>
    </w:rPr>
  </w:pPr>
  <w:del w:id="2" w:author="Claude" w:date="2025-01-01T00:00:00Z">
    <w:r><w:delText>Entire paragraph content being deleted...</w:delText></w:r>
  </w:del>
</w:p>
```
Without the `<w:del/>` in `<w:pPr><w:rPr>`, accepting changes leaves an empty paragraph/list item.

**Rejecting another author's insertion** - nest deletion inside their insertion:
```xml
<w:ins w:author="Jane" w:id="5">
  <w:del w:author="Claude" w:id="10">
    <w:r><w:delText>their inserted text</w:delText></w:r>
  </w:del>
</w:ins>
```

**Restoring another author's deletion** - add insertion after (don't modify their deletion):
```xml
<w:del w:author="Jane" w:id="5">
  <w:r><w:delText>deleted text</w:delText></w:r>
</w:del>
<w:ins w:author="Claude" w:id="10">
  <w:r><w:t>deleted text</w:t></w:r>
</w:ins>
```

### Comments

After running `comment.py` (see Step 2), add markers to document.xml. For replies, use the `--parent` flag and nest the reply's markers inside the parent's.

**CRITICAL: `<w:commentRangeStart>` and `<w:commentRangeEnd>` are siblings of `<w:r>`, never inside `<w:r>`.**

```xml
<!-- Comment markers are direct children of w:p, never inside w:r -->
<w:commentRangeStart w:id="0"/>
<w:del w:id="1" w:author="Claude" w:date="2025-01-01T00:00:00Z">
  <w:r><w:delText>deleted</w:delText></w:r>
</w:del>
<w:r><w:t xml:space="preserve"> more text</w:t></w:r>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>

<!-- Comment 0 with reply 1 nested inside -->
<w:commentRangeStart w:id="0"/>
  <w:commentRangeStart w:id="1"/>
  <w:r><w:t>text</w:t></w:r>
  <w:commentRangeEnd w:id="1"/>
<w:commentRangeEnd w:id="0"/>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="0"/></w:r>
<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="1"/></w:r>
```

### Images

1. Add image file to `word/media/`
2. Add relationship to `word/_rels/document.xml.rels`:
```xml
<Relationship Id="rId5" Type=".../image" Target="media/image1.png"/>
```
3. Add content type to `[Content_Types].xml`:
```xml
<Default Extension="png" ContentType="image/png"/>
```
4. Reference in document.xml:
```xml
<w:drawing>
  <wp:inline>
    <wp:extent cx="914400" cy="914400"/>  <!-- EMUs: 914400 = 1 inch -->
    <a:graphic>
      <a:graphicData uri=".../picture">
        <pic:pic>
          <pic:blipFill><a:blip r:embed="rId5"/></pic:blipFill>
        </pic:pic>
      </a:graphicData>
    </a:graphic>
  </wp:inline>
</w:drawing>
```
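The `cx` and `cy` values in `<wp:extent>` are in EMUs. A small sketch of the conversion, using the 914400 EMU per inch figure from the comment above; `inchesToEmu` and `pixelsToEmu` are hypothetical helper names, and the 96 DPI default is an assumption about the source image.

```javascript
// Hypothetical helpers for <wp:extent> values. 914400 EMU = 1 inch.
const EMU_PER_INCH = 914400;

function inchesToEmu(inches) {
  return Math.round(inches * EMU_PER_INCH);
}

// Assumes a 96 DPI image by default, so one pixel is 914400 / 96 = 9525 EMU.
function pixelsToEmu(pixels, dpi = 96) {
  return Math.round((pixels * EMU_PER_INCH) / dpi);
}

console.log(inchesToEmu(1));   // 914400
console.log(inchesToEmu(2.5)); // 2286000
console.log(pixelsToEmu(200)); // 1905000
```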

---

## Dependencies

- **pandoc**: Text extraction
- **docx**: `npm install -g docx` (new documents)
- **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`)
- **Poppler**: `pdftoppm` for images