---
name: pdf
description: Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.
license: Proprietary. LICENSE.txt has complete terms
---

# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction
```python
# Requires: pip install pandas openpyxl (openpyxl is needed for to_excel)
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Check if table is not empty
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

### reportlab - Create PDFs

#### Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create PDF with Multiple Pages
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build PDF
doc.build(story)
```

#### Subscripts and Superscripts

**IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.

Instead, use ReportLab's XML markup tags in Paragraph objects:
```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])

# Superscripts: use <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
```

For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.

## Command-Line Tools

### pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```

### pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common Tasks

### Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert PDF to images
images = convert_from_path('scanned.pdf')

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create watermark (or load existing)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

### Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# This extracts images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# With -j, only JPEG-encoded images are saved as .jpg; others are written as .ppm/.pbm.
```

### Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to image first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |

## Next Steps

- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md
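
### Appendix: Canvas-Drawn Subscripts and Superscripts

The guide's advice for canvas-drawn text (adjust the font size and position by hand instead of using Unicode subscript/superscript characters) can be sketched roughly as below. This is a minimal illustration, not part of the original skill: the helper names `layout_runs`, `draw_runs`, and `demo`, the 0.6 size scale, and the -0.2/+0.35 baseline shift factors are all assumptions chosen for the example, not ReportLab defaults.

```python
from typing import List, Tuple

# Each run is (text, kind), where kind is "normal", "sub", or "super".
Run = Tuple[str, str]


def layout_runs(runs: List[Run], font_size: float,
                scale: float = 0.6) -> List[Tuple[str, float, float]]:
    """Return (text, size, baseline_shift) for each run.

    The scale and shift factors are illustrative assumptions.
    """
    laid_out = []
    for text, kind in runs:
        if kind == "sub":
            # Smaller font, shifted below the baseline
            laid_out.append((text, font_size * scale, -0.2 * font_size))
        elif kind == "super":
            # Smaller font, shifted above the baseline
            laid_out.append((text, font_size * scale, 0.35 * font_size))
        else:
            laid_out.append((text, font_size, 0.0))
    return laid_out


def draw_runs(c, x: float, y: float, runs: List[Run],
              font_name: str = "Helvetica", font_size: float = 12) -> float:
    """Draw runs left to right on a ReportLab canvas; return the final x."""
    for text, size, shift in layout_runs(runs, font_size):
        c.setFont(font_name, size)
        c.drawString(x, y + shift, text)
        x += c.stringWidth(text, font_name, size)  # advance by the run width
    return x


def demo() -> None:
    """Write subscripts.pdf containing H2O and x squared (hypothetical demo)."""
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("subscripts.pdf", pagesize=letter)
    draw_runs(c, 100, 700, [("H", "normal"), ("2", "sub"), ("O", "normal")])
    draw_runs(c, 100, 670, [("x", "normal"), ("2", "super")])
    c.save()
```

Calling `demo()` writes a one-page `subscripts.pdf`. Keeping the size and offset arithmetic in `layout_runs`, separate from the drawing calls, makes it possible to tune the factors for a given font without touching the canvas code.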