PDF Processing Guide

Read, create, merge, split, watermark, encrypt, OCR, and fill PDF files using Python and CLI tools

# PDF Processing Advanced Reference

This document contains advanced PDF processing features, detailed examples, and additional libraries not covered in the main skill instructions.

## pypdfium2 Library (Apache/BSD License)

### Overview
pypdfium2 is a Python binding for PDFium (Chromium's PDF library). It is well suited to fast PDF rendering and image generation, and it can serve as a replacement for PyMuPDF.

### Render PDF to Images
```python
import pypdfium2 as pdfium
from PIL import Image

# Load PDF
pdf = pdfium.PdfDocument("document.pdf")

# Render page to image
page = pdf[0]  # First page
bitmap = page.render(
    scale=2.0,  # Higher resolution
    rotation=0  # No rotation
)

# Convert to PIL Image
img = bitmap.to_pil()
img.save("page_1.png", "PNG")

# Process multiple pages
for i, page in enumerate(pdf):
    bitmap = page.render(scale=1.5)
    img = bitmap.to_pil()
    img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
```
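
The `scale` argument is often easier to reason about as a DPI value. PDF page units are normally 1/72 inch, so a scale of 1.0 corresponds to roughly 72 DPI. The sketch below assumes that default unit size; `scale_for_dpi` and `rendered_size` are hypothetical helper names (not part of pypdfium2), and the pixel size is an estimate whose rounding may differ slightly from what the renderer produces.

```python
import math

POINTS_PER_INCH = 72  # assumed default size of one PDF unit (1/72 inch)

def scale_for_dpi(dpi):
    """Return the render scale that yields roughly `dpi` dots per inch."""
    return dpi / POINTS_PER_INCH

def rendered_size(width_pt, height_pt, scale):
    """Estimate the pixel dimensions of a page rendered at `scale`."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)

# US Letter (612 x 792 pt) rendered at 144 DPI
scale = scale_for_dpi(144)
print(scale, rendered_size(612, 792, scale))
```

The returned value can be passed as `scale=` in the rendering calls above, for example `page.render(scale=scale_for_dpi(300))` for print-quality output.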

### Extract Text with pypdfium2
```python
import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("document.pdf")
for i, page in enumerate(pdf):
    # Text is read through a text page object, not from the page directly
    textpage = page.get_textpage()
    text = textpage.get_text_range()
    print(f"Page {i+1} text length: {len(text)} chars")
```

## JavaScript Libraries

### pdf-lib (MIT License)

pdf-lib is a powerful JavaScript library for creating and modifying PDF documents in any JavaScript environment.

#### Load and Manipulate Existing PDF
```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

async function manipulatePDF() {
    // Load existing PDF
    const existingPdfBytes = fs.readFileSync('input.pdf');
    const pdfDoc = await PDFDocument.load(existingPdfBytes);

    // Get page count
    const pageCount = pdfDoc.getPageCount();
    console.log(`Document has ${pageCount} pages`);

    // Add new page
    const newPage = pdfDoc.addPage([600, 400]);
    newPage.drawText('Added by pdf-lib', {
        x: 100,
        y: 300,
        size: 16
    });

    // Save modified PDF
    const pdfBytes = await pdfDoc.save();
    fs.writeFileSync('modified.pdf', pdfBytes);
}
```

#### Create Complex PDFs from Scratch
```javascript
import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';
import fs from 'fs';

async function createPDF() {
    const pdfDoc = await PDFDocument.create();

    // Add fonts
    const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);
    const helveticaBold = await pdfDoc.embedFont(StandardFonts.HelveticaBold);

    // Add page
    const page = pdfDoc.addPage([595, 842]); // A4 size
    const { width, height } = page.getSize();

    // Add text with styling
    page.drawText('Invoice #12345', {
        x: 50,
        y: height - 50,
        size: 18,
        font: helveticaBold,
        color: rgb(0.2, 0.2, 0.8)
    });

    // Add rectangle (header background)
    page.drawRectangle({
        x: 40,
        y: height - 100,
        width: width - 80,
        height: 30,
        color: rgb(0.9, 0.9, 0.9)
    });

    // Add table-like content
    const items = [
        ['Item', 'Qty', 'Price', 'Total'],
        ['Widget', '2', '$50', '$100'],
        ['Gadget', '1', '$75', '$75']
    ];

    let yPos = height - 150;
    items.forEach(row => {
        let xPos = 50;
        row.forEach(cell => {
            page.drawText(cell, {
                x: xPos,
                y: yPos,
                size: 12,
                font: helveticaFont
            });
            xPos += 120;
        });
        yPos -= 25;
    });

    const pdfBytes = await pdfDoc.save();
    fs.writeFileSync('created.pdf', pdfBytes);
}
```
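
The `[595, 842]` page size used above is A4 (210 mm x 297 mm) expressed in PDF points, where one point is 1/72 inch. As a quick check of that arithmetic, here is a small Python sketch; `mm_to_points` is a hypothetical helper used only for illustration.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72

def mm_to_points(mm):
    """Convert millimeters to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH

# A4 paper is 210 mm x 297 mm
a4_points = [round(mm_to_points(side)) for side in (210, 297)]
print(a4_points)
```

The same conversion is handy for positioning: a 20 mm margin, for instance, is `mm_to_points(20)`, or about 56.7 points.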

#### Advanced Merge and Split Operations
```javascript
import { PDFDocument } from 'pdf-lib';
import fs from 'fs';

async function mergePDFs() {
    // Create new document
    const mergedPdf = await PDFDocument.create();

    // Load source PDFs
    const pdf1Bytes = fs.readFileSync('doc1.pdf');
    const pdf2Bytes = fs.readFileSync('doc2.pdf');

    const pdf1 = await PDFDocument.load(pdf1Bytes);
    const pdf2 = await PDFDocument.load(pdf2Bytes);

    // Copy pages from first PDF
    const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices());
    pdf1Pages.forEach(page => mergedPdf.addPage(page));

    // Copy specific pages from second PDF
    // (zero-based indices 0, 2, 4, i.e. the 1st, 3rd, and 5th pages)
    const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]);
    pdf2Pages.forEach(page => mergedPdf.addPage(page));

    const mergedPdfBytes = await mergedPdf.save();
    fs.writeFileSync('merged.pdf', mergedPdfBytes);
}
```

### pdfjs-dist (Apache License)

PDF.js is Mozilla's JavaScript library for rendering PDFs in the browser.

#### Basic PDF Loading and Rendering
```javascript
import * as pdfjsLib from 'pdfjs-dist';

// Configure worker (important for performance)
pdfjsLib.GlobalWorkerOptions.workerSrc = './pdf.worker.js';

async function renderPDF() {
    // Load PDF
    const loadingTask = pdfjsLib.getDocument('document.pdf');
    const pdf = await loadingTask.promise;

    console.log(`Loaded PDF with ${pdf.numPages} pages`);

    // Get first page
    const page = await pdf.getPage(1);
    const viewport = page.getViewport({ scale: 1.5 });

    // Render to canvas
    const canvas = document.createElement('canvas');
    const context = canvas.getContext('2d');
    canvas.height = viewport.height;
    canvas.width = viewport.width;

    const renderContext = {
        canvasContext: context,
        viewport: viewport
    };

    await page.render(renderContext).promise;
    document.body.appendChild(canvas);
}
```

#### Extract Text with Coordinates
```javascript
import * as pdfjsLib from 'pdfjs-dist';

async function extractText() {
    const loadingTask = pdfjsLib.getDocument('document.pdf');
    const pdf = await loadingTask.promise;

    let fullText = '';
    const textWithCoords = [];

    // Extract text from all pages
    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const textContent = await page.getTextContent();

        const pageText = textContent.items
            .map(item => item.str)
            .join(' ');

        fullText += `\n--- Page ${i} ---\n${pageText}`;

        // Collect text with coordinates for advanced processing
        // (transform[4] and transform[5] hold the x and y translation)
        textContent.items.forEach(item => textWithCoords.push({
            page: i,
            text: item.str,
            x: item.transform[4],
            y: item.transform[5],
            width: item.width,
            height: item.height
        }));
    }

    console.log(fullText);
    return { fullText, textWithCoords };
}
```

#### Extract Annotations and Forms
```javascript
import * as pdfjsLib from 'pdfjs-dist';

async function extractAnnotations() {
    const loadingTask = pdfjsLib.getDocument('annotated.pdf');
    const pdf = await loadingTask.promise;

    for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const annotations = await page.getAnnotations();

        annotations.forEach(annotation => {
            console.log(`Annotation type: ${annotation.subtype}`);
            console.log(`Content: ${annotation.contents}`);
            console.log(`Coordinates: ${JSON.stringify(annotation.rect)}`);
        });
    }
}
```

## Advanced Command-Line Operations

### poppler-utils Advanced Features

#### Extract Text with Bounding Box Coordinates
```bash
# Extract text with bounding box coordinates (essential for structured data)
pdftotext -bbox-layout document.pdf output.html

# The XHTML output contains coordinates (xMin, yMin, xMax, yMax) for each block, line, and word
```
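
To use those coordinates from Python, the bounding-box output can be read with the standard library. The sketch below is illustrative only: `SAMPLE` is a hand-written, trimmed stand-in for what `pdftotext -bbox-layout` is assumed to emit (nested `page`, `flow`, `block`, `line`, and `word` elements with `xMin`/`yMin`/`xMax`/`yMax` attributes), and `words_with_boxes` is a hypothetical helper name. Check the structure against a real output file before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of the -bbox-layout output
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body><doc>
<page width="612.000000" height="792.000000">
<flow><block xMin="72.0" yMin="70.0" xMax="200.0" yMax="84.0">
<line xMin="72.0" yMin="70.0" xMax="200.0" yMax="84.0">
<word xMin="72.0" yMin="70.0" xMax="110.0" yMax="84.0">Invoice</word>
<word xMin="115.0" yMin="70.0" xMax="200.0" yMax="84.0">#12345</word>
</line></block></flow>
</page></doc></body></html>"""

def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}word' -> 'word'."""
    return tag.rsplit("}", 1)[-1]

def words_with_boxes(xhtml_text):
    """Return (page_number, text, (xMin, yMin, xMax, yMax)) for every word."""
    root = ET.fromstring(xhtml_text)
    results = []
    page_number = 0
    # iter() walks in document order, so each page is seen before its words
    for element in root.iter():
        name = local_name(element.tag)
        if name == "page":
            page_number += 1
        elif name == "word":
            box = tuple(float(element.get(key))
                        for key in ("xMin", "yMin", "xMax", "yMax"))
            results.append((page_number, element.text or "", box))
    return results

for page_number, text, box in words_with_boxes(SAMPLE):
    print(page_number, text, box)
```

For a real file, the text passed to `words_with_boxes` would come from reading the file that `pdftotext` wrote. Matching on local tag names keeps the helper independent of the XHTML namespace prefix.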

#### Advanced Image Conversion
```bash
# Convert to PNG images with specific resolution
pdftoppm -png -r 300 document.pdf output_prefix

# Convert specific page range with high resolution
pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages

# Convert to JPEG with quality setting
pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
```

#### Extract Embedded Images
```bash
# Extract all embedded images (-j writes JPEG images as .jpg, -p adds page numbers to file names)
pdfimages -j -p document.pdf page_images

# List image info without extracting
pdfimages -list document.pdf

# Extract images in their original format (the output directory must already exist)
mkdir -p images
pdfimages -all document.pdf images/img
```

### qpdf Advanced Features

#### Complex Page Manipulation
```bash
# Split PDF into groups of pages (%d is replaced with the page range of each group)
qpdf --split-pages=3 input.pdf output_group_%d.pdf

# Extract specific pages with complex ranges (z means the last page)
qpdf input.pdf --pages input.pdf 1,3-5,8,10-z -- extracted.pdf

# Merge specific pages from multiple PDFs
qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
```
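
When page ranges are built by a script, it helps to preview which pages a range selects before calling qpdf. The following sketch expands a simplified subset of the qpdf range syntax: single pages, `a-b` ranges (including descending ones), and `z` for the last page. `expand_page_range` is a hypothetical helper, and other qpdf range features (such as `r1` or `:odd`) are deliberately not handled.

```python
def expand_page_range(spec, page_count):
    """Expand a simplified qpdf-style range such as '1,3-5,8,10-z'.

    Returns 1-based page numbers in selection order.
    """
    def resolve(token):
        # 'z' stands for the last page of the document
        page = page_count if token == "z" else int(token)
        if not 1 <= page <= page_count:
            raise ValueError(f"page {token!r} is outside 1..{page_count}")
        return page

    pages = []
    for part in spec.split(","):
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = resolve(start_text), resolve(end_text)
            step = 1 if start <= end else -1  # descending ranges reverse order
            pages.extend(range(start, end + step, step))
        else:
            pages.append(resolve(part))
    return pages

print(expand_page_range("1,3-5,8,10-z", 12))
```

Validating the range up front gives a clear Python error for an out-of-bounds page instead of a failed qpdf invocation later in a batch job.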

#### PDF Optimization and Repair
```bash
# Optimize PDF for web (linearize for streaming)
qpdf --linearize input.pdf optimized.pdf

# Pack objects into object streams and recompress for a smaller file
qpdf --object-streams=generate --recompress-flate --compression-level=9 input.pdf compressed.pdf

# Check the structure, then rewrite the file (qpdf attempts recovery while reading damaged input)
qpdf --check input.pdf
qpdf damaged.pdf repaired.pdf

# Show page-level PDF structure for debugging
qpdf --show-pages input.pdf > structure.txt
```

#### Advanced Encryption
```bash
# Add password protection with specific permissions
qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf

# Check encryption status
qpdf --show-encryption encrypted.pdf

# Remove password protection (requires password)
qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
```
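
If the encryption step is driven from Python, building the command as an argument list avoids shell-quoting problems with passwords that contain spaces or special characters. This is a minimal sketch that only assembles the same flags shown above; `build_encrypt_command` is a hypothetical helper, and actually running the command requires qpdf to be installed.

```python
import shlex

def build_encrypt_command(src, dst, user_password, owner_password,
                          allow_print="none", allow_modify="none"):
    """Build the argument list for `qpdf --encrypt` with 256-bit keys."""
    return [
        "qpdf",
        "--encrypt", user_password, owner_password, "256",
        f"--print={allow_print}",
        f"--modify={allow_modify}",
        "--",
        src, dst,
    ]

cmd = build_encrypt_command("input.pdf", "encrypted.pdf", "user_pass", "owner_pass")
print(shlex.join(cmd))  # shell-safe rendering, useful for logging
# To execute: subprocess.run(cmd, check=True)
```

Passing the list directly to `subprocess.run` (without `shell=True`) means each password is delivered as a single argument regardless of its contents.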

## Advanced Python Techniques

### pdfplumber Advanced Features

#### Extract Text with Precise Coordinates
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all text with coordinates
    chars = page.chars
    for char in chars[:10]:  # First 10 characters
        print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")

    # Extract text by bounding box (left, top, right, bottom)
    bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
```
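
Character-level coordinates are mostly useful once they are grouped into larger units. As an illustration of the idea, the sketch below clusters character dictionaries into lines by their `top` value. It runs on hand-made sample data shaped like pdfplumber's char dicts (only the `text`, `x0`, and `top` keys are used); `group_chars_into_lines` and the 2-point tolerance are assumptions for this example, and pdfplumber's own `extract_words()` and `extract_text()` are usually the better choice in practice.

```python
def group_chars_into_lines(chars, tolerance=2.0):
    """Group char dicts into text lines by similar 'top' position."""
    lines = []  # each entry: [reference_top, [chars on that line]]
    for char in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(char["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(char)
        else:
            lines.append([char["top"], [char]])
    # Within each line, order characters left to right
    return ["".join(c["text"] for c in sorted(group, key=lambda c: c["x0"]))
            for _, group in lines]

sample_chars = [
    {"text": "H", "x0": 10.0, "top": 100.0},
    {"text": "i", "x0": 16.0, "top": 100.4},
    {"text": "O", "x0": 10.0, "top": 120.0},
    {"text": "k", "x0": 17.0, "top": 119.8},
]
print(group_chars_into_lines(sample_chars))
```

The tolerance absorbs small baseline differences between characters on the same visual line; it would need tuning for documents with tight line spacing or superscripts.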

#### Advanced Table Extraction with Custom Settings
```python
import pdfplumber
import pandas as pd

with pdfplumber.open("complex_table.pdf") as pdf:
    page = pdf.pages[0]

    # Extract tables with custom settings for complex layouts
    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "snap_tolerance": 3,
        "intersection_tolerance": 15
    }
    tables = page.extract_tables(table_settings)

    # Load the first table into a DataFrame, treating its first row as the header
    if tables:
        df = pd.DataFrame(tables[0][1:], columns=tables[0][0])

    # Visual debugging for table extraction: draw the detected lines and cells
    img = page.to_image(resolution=150)
    img.debug_tablefinder(table_settings)
    img.save("debug_layout.png")
```
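
Extracted tables arrive as nested lists in which empty cells may be `None`, so a small cleanup step is usually needed before the data is used. The sketch below shows one way to do that without pandas, on hand-written sample data; `table_to_records` is a hypothetical helper, and treating the first row as the header is an assumption that will not hold for every table.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    The first row is treated as the header. Empty cells, which may be
    None, become empty strings, and surrounding whitespace is trimmed.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records

sample_table = [
    ["Product", "Q1", "Q2"],
    ["Widgets", "120", None],
    ["Gadgets", " 85", "92"],
]
print(table_to_records(sample_table))
```

Each record can then be validated or converted (for example, casting the quarter columns to integers) independently of how the table was detected.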

### reportlab Advanced Features

#### Create Professional Reports with Tables
```python
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

# Sample data
data = [
    ['Product', 'Q1', 'Q2', 'Q3', 'Q4'],
    ['Widgets', '120', '135', '142', '158'],
    ['Gadgets', '85', '92', '98', '105']
]

# Create PDF with table
doc = SimpleDocTemplate("report.pdf")
elements = []

# Add title
styles = getSampleStyleSheet()
title = Paragraph("Quarterly Sales Report", styles['Title'])
elements.append(title)

# Add table with advanced styling
table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('FONTSIZE', (0, 0), (-1, 0), 14),
    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
    ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
elements.append(table)

doc.build(elements)
```

## Complex Workflows

### Extract Figures/Images from PDF

#### Method 1: Using pdfimages (fastest)
```bash
# Extract all images with original quality (the output directory must already exist)
mkdir -p images
pdfimages -all document.pdf images/img
```

#### Method 2: Using pypdfium2 + Image Processing
```python
import os

import numpy as np
import pypdfium2 as pdfium

def extract_figures(pdf_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)

    for page_num, page in enumerate(pdf):
        # Render high-resolution page
        bitmap = page.render(scale=3.0)
        img = bitmap.to_pil().convert("RGB")

        # Convert to numpy for processing
        img_array = np.array(img)

        # Simple detection: mark every pixel that is not pure white
        mask = np.any(img_array != 255, axis=2)
        if not mask.any():
            continue  # Blank page, nothing to save

        # Bounding box of all non-white content on the page.
        # (This is simplified: it crops the page content as one region;
        # separating individual figures needs more sophisticated detection.)
        rows = np.flatnonzero(mask.any(axis=1))
        cols = np.flatnonzero(mask.any(axis=0))
        box = tuple(int(v) for v in (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1))

        # Save the detected region
        img.crop(box).save(os.path.join(output_dir, f"page_{page_num + 1}_content.png"))
```

### Batch PDF Processing with Error Handling
```python
import os
import glob
from pypdf import PdfReader, PdfWriter
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_process_pdfs(input_dir, operation='merge'):
    # Sort so the processing (and merge) order is deterministic
    pdf_files = sorted(glob.glob(os.path.join(input_dir, "*.pdf")))

    if operation == 'merge':
        writer = PdfWriter()
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                for page in reader.pages:
                    writer.add_page(page)
                logger.info(f"Processed: {pdf_file}")
            except Exception as e:
                logger.error(f"Failed to process {pdf_file}: {e}")
                continue

        with open("batch_merged.pdf", "wb") as output:
            writer.write(output)

    elif operation == 'extract_text':
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                text = ""
                for page in reader.pages:
                    text += page.extract_text()

                # Swap only the extension; str.replace could also match '.pdf' in a directory name
                output_file = os.path.splitext(pdf_file)[0] + '.txt'
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(text)
                logger.info(f"Extracted text from: {pdf_file}")

            except Exception as e:
                logger.error(f"Failed to extract text from {pdf_file}: {e}")
                continue
```
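
When merging, the order of the input files determines the page order of the result, and plain alphabetical order places `scan10.pdf` before `scan2.pdf`. One possible remedy, sketched here with a hypothetical `natural_key` helper, is a sort key that compares runs of digits numerically.

```python
import re

def natural_key(path):
    """Sort key that compares digit runs numerically: 'scan2' < 'scan10'."""
    parts = re.split(r"(\d+)", path)
    # With a capturing group, odd positions are always the digit runs
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]

files = ["scan10.pdf", "scan2.pdf", "scan1.pdf"]
print(sorted(files, key=natural_key))
```

The key can be applied to the file list before merging, for example `sorted(pdf_files, key=natural_key)`. Whether numeric ordering is the right choice depends on how the input files are named.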

### Advanced PDF Cropping
```python
from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Crop page (left, bottom, right, top in points)
page = reader.pages[0]
page.mediabox.left = 50
page.mediabox.bottom = 50
page.mediabox.right = 550
page.mediabox.top = 750

writer.add_page(page)
with open("cropped.pdf", "wb") as output:
    writer.write(output)
```
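
Crop coordinates are often easier to specify as margins to trim from each edge. The sketch below derives the four box values from an existing box and a set of margins, all in points with the origin at the bottom-left corner; `crop_box_from_margins` is a hypothetical helper, and the US Letter dimensions are just sample input.

```python
def crop_box_from_margins(box, left=0, bottom=0, right=0, top=0):
    """Shrink a (x0, y0, x1, y1) box in points by the given margins."""
    x0, y0, x1, y1 = box
    cropped = (x0 + left, y0 + bottom, x1 - right, y1 - top)
    if cropped[0] >= cropped[2] or cropped[1] >= cropped[3]:
        raise ValueError("margins remove the whole page")
    return cropped

letter = (0, 0, 612, 792)  # US Letter in points
print(crop_box_from_margins(letter, left=50, bottom=50, right=62, top=42))
```

The four resulting numbers correspond to the `left`, `bottom`, `right`, and `top` values assigned to the media box in the cropping example above.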

## Performance Optimization Tips

### 1. For Large PDFs
- Use streaming approaches instead of loading the entire PDF in memory
- Use `qpdf --split-pages` for splitting large files
- Process pages individually with pypdfium2

### 2. For Text Extraction
- `pdftotext` is fastest for plain text extraction; add `-bbox-layout` only when coordinates are needed
- Use pdfplumber for structured data and tables
- Avoid pypdf's `page.extract_text()` for very large documents

### 3. For Image Extraction
- `pdfimages` is much faster than rendering pages
- Use low resolution for previews, high resolution for final output

### 4. For Form Filling
- pdf-lib maintains form structure better than most alternatives
- Pre-validate form fields before processing

### 5. Memory Management
```python
from pypdf import PdfReader, PdfWriter

# Process PDFs in chunks
def process_large_pdf(pdf_path, chunk_size=10):
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    for start_idx in range(0, total_pages, chunk_size):
        end_idx = min(start_idx + chunk_size, total_pages)
        writer = PdfWriter()

        for i in range(start_idx, end_idx):
            writer.add_page(reader.pages[i])

        # Process chunk
        with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
            writer.write(output)
```

## Troubleshooting Common Issues

### Encrypted PDFs
```python
# Handle password-protected PDFs
from pypdf import PdfReader

try:
    reader = PdfReader("encrypted.pdf")
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password is wrong
        if not reader.decrypt("password"):
            print("Failed to decrypt: incorrect password")
except Exception as e:
    print(f"Failed to decrypt: {e}")
```

### Corrupted PDFs
```bash
# Check the file, then rewrite it in place (qpdf attempts recovery while reading)
qpdf --check corrupted.pdf
qpdf --replace-input corrupted.pdf
```

### Text Extraction Issues
```python
# Fallback to OCR for scanned PDFs
import pytesseract
from pdf2image import convert_from_path

def extract_text_with_ocr(pdf_path):
    images = convert_from_path(pdf_path)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text
```
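
OCR is slow, so it is worth running only when normal extraction yields little or nothing. A rough heuristic for that decision is sketched below; `needs_ocr` is a hypothetical helper and the 20-character threshold is an arbitrary assumption that should be tuned to the documents at hand.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is scanned: little or no extractable text."""
    meaningful = "".join(extracted_text.split())  # drop all whitespace
    return len(meaningful) < min_chars

print(needs_ocr(""))
print(needs_ocr("Quarterly Sales Report for the fiscal year"))
```

A caller could first try a text-based extractor and fall back to `extract_text_with_ocr` only when `needs_ocr` returns `True` for the result.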

## License Information

- **pypdf**: BSD License
- **pdfplumber**: MIT License
- **pypdfium2**: Apache/BSD License
- **reportlab**: BSD License
- **poppler-utils**: GPL-2 License
- **qpdf**: Apache License
- **pdf-lib**: MIT License
- **pdfjs-dist**: Apache License

Marketplace

Source from repo

PDF Processing Guide

Read, create, merge, split, watermark, encrypt, OCR, and fill PDF files using Python and CLI tools

anthropicsGitHub anthropicsOfficialSource repo Original GitHub link Publisher page

Files

Skill

n/a

Size

57.3 KB

Entrypoint

SKILL.md

Format

git-repo

Open file

reference.md

Syntax-highlighted preview of this file as included in the skill package.

Rendered Source

markdown612 linesFree

reference.md

1# PDF Processing Advanced Reference
2 
3This document contains advanced PDF processing features, detailed examples, and additional libraries not covered in the main skill instructions.
4 
5## pypdfium2 Library (Apache/BSD License)
6 
7### Overview
8pypdfium2 is a Python binding for PDFium (Chromium's PDF library). It's excellent for fast PDF rendering, image generation, and serves as a PyMuPDF replacement.
9 
10### Render PDF to Images
11```python
12import pypdfium2 as pdfium
13from PIL import Image
14 
15# Load PDF
16pdf = pdfium.PdfDocument("document.pdf")
17 
18# Render page to image
19page = pdf[0]  # First page
20bitmap = page.render(
21    scale=2.0,  # Higher resolution
22    rotation=0  # No rotation
23)
24 
25# Convert to PIL Image
26img = bitmap.to_pil()
27img.save("page_1.png", "PNG")
28 
29# Process multiple pages
30for i, page in enumerate(pdf):
31    bitmap = page.render(scale=1.5)
32    img = bitmap.to_pil()
33    img.save(f"page_{i+1}.jpg", "JPEG", quality=90)
34```
35 
36### Extract Text with pypdfium2
37```python
38import pypdfium2 as pdfium
39 
40pdf = pdfium.PdfDocument("document.pdf")
41for i, page in enumerate(pdf):
42    text = page.get_text()
43    print(f"Page {i+1} text length: {len(text)} chars")
44```
45 
46## JavaScript Libraries
47 
48### pdf-lib (MIT License)
49 
50pdf-lib is a powerful JavaScript library for creating and modifying PDF documents in any JavaScript environment.
51 
52#### Load and Manipulate Existing PDF
53```javascript
54import { PDFDocument } from 'pdf-lib';
55import fs from 'fs';
56 
57async function manipulatePDF() {
58    // Load existing PDF
59    const existingPdfBytes = fs.readFileSync('input.pdf');
60    const pdfDoc = await PDFDocument.load(existingPdfBytes);
61 
62    // Get page count
63    const pageCount = pdfDoc.getPageCount();
64    console.log(`Document has ${pageCount} pages`);
65 
66    // Add new page
67    const newPage = pdfDoc.addPage([600, 400]);
68    newPage.drawText('Added by pdf-lib', {
69        x: 100,
70        y: 300,
71        size: 16
72    });
73 
74    // Save modified PDF
75    const pdfBytes = await pdfDoc.save();
76    fs.writeFileSync('modified.pdf', pdfBytes);
77}
78```
79 
80#### Create Complex PDFs from Scratch
81```javascript
82import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';
83import fs from 'fs';
84 
85async function createPDF() {
86    const pdfDoc = await PDFDocument.create();
87 
88    // Add fonts
89    const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);
90    const helveticaBold = await pdfDoc.embedFont(StandardFonts.HelveticaBold);
91 
92    // Add page
93    const page = pdfDoc.addPage([595, 842]); // A4 size
94    const { width, height } = page.getSize();
95 
96    // Add text with styling
97    page.drawText('Invoice #12345', {
98        x: 50,
99        y: height - 50,
100        size: 18,
101        font: helveticaBold,
102        color: rgb(0.2, 0.2, 0.8)
103    });
104 
105    // Add rectangle (header background)
106    page.drawRectangle({
107        x: 40,
108        y: height - 100,
109        width: width - 80,
110        height: 30,
111        color: rgb(0.9, 0.9, 0.9)
112    });
113 
114    // Add table-like content
115    const items = [
116        ['Item', 'Qty', 'Price', 'Total'],
117        ['Widget', '2', '$50', '$100'],
118        ['Gadget', '1', '$75', '$75']
119    ];
120 
121    let yPos = height - 150;
122    items.forEach(row => {
123        let xPos = 50;
124        row.forEach(cell => {
125            page.drawText(cell, {
126                x: xPos,
127                y: yPos,
128                size: 12,
129                font: helveticaFont
130            });
131            xPos += 120;
132        });
133        yPos -= 25;
134    });
135 
136    const pdfBytes = await pdfDoc.save();
137    fs.writeFileSync('created.pdf', pdfBytes);
138}
139```
140 
141#### Advanced Merge and Split Operations
142```javascript
143import { PDFDocument } from 'pdf-lib';
144import fs from 'fs';
145 
146async function mergePDFs() {
147    // Create new document
148    const mergedPdf = await PDFDocument.create();
149 
150    // Load source PDFs
151    const pdf1Bytes = fs.readFileSync('doc1.pdf');
152    const pdf2Bytes = fs.readFileSync('doc2.pdf');
153 
154    const pdf1 = await PDFDocument.load(pdf1Bytes);
155    const pdf2 = await PDFDocument.load(pdf2Bytes);
156 
157    // Copy pages from first PDF
158    const pdf1Pages = await mergedPdf.copyPages(pdf1, pdf1.getPageIndices());
159    pdf1Pages.forEach(page => mergedPdf.addPage(page));
160 
161    // Copy specific pages from second PDF (pages 0, 2, 4)
162    const pdf2Pages = await mergedPdf.copyPages(pdf2, [0, 2, 4]);
163    pdf2Pages.forEach(page => mergedPdf.addPage(page));
164 
165    const mergedPdfBytes = await mergedPdf.save();
166    fs.writeFileSync('merged.pdf', mergedPdfBytes);
167}
168```
169 
170### pdfjs-dist (Apache License)
171 
172PDF.js is Mozilla's JavaScript library for rendering PDFs in the browser.
173 
174#### Basic PDF Loading and Rendering
175```javascript
176import * as pdfjsLib from 'pdfjs-dist';
177 
178// Configure worker (important for performance)
179pdfjsLib.GlobalWorkerOptions.workerSrc = './pdf.worker.js';
180 
181async function renderPDF() {
182    // Load PDF
183    const loadingTask = pdfjsLib.getDocument('document.pdf');
184    const pdf = await loadingTask.promise;
185 
186    console.log(`Loaded PDF with ${pdf.numPages} pages`);
187 
188    // Get first page
189    const page = await pdf.getPage(1);
190    const viewport = page.getViewport({ scale: 1.5 });
191 
192    // Render to canvas
193    const canvas = document.createElement('canvas');
194    const context = canvas.getContext('2d');
195    canvas.height = viewport.height;
196    canvas.width = viewport.width;
197 
198    const renderContext = {
199        canvasContext: context,
200        viewport: viewport
201    };
202 
203    await page.render(renderContext).promise;
204    document.body.appendChild(canvas);
205}
206```
207 
208#### Extract Text with Coordinates
209```javascript
210import * as pdfjsLib from 'pdfjs-dist';
211 
212async function extractText() {
213    const loadingTask = pdfjsLib.getDocument('document.pdf');
214    const pdf = await loadingTask.promise;
215 
216    let fullText = '';
217 
218    // Extract text from all pages
219    for (let i = 1; i <= pdf.numPages; i++) {
220        const page = await pdf.getPage(i);
221        const textContent = await page.getTextContent();
222 
223        const pageText = textContent.items
224            .map(item => item.str)
225            .join(' ');
226 
227        fullText += `\n--- Page ${i} ---\n${pageText}`;
228 
229        // Get text with coordinates for advanced processing
230        const textWithCoords = textContent.items.map(item => ({
231            text: item.str,
232            x: item.transform[4],
233            y: item.transform[5],
234            width: item.width,
235            height: item.height
236        }));
237    }
238 
239    console.log(fullText);
240    return fullText;
241}
242```
243 
244#### Extract Annotations and Forms
245```javascript
246import * as pdfjsLib from 'pdfjs-dist';
247 
248async function extractAnnotations() {
249    const loadingTask = pdfjsLib.getDocument('annotated.pdf');
250    const pdf = await loadingTask.promise;
251 
252    for (let i = 1; i <= pdf.numPages; i++) {
253        const page = await pdf.getPage(i);
254        const annotations = await page.getAnnotations();
255 
256        annotations.forEach(annotation => {
257            console.log(`Annotation type: ${annotation.subtype}`);
258            console.log(`Content: ${annotation.contents}`);
259            console.log(`Coordinates: ${JSON.stringify(annotation.rect)}`);
260        });
261    }
262}
263```
264 
265## Advanced Command-Line Operations
266 
267### poppler-utils Advanced Features
268 
269#### Extract Text with Bounding Box Coordinates
270```bash
271# Extract text with bounding box coordinates (essential for structured data)
272pdftotext -bbox-layout document.pdf output.xml
273 
274# The XML output contains precise coordinates for each text element
275```
276 
277#### Advanced Image Conversion
278```bash
279# Convert to PNG images with specific resolution
280pdftoppm -png -r 300 document.pdf output_prefix
281 
282# Convert specific page range with high resolution
283pdftoppm -png -r 600 -f 1 -l 3 document.pdf high_res_pages
284 
285# Convert to JPEG with quality setting
286pdftoppm -jpeg -jpegopt quality=85 -r 200 document.pdf jpeg_output
287```
288 
289#### Extract Embedded Images
290```bash
291# Extract all embedded images with metadata
292pdfimages -j -p document.pdf page_images
293 
294# List image info without extracting
295pdfimages -list document.pdf
296 
297# Extract images in their original format
298pdfimages -all document.pdf images/img
299```
300 
301### qpdf Advanced Features
302 
303#### Complex Page Manipulation
304```bash
305# Split PDF into groups of pages
306qpdf --split-pages=3 input.pdf output_group_%02d.pdf
307 
308# Extract specific pages with complex ranges
309qpdf input.pdf --pages input.pdf 1,3-5,8,10-end -- extracted.pdf
310 
311# Merge specific pages from multiple PDFs
312qpdf --empty --pages doc1.pdf 1-3 doc2.pdf 5-7 doc3.pdf 2,4 -- combined.pdf
313```
314 
315#### PDF Optimization and Repair
316```bash
317# Optimize PDF for web (linearize for streaming)
318qpdf --linearize input.pdf optimized.pdf
319 
320# Remove unused objects and compress
321qpdf --optimize-level=all input.pdf compressed.pdf
322 
323# Attempt to repair corrupted PDF structure
324qpdf --check input.pdf
325qpdf --fix-qdf damaged.pdf repaired.pdf
326 
327# Show detailed PDF structure for debugging
328qpdf --show-all-pages input.pdf > structure.txt
329```
330 
331#### Advanced Encryption
332```bash
333# Add password protection with specific permissions
334qpdf --encrypt user_pass owner_pass 256 --print=none --modify=none -- input.pdf encrypted.pdf
335 
336# Check encryption status
337qpdf --show-encryption encrypted.pdf
338 
339# Remove password protection (requires password)
340qpdf --password=secret123 --decrypt encrypted.pdf decrypted.pdf
341```
342 
343## Advanced Python Techniques
344 
345### pdfplumber Advanced Features
346 
347#### Extract Text with Precise Coordinates
348```python
349import pdfplumber
350 
351with pdfplumber.open("document.pdf") as pdf:
352    page = pdf.pages[0]
353    
354    # Extract all text with coordinates
355    chars = page.chars
356    for char in chars[:10]:  # First 10 characters
357        print(f"Char: '{char['text']}' at x:{char['x0']:.1f} y:{char['y0']:.1f}")
358    
359    # Extract text by bounding box (left, top, right, bottom)
360    bbox_text = page.within_bbox((100, 100, 400, 200)).extract_text()
361```
362 
363#### Advanced Table Extraction with Custom Settings
364```python
365import pdfplumber
366import pandas as pd
367 
368with pdfplumber.open("complex_table.pdf") as pdf:
369    page = pdf.pages[0]
370    
371    # Extract tables with custom settings for complex layouts
372    table_settings = {
373        "vertical_strategy": "lines",
374        "horizontal_strategy": "lines",
375        "snap_tolerance": 3,
376        "intersection_tolerance": 15
377    }
378    tables = page.extract_tables(table_settings)
379    
380    # Visual debugging for table extraction
381    img = page.to_image(resolution=150)
382    img.save("debug_layout.png")
383```

### reportlab Advanced Features

#### Create Professional Reports with Tables
```python
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

# Sample data
data = [
    ['Product', 'Q1', 'Q2', 'Q3', 'Q4'],
    ['Widgets', '120', '135', '142', '158'],
    ['Gadgets', '85', '92', '98', '105']
]

# Create PDF with table
doc = SimpleDocTemplate("report.pdf")
elements = []

# Add title
styles = getSampleStyleSheet()
title = Paragraph("Quarterly Sales Report", styles['Title'])
elements.append(title)

# Add table with advanced styling
table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
    ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
    ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
    ('FONTSIZE', (0, 0), (-1, 0), 14),
    ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
    ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
elements.append(table)

doc.build(elements)
```

## Complex Workflows

### Extract Figures/Images from PDF

#### Method 1: Using pdfimages (fastest)
```bash
# Create the output directory first; pdfimages does not create it
mkdir -p images
# Extract all images in their native format (output files are prefixed "images/img")
pdfimages -all document.pdf images/img
```

#### Method 2: Using pypdfium2 + Image Processing
```python
import os
import pypdfium2 as pdfium
import numpy as np

def extract_figures(pdf_path, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)

    for page_num, page in enumerate(pdf):
        # Render high-resolution page; force RGB so the array has 3 channels
        bitmap = page.render(scale=3.0)
        img = bitmap.to_pil().convert("RGB")

        # Convert to numpy for processing
        img_array = np.array(img)

        # Simple content detection (non-white pixels)
        mask = np.any(img_array != 255, axis=2)
        if not mask.any():
            continue  # Blank page

        # Bounding box of all non-white content on the page
        # (This is simplified - separating individual figures from text
        # would need connected-component or contour analysis)
        rows = np.where(mask.any(axis=1))[0]
        cols = np.where(mask.any(axis=0))[0]
        box = (int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1)

        # Save the cropped region
        img.crop(box).save(os.path.join(output_dir, f"page_{page_num + 1}_content.png"))
```

### Batch PDF Processing with Error Handling
```python
import os
import glob
from pypdf import PdfReader, PdfWriter
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def batch_process_pdfs(input_dir, operation='merge'):
    # Sort so the processing (and merge) order is deterministic
    pdf_files = sorted(glob.glob(os.path.join(input_dir, "*.pdf")))

    if operation == 'merge':
        writer = PdfWriter()
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                for page in reader.pages:
                    writer.add_page(page)
                logger.info(f"Processed: {pdf_file}")
            except Exception as e:
                logger.error(f"Failed to process {pdf_file}: {e}")
                continue

        with open("batch_merged.pdf", "wb") as output:
            writer.write(output)

    elif operation == 'extract_text':
        for pdf_file in pdf_files:
            try:
                reader = PdfReader(pdf_file)
                text = ""
                for page in reader.pages:
                    # Guard against pages with no extractable text
                    text += (page.extract_text() or "") + "\n"

                # Swap only the extension; str.replace would also hit ".pdf" elsewhere in the path
                output_file = os.path.splitext(pdf_file)[0] + ".txt"
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(text)
                logger.info(f"Extracted text from: {pdf_file}")

            except Exception as e:
                logger.error(f"Failed to extract text from {pdf_file}: {e}")
                continue

    else:
        raise ValueError(f"Unknown operation: {operation}")
```

### Advanced PDF Cropping
```python
from pypdf import PdfWriter, PdfReader

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Crop page (left, bottom, right, top in points, origin at the bottom-left corner)
# Note: changing the mediabox changes the page size itself;
# set the same attributes on page.cropbox to change only the visible region
page = reader.pages[0]
page.mediabox.left = 50
page.mediabox.bottom = 50
page.mediabox.right = 550
page.mediabox.top = 750

writer.add_page(page)
with open("cropped.pdf", "wb") as output:
    writer.write(output)
```
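
pypdf boxes use PDF user space, whose origin is the bottom-left corner of the page, whereas pdfplumber bounding boxes are `(left, top, right, bottom)` measured from the top-left corner. A minimal sketch of converting one to the other, assuming an unrotated page whose mediabox starts at (0, 0); `top_left_bbox_to_pdf_box` is a hypothetical helper name:

```python
def top_left_bbox_to_pdf_box(bbox, page_height):
    """Convert (left, top, right, bottom) with a top-left origin
    to (left, bottom, right, top) with a bottom-left origin."""
    left, top, right, bottom = bbox
    return (left, page_height - bottom, right, page_height - top)


# A US Letter page is 792 points tall
print(top_left_bbox_to_pdf_box((100, 100, 400, 200), page_height=792))
# (100, 592, 400, 692)
```

The four resulting values map directly onto the `left`, `bottom`, `right`, and `top` attributes set in the cropping example.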

## Performance Optimization Tips

### 1. For Large PDFs
- Use streaming approaches instead of loading the entire PDF into memory
- Use `qpdf --split-pages` for splitting large files
- Process pages individually with pypdfium2

### 2. For Text Extraction
- `pdftotext` (with `-layout` to preserve the page layout) is fastest for plain text extraction; `-bbox-layout` instead emits XHTML with bounding boxes
- Use pdfplumber for structured data and tables
- Avoid pypdf's `page.extract_text()` for very large documents

### 3. For Image Extraction
- `pdfimages` is much faster than rendering pages
- Use low resolution for previews, high resolution for final output
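
PDF page sizes are expressed in points (1/72 inch), and pypdfium2's `scale` parameter multiplies that, so `scale=1.0` corresponds to 72 DPI. A small sketch of picking a scale from a target DPI; `scale_for_dpi` and the two DPI values are illustrative assumptions:

```python
POINTS_PER_INCH = 72  # Size of one PDF user space unit

def scale_for_dpi(dpi):
    """Return the render scale factor that yields the given DPI."""
    return dpi / POINTS_PER_INCH


preview_scale = scale_for_dpi(72)   # Quick preview
final_scale = scale_for_dpi(300)    # Print-quality output
print(preview_scale, round(final_scale, 2))
# 1.0 4.17
```

The result can be passed as `page.render(scale=...)`.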

### 4. For Form Filling
- pdf-lib maintains form structure better than most alternatives
- Pre-validate form fields before processing
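
Pre-validation can be as simple as comparing the values to be written against the field names the form actually defines. A minimal sketch in Python; `check_form_values` is a hypothetical helper and the field names are made up. With pypdf, the names could come from the keys of `PdfReader.get_fields()`:

```python
def check_form_values(field_names, values):
    """Split values into those matching a known field and those that do not."""
    known = set(field_names)
    valid = {name: v for name, v in values.items() if name in known}
    unknown = sorted(name for name in values if name not in known)
    unfilled = sorted(known - set(values))
    return valid, unknown, unfilled


fields = ["first_name", "last_name", "email"]
values = {"first_name": "Ada", "e-mail": "ada@example.com"}
valid, unknown, unfilled = check_form_values(fields, values)
print(valid)     # {'first_name': 'Ada'}
print(unknown)   # ['e-mail']
print(unfilled)  # ['email', 'last_name']
```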

### 5. Memory Management
```python
from pypdf import PdfReader, PdfWriter

# Process PDFs in chunks
def process_large_pdf(pdf_path, chunk_size=10):
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    for start_idx in range(0, total_pages, chunk_size):
        end_idx = min(start_idx + chunk_size, total_pages)
        writer = PdfWriter()

        for i in range(start_idx, end_idx):
            writer.add_page(reader.pages[i])

        # Write each chunk to its own file
        with open(f"chunk_{start_idx//chunk_size}.pdf", "wb") as output:
            writer.write(output)
```
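
The chunk boundaries in this pattern are plain index arithmetic, which can be checked without opening a PDF. A small sketch; `chunk_ranges` is a hypothetical helper and the page count is made up:

```python
def chunk_ranges(total_pages, chunk_size=10):
    """Yield (start, end) page index pairs; end is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


print(list(chunk_ranges(25, 10)))
# [(0, 10), (10, 20), (20, 25)]
```

A 25-page document with `chunk_size=10` therefore produces three chunk files, the last one holding the remaining five pages.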

## Troubleshooting Common Issues

### Encrypted PDFs
```python
# Handle password-protected PDFs
from pypdf import PdfReader

try:
    reader = PdfReader("encrypted.pdf")
    if reader.is_encrypted:
        # decrypt() does not raise on a wrong password; it returns a falsy value
        if not reader.decrypt("password"):
            raise ValueError("Incorrect password")
except Exception as e:
    print(f"Failed to decrypt: {e}")
```

### Corrupted PDFs
```bash
# Check the file for structural problems
qpdf --check corrupted.pdf
# Rewrite the file in place, letting qpdf repair what it can
qpdf --replace-input corrupted.pdf
```
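
When the check is scripted, the exit status of `qpdf --check` tells errors apart from warnings: 0 means no problems, 3 means warnings only, and 2 means errors were found. A minimal sketch of wrapping that in Python; `classify_qpdf_exit` and `check_pdf` are hypothetical helper names, and `check_pdf` assumes `qpdf` is on the `PATH`:

```python
import subprocess

def classify_qpdf_exit(returncode):
    """Map a qpdf exit status to a short label."""
    if returncode == 0:
        return "ok"
    if returncode == 3:
        return "warnings"  # qpdf could still read the file
    return "errors"  # 2 means errors were found; treat anything else as a failure too

def check_pdf(path):
    """Run `qpdf --check` on path and classify the result."""
    result = subprocess.run(["qpdf", "--check", path], capture_output=True, text=True)
    return classify_qpdf_exit(result.returncode)


print(classify_qpdf_exit(3))
# warnings
```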

### Text Extraction Issues
```python
# Fallback to OCR for scanned PDFs
import pytesseract
from pdf2image import convert_from_path

def extract_text_with_ocr(pdf_path):
    # Render each page to an image, then run OCR on it
    images = convert_from_path(pdf_path)
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return text
```
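
Deciding when to fall back can be automated by checking how much text ordinary extraction returned. A rough heuristic sketch; `needs_ocr` is a hypothetical helper and the threshold of 20 characters per page is an arbitrary assumption to tune per document set:

```python
def needs_ocr(extracted_text, page_count=1, min_chars_per_page=20):
    """Guess whether a PDF is scanned from how little text was extracted."""
    # Count only non-whitespace characters
    visible = "".join((extracted_text or "").split())
    return len(visible) < min_chars_per_page * page_count


print(needs_ocr("", page_count=3))
# True
print(needs_ocr("Quarterly Sales Report " * 10, page_count=3))
# False
```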

## License Information

- **pypdf**: BSD License
- **pdfplumber**: MIT License
- **pypdfium2**: Apache/BSD License
- **reportlab**: BSD License
- **poppler-utils**: GPL-2 License
- **qpdf**: Apache License
- **pdf-lib**: MIT License
- **pdfjs-dist**: Apache License
