
【Practical】Lossless PDF Compression: Structural Optimization Solution Based on PyMuPDF

Doclingo Team · January 30, 2026

In daily work we often run into oversized PDF files: email attachment size limits, slow uploads and downloads, and wasted storage space. Traditional compression methods either reduce image quality, leaving pages blurry, or rasterize text into images, destroying searchability; neither trade-off is acceptable.

Today, I will share a lossless PDF compression solution—optimizing the internal structure of PDFs and cleaning redundant data using the PyMuPDF (fitz) library, effectively reducing file size while perfectly preserving text searchability and visual clarity.

I. Core Implementation Principle

Here is the core code implementing PDF structural optimization and compression:

import os
from pathlib import Path
import fitz  # PyMuPDF

def compress_pdf_simple(input_path, output_path=None):
    """
    Simple PDF compression method - only performs structural optimization (lossless)
    Parameters:
        input_path (str): Input PDF file path
        output_path (str): Output PDF file path
    Returns:
        str: Output file path
    """
    try:
        # Open the PDF file
        doc = fitz.open(input_path)
        
        # Smartly generate output path (if not specified)
        if output_path is None:
            input_file = Path(input_path)
            output_path = str(input_file.parent / f"{input_file.stem}_simple_compressed{input_file.suffix}")
        
        # Core: Achieve lossless compression through parameter combination
        doc.save(
            output_path,
            garbage=4,  # Maximize cleaning of unused objects
            deflate=True,  # Use deflate lossless compression algorithm
            clean=True,  # Clean/optimize PDF internal structure
            pretty=False  # Compact output, remove whitespace characters
        )
        
        # Close the document to release resources
        doc.close()
        
        # Calculate compression information
        original_size = os.path.getsize(input_path)
        compressed_size = os.path.getsize(output_path)
        compression_ratio = (1 - compressed_size / original_size) * 100
        
        # Output compression results
        print("✅ Simple compression completed!")
        print(f"📄 Original file: {input_path} ({original_size / 1024 / 1024:.2f} MB)")
        print(f"📦 Compressed file: {output_path} ({compressed_size / 1024 / 1024:.2f} MB)")
        print(f"📉 Compression ratio: {compression_ratio:.1f}%")
        print("🔍 Text searchability retained: Yes")
        
        return output_path

    except Exception as e:
        print(f"❌ An error occurred during compression: {str(e)}")
        return None

# Example call
# compress_pdf_simple("your_file.pdf")

The core of this solution is to utilize the four key parameters of PyMuPDF's save() method to achieve "slimming" of the PDF without modifying the document content itself.

II. In-Depth Analysis of Key Parameters

2.1 garbage=4: Precise Cleaning of "Orphan" Objects

PDF files contain a large number of indirect objects (pages, fonts, images, annotations, etc.), and deleting content during document editing does not immediately clean up the underlying references, leading to a buildup of "orphan objects."

| `garbage` value | Cleaning level | Applicable scenario |
|---|---|---|
| 0 | No cleaning | Quick save only; no compression needed |
| 1 | Remove unreferenced ("orphan") objects | Light optimization; prioritize compatibility |
| 2 | Additionally compact the cross-reference table | Regular optimization; balances effect and safety |
| 3 | Additionally merge duplicate objects | Stronger compression for most PDF files |
| 4 | Additionally detect duplicated stream contents | Heavily edited documents; verify integrity afterwards |

Process Flow: Traverse all indirect objects → Establish reference relationship graph → Delete unreferenced objects → Free up storage space.
Actual Effect: Documents edited multiple times can reduce size by 10-30%.

2.2 deflate=True: Universal Lossless Compression Algorithm

Deflate is the lossless compression algorithm recommended by the PDF specification (based on LZ77 + Huffman coding), with maximum compatibility, and almost all PDF readers can decode it.

When enabled, it compresses the following objects:

  • Page content streams
  • Font data streams
  • Image data streams not compressed by other algorithms
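PDF's FlateDecode filter is the same deflate stream that Python's standard-library zlib module produces, so the "lossless" claim is easy to verify with a stdlib-only sketch (the sample content stream below is illustrative, not taken from a real file):

```python
import zlib

# A page content stream is mostly repetitive PDF operators, which is
# exactly the kind of input LZ77 + Huffman coding handles well.
stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 100

compressed = zlib.compress(stream, level=9)
restored = zlib.decompress(compressed)

assert restored == stream  # lossless: the round trip is byte-exact
print(f"{len(stream)} -> {len(compressed)} bytes")
```

Because decompression restores the exact original bytes, no text, font, or vector data is ever degraded.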

2.3 clean=True: Optimize Document Structure

PDF consists of two parts: "document structure" and "content streams." The clean=True option specifically optimizes the structural part:

  • Remove duplicate PDF objects
  • Merge references of identical content objects
  • Optimize page tree structure
  • Clean redundant metadata information

Actual Effect: Multi-page documents can reduce size by 5-15%.

2.4 pretty=False: Compact Output

PDF is essentially a binary format. pretty=True keeps indentation, line breaks, and other whitespace to aid human readability, while pretty=False strips all unnecessary whitespace, shaving off a little more size (the gain is marginal once streams are deflated, but it costs nothing).

III. Complete Compression Process

[Figure omitted: PDF compression flowchart (open document → garbage-collect unused objects → deflate streams → write compact output)]

IV. Technical Challenges and Solutions

4.1 Compression Ratio vs Document Integrity

Challenge: Aggressive garbage collection (garbage=4) may damage complex PDFs (containing scripts/forms).
Solution:

  • Verify document integrity before compression
  • Encrypted PDFs need to be decrypted before compression
  • Use try-except to catch exceptions and avoid program crashes

4.2 Retaining Text Searchability

Challenge: Some compression solutions may convert text into images, making it unselectable/unsearchable.
Advantages of this solution:

  • Does not modify text content and encoding
  • Only optimizes object reference relationships
  • Retains font object integrity
  • Verification method: Test text selection functionality with a PDF reader after compression

4.3 Memory Management for Large Files

Challenge: PDFs of several hundred MB can easily lead to memory overflow.
Optimization Suggestions:

  • Process large files in pages
  • Close doc objects promptly to release resources
  • Monitor memory usage and adopt streaming processing

4.4 Smartly Generating Output Paths

Utilize the pathlib library to automatically generate paths, maintaining the original file name + compression tag, ensuring cross-platform compatibility:

if output_path is None:
    input_file = Path(input_path)
    output_path = str(input_file.parent / f"{input_file.stem}_simple_compressed{input_file.suffix}")
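Pulled out as a standalone function (the name `default_output_path` is ours, not the article's), the behavior is easy to see:

```python
from pathlib import Path

def default_output_path(input_path):
    """Derive '<stem>_simple_compressed<suffix>' next to the input file."""
    input_file = Path(input_path)
    return str(input_file.parent / f"{input_file.stem}_simple_compressed{input_file.suffix}")

print(default_output_path("reports/annual.pdf"))
# -> reports/annual_simple_compressed.pdf  (forward slashes on POSIX)
```

Because pathlib handles separators per platform, the same code yields correct paths on Windows and Unix alike.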

V. Use Cases and Effect Estimation

5.1 Applicable Scenarios

  • Academic documents: PDFs downloaded from arXiv often contain a lot of redundant information
  • E-book archiving: Compressing saves storage space
  • Document transmission: Pre-process before uploading to email/cloud storage
  • Batch processing: Combine with Celery for automated compression

5.2 Effect Estimation

| Document type | Typical compression ratio | Remarks |
|---|---|---|
| Pure text documents | 10-25% | Most significant effect |
| Image-heavy documents | 5-15% | Relies on structural optimization only |
| Documents edited multiple times | 20-40% | Garbage collection pays off most here |
| Scanned PDFs | 0-5% | Pages are already images; little to optimize |

5.3 Precautions

  • Backup original files: Compression is irreversible
  • Verify content integrity: Comprehensive checks after compression
  • Test printing effects: Avoid impacting output
  • For large batch processing, a distributed architecture is recommended
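The "backup first" precaution is cheap to enforce in code. This is a sketch with a hypothetical wrapper (`backup_then_compress` is our name); it works with any single-file compressor such as compress_pdf_simple:

```python
import shutil
from pathlib import Path

def backup_then_compress(input_path, compress_fn):
    """Copy the original aside, then run the given compressor.

    compress_fn is any function taking a path string, e.g.
    compress_pdf_simple.
    """
    src = Path(input_path)
    backup = src.with_name(f"{src.stem}_backup{src.suffix}")
    shutil.copy2(src, backup)  # copy2 preserves timestamps/metadata
    return compress_fn(str(src))
```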

VI. Comparison with Similar Solutions

| Compression solution | Compression ratio | Quality loss | Text searchability | Implementation complexity |
|---|---|---|---|---|
| PyMuPDF structural optimization | 10-30% | None | Retained | Low |
| Image quality reduction | 30-70% | Obvious | Retained | Medium |
| Re-encoding the PDF | 20-50% | Possible | Possibly lost | High |

VII. Performance Optimization Suggestions

  1. Batch Processing: Use multiprocessing to concurrently process multiple files, improving efficiency
  2. Progress Monitoring: Add progress callback functions to enhance user experience
  3. Incremental Compression: Only compress changed pages in the PDF to reduce repetitive operations
  4. Caching Mechanism: Record already compressed files to avoid duplicate processing
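Point 1 can be sketched with the standard library alone. The helper name `compress_batch` is ours; threads (rather than processes) are a reasonable fit here because much of the work is file I/O:

```python
from concurrent.futures import ThreadPoolExecutor

def compress_batch(paths, compress_fn, max_workers=4):
    """Run a single-file compressor over many PDFs concurrently.

    compress_fn is any function taking one path, such as
    compress_pdf_simple. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress_fn, paths))
```

Usage would be `compress_batch(["a.pdf", "b.pdf"], compress_pdf_simple)`; for CPU-bound workloads, swapping in ProcessPoolExecutor is a drop-in change.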

VIII. Conclusion

Key Points Review

  1. This solution is based on PyMuPDF's four core parameters: garbage=4, deflate=True, clean=True, pretty=False, achieving lossless PDF compression;
  2. The advantages of the solution are that it does not compromise document quality, retains text searchability, and is simple to implement with good compatibility;
  3. The compression effects vary significantly for different types of PDFs, with limited optimization for scanned documents, while pure text documents edited multiple times yield the best results.

This lossless PDF compression solution balances practicality and safety, with simple and easily integrable code, suitable for most document processing scenarios. If a higher compression ratio is needed, it can also be combined with lossless image compression (such as optimizing DPI), but care should be taken to balance effect and complexity.

Copyright © 2026 Doclingo. All Rights Reserved.