Suppose you have a PDF file, but it’s too large and you’d like to compress it (perhaps you want to reduce its size to allow for faster transfer over the internet, or perhaps to save storage space).
Even more challenging, suppose you have multiple PDF files you’d like to compress.
Multiple online options exist, but these typically limit the number of files that can be processed at a time. There is also the extra time involved in uploading the originals and then downloading the results. And perhaps you are not comfortable sharing your files with the internet.
Fortunately, we can use Python to address all these concerns. But before we learn how to do this, let’s first learn a little bit about PDF files.
About Compressing PDF Files
According to Dov Isaacs, former Adobe Principal Scientist (see his discussion here), PDF documents are already substantially compressed.
The text and vector graphics portions of the documents are already internally zip-compressed, so there is little opportunity for improvement there.
Instead, any file compression improvements are achieved through compression of image portions of PDF documents, along with potential loss of image quality.
So compression might be achievable, but the user must weigh how much compression is wanted against how much image quality loss is acceptable.
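To illustrate why data that is already zip-compressed leaves little room for improvement, here is a minimal sketch using Python's standard zlib module. The sample text is made up for the demonstration and is not taken from a real PDF; the point is only that a second compression pass gains almost nothing.

```python
import random
import zlib

# Build some repetitive sample "document text" from a small, made-up vocabulary.
rng = random.Random(0)
vocabulary = ["pdf", "text", "vector", "image", "stream", "font", "page", "object"]
raw = " ".join(rng.choice(vocabulary) for _ in range(20000)).encode("ascii")

once = zlib.compress(raw)    # first pass: the text shrinks considerably
twice = zlib.compress(once)  # second pass: almost nothing left to gain

print(f"raw:   {len(raw)} bytes")
print(f"once:  {len(once)} bytes")
print(f"twice: {len(twice)} bytes")
```

The same reasoning suggests why the remaining savings in a PDF tend to come from the images rather than from the text.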
A programmer going by the handle Theeko74 has written a Python script called “pdf_compressor.py”. This script is a wrapper for ghostscript functions that do the actual work of compressing PDF files.
This script is offered under the MIT license and is free to use as the user wishes.
💡 Hint: make sure you have ghostscript installed on your computer. To install ghostscript, follow this detailed guide and come back afterward.
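If you are unsure whether ghostscript is already installed, a small check like the following may help. It is only a sketch: it assumes the executable is named gs (the usual name on Linux and macOS) or gswin64c / gswin32c (the console builds on Windows), and the helper name find_ghostscript is made up for this example.

```python
import shutil


def find_ghostscript():
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    # Candidate names are assumptions: "gs" on Linux/macOS,
    # "gswin64c" / "gswin32c" for the Windows console builds.
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
if gs_path:
    print(f"Ghostscript found at {gs_path}")
else:
    print("Ghostscript not found; install it before continuing.")
```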
Download pdf_compressor.py from GitHub here.
Ultimately we will be writing a Python script to perform the compression.
So we create a directory to hold the script, and use our preferred editor or IDE to create it (this example uses the Linux command line to make the directory, and uses vim as the editor to make the script “bpdfc.py”; use your preferred choice for creating the directory and creating the script within it):
$ mkdir batchPDFcomp
$ cd batchPDFcomp
$ vim bpdfc.py
We won’t write out the script just yet – we’ll show some details for the script a little later in this article.
When we do write the script, within it we’ll import “pdf_compressor.py” as a module.
To prepare for this we should create a subdirectory below our Python script directory.
Also, we’ll need to copy pdf_compressor.py into that subdirectory, and we’ll need to create a file __init__.py within the same subdirectory (those are double underscores each side of ‘init’):
$ mkdir pdfc
$ cp ~/Downloads/pdf_compressor.py ~/batchPDFcomp/pdfc/
$ cd pdfc
$ vim __init__.py
What we have done here is create a local package pdfc containing a module pdf_compressor.
💡 Note: The presence of file __init__.py indicates to Python that the directory is part of a package, and to look there for modules.
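Before writing the script, it can be reassuring to confirm that Python is able to find the module inside the new package. The sketch below uses the standard importlib machinery; the helper name package_is_importable is made up for this example, and it assumes you run it from the batchPDFcomp directory.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def package_is_importable(script_dir, module_name="pdfc.pdf_compressor"):
    """Return True if module_name can be located when script_dir is on sys.path."""
    sys.path.insert(0, str(Path(script_dir).resolve()))
    try:
        importlib.invalidate_caches()
        # For a dotted name, find_spec imports the parent package first and
        # raises ModuleNotFoundError if that parent cannot be found.
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        return False
    finally:
        sys.path.pop(0)


# "." means the current directory, assumed here to be batchPDFcomp.
print(package_is_importable("."))
```

If this prints False, check that the pdfc subdirectory sits directly below the directory you are running from and that pdf_compressor.py was copied into it.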
Now we are ready to write our script.
The PDF Compression Python Script
Here is our script:
from pdfc.pdf_compressor import compress

compress('Finxter_WorldsMostDensePythonCheatSheet.pdf',
         'Finxter_WorldsMostDensePythonCheatSheet_compr.pdf',
         power=4)
As you can see, it’s a very short script.
First we import the “compress” function from the “pdfc.pdf_compressor” module.
Then we call the “compress” function. The function takes as arguments: the input file path, the output file path, and a ‘power’ argument that sets the compression level, ordered from least compression to most according to the documentation in the script.
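To give a feel for what a wrapper like this might do with the power argument, here is a hedged sketch that only builds a ghostscript command line without running it. The preset names are genuine values of ghostscript's -dPDFSETTINGS option, but the numbering from 0 to 4 is an assumption about pdf_compressor.py, so check the documentation at the top of the script for the authoritative list. The names ASSUMED_PRESETS and build_gs_command are made up for this example.

```python
# Assumed mapping from 'power' levels to Ghostscript presets.
# The preset names are real -dPDFSETTINGS values; the numbering is an
# assumption about pdf_compressor.py and should be checked against the script.
ASSUMED_PRESETS = {
    0: "/default",
    1: "/prepress",
    2: "/printer",
    3: "/ebook",
    4: "/screen",
}


def build_gs_command(input_path, output_path, power=0, gs_executable="gs"):
    """Build (but do not run) a Ghostscript command line for PDF compression."""
    return [
        gs_executable,
        "-sDEVICE=pdfwrite",                        # write the result as a PDF
        f"-dPDFSETTINGS={ASSUMED_PRESETS[power]}",  # quality/size preset
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={output_path}",
        input_path,
    ]


print(" ".join(build_gs_command("in.pdf", "out.pdf", power=4)))
```

Under these assumptions, power=4 selects the /screen preset, which downsamples images the most and so gives the smallest files at the lowest image quality.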
Running the Script
Now we can run our script:
$ python bpdfc.py
Compress PDF...
Compression by 51%.
Final file size is 0.2MB
Done.
$
We have only compressed one PDF document in this example, but by modifying the script to loop through multiple PDF documents one can compress multiple files at once.
However, we leave that as an exercise for the reader!
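For readers who would like a starting point for that exercise, here is one possible sketch. The function compress_all and the directory names originals and compressed are made up for this example; the compress function is passed in as an argument so the loop itself does not depend on any particular compressor.

```python
from pathlib import Path


def compress_all(src_dir, dst_dir, compress_fn, power=4):
    """Call compress_fn on every PDF in src_dir, writing the results to dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        # Follow the article's naming: <original name>_compr.pdf
        target = dst_dir / f"{pdf.stem}_compr.pdf"
        compress_fn(str(pdf), str(target), power=power)
        outputs.append(target)
    return outputs


# Possible usage from bpdfc.py (assumes the pdfc package described above
# and hypothetical "originals" / "compressed" directories):
#
#   from pdfc.pdf_compressor import compress
#   compress_all("originals", "compressed", compress, power=4)
```

Keeping the source and destination directories separate avoids recompressing earlier results if the script is run a second time.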
We hope you have found this article useful. Thank you for reading, and we wish you happy coding!