5 Best Ways to Download All PDF Files with Selenium Python

πŸ’‘ Problem Formulation: You want to automate the process of downloading multiple PDF files from a webpage using Python and Selenium WebDriver. Imagine a scenario where you navigate to a resource page full of report links, and you wish to download each report provided as a PDF file without manually clicking each link. This article demonstrates how to achieve this task programmatically.

Method 1: Using Selenium WebDriver with Chrome

This method leverages the Chrome WebDriver to navigate through the site, find PDF file links, and instruct the browser to download them. It takes advantage of Chrome’s ability to set preferences for automatic downloads, enabling the download of all PDFs without user intervention.

Here’s an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_experimental_option('prefs', {
    "download.default_directory": "/path/to/download/directory",  # Change the default download directory
    "download.prompt_for_download": False,  # Download files without asking
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # Download PDFs instead of opening them in Chrome's viewer
})

driver = webdriver.Chrome(options=options)
driver.get('http://example.com/pdfpage.html')

# Collect the hrefs first: navigating away makes the elements stale
pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
hrefs = [link.get_attribute('href') for link in pdf_links]

for href in hrefs:
    driver.get(href)  # Triggers the download thanks to the prefs above

The output will be the PDF files downloaded to the specified directory.

This code initializes the Chrome WebDriver with preferences that make the browser download PDFs automatically instead of rendering them. The script collects every link whose href contains ‘.pdf’, then navigates to each URL, which triggers the downloads.
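One caveat: the script fires each download and moves on, and if the driver quits before Chrome finishes writing, partial files are left behind. A small polling helper (a sketch; the directory path is an assumption) can wait until no in-progress `.crdownload` files remain:

```python
import os
import time

def wait_for_downloads(directory, timeout=60):
    """Poll the download directory until no Chrome '.crdownload'
    partial files remain, or until the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        partial = [f for f in os.listdir(directory) if f.endswith('.crdownload')]
        if not partial:
            return True
        time.sleep(0.5)
    return False
```

Call `wait_for_downloads("/path/to/download/directory")` before `driver.quit()` so the last files have time to finish.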

Method 2: Modifying HTTP Request Headers

In this method, Selenium only discovers the PDF links; the downloads themselves are issued as plain HTTP requests with the requests library. Because we control each request directly, we can attach whatever headers we need (a User-Agent, for instance), and we write the returned PDF bytes to disk ourselves.

Here’s an example:

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/pdfpage.html')

# Collect the hrefs first: navigating away makes the elements stale
pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
hrefs = [link.get_attribute('href') for link in pdf_links]

for href in hrefs:
    filename = href.split('/')[-1]  # Name each file after its URL so downloads don't overwrite each other
    response = requests.get(href, stream=True)
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)

The output is the PDF files saved in the current working directory, each named after the last segment of its URL.

This snippet uses Selenium to find the PDF links and the requests library to fetch them, streaming each response in chunks so large files never have to fit in memory at once.
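If the PDFs sit behind a login, a plain `requests.get` will fail even though the browser session is authenticated. Copying the WebDriver's cookies into a `requests.Session` carries the authentication over (a sketch, assuming `driver` is already logged in):

```python
import requests

def session_from_driver(driver):
    """Build a requests.Session that reuses the browser's cookies,
    so authenticated downloads work outside the browser."""
    session = requests.Session()
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'],
                            domain=cookie.get('domain'))
    return session
```

You would then replace `requests.get(href, stream=True)` with `session.get(href, stream=True)` in the loop above.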

Method 3: Utilizing the Firefox Profile with Selenium

This approach is similar to method 1 but uses Firefox instead of Chrome. It demonstrates setting up a Firefox profile that automatically downloads files without asking, specifically tailored for PDF file downloads.

Here’s an example:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.set_preference("browser.download.folderList", 2)  # 2 = use the custom directory set below
options.set_preference("browser.download.dir", "/path/to/download/directory")  # Path to save the files
options.set_preference("browser.download.useDownloadDir", True)
options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")  # Save PDFs without prompting

driver = webdriver.Firefox(options=options)
driver.get('http://example.com/pdfpage.html')

# Collect the hrefs first: navigating away makes the elements stale
pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
hrefs = [link.get_attribute('href') for link in pdf_links]

for href in hrefs:
    driver.get(href)

The output will be the PDF files downloaded in the specified directory.

By setting these Firefox preferences, Selenium tells Firefox to save PDF files straight to the chosen directory without prompting. The script then proceeds as before: collecting the PDF links and navigating to each one.
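Two details often trip this method up: servers sometimes label PDFs as `application/octet-stream` rather than `application/pdf`, and Firefox's built-in PDF viewer (pdf.js) can intercept the file and render it instead of downloading. Listing extra MIME types and disabling pdf.js makes the setup more robust. A sketch of the additional preferences, each applied with `options.set_preference(name, value)` before starting the driver:

```python
# Extra Firefox preferences that make automatic PDF downloads more reliable.
# Apply each with options.set_preference(name, value) before starting the driver.
extra_prefs = {
    # Accept the MIME types servers commonly use when serving PDFs
    "browser.helperApps.neverAsk.saveToDisk":
        "application/pdf,application/octet-stream,application/x-pdf",
    # Disable Firefox's built-in PDF viewer (pdf.js) so files download
    # instead of opening in a tab
    "pdfjs.disabled": True,
}
```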

Method 4: Intercepting Downloads Using a Proxy Server

Intercepting and handling downloads through a proxy server allows for manipulation or monitoring of web traffic. This could be utilized to download PDF files by capturing their requests and redirecting them to a local file.

Here’s an example:

# This example assumes that you have a working proxy setup that can intercept and handle file downloads.
# Implementing a proxy server is beyond the scope of this article, but there are multiple libraries available for creating one.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.proxy import Proxy, ProxyType

proxy_address = 'your_proxy:port'
options = Options()
options.proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': proxy_address,
    'sslProxy': proxy_address,
})

driver = webdriver.Firefox(options=options)
driver.get('http://example.com/pdfpage.html')

# Normal link finding and clicking operations would follow

The output is contingent on the proxy server’s configuration to handle and save the files.

The code configures Selenium to route the browser's traffic through a proxy server. When the browser requests a PDF, the proxy sees the response and can handle it however its configuration specifies, for example by saving the file locally.
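Whatever proxy you use, its core job here is deciding which responses are PDFs. A proxy-agnostic check (a sketch; a real proxy would pass in the actual header and body values it intercepts) can combine the `Content-Type` header with the `%PDF` magic bytes that begin every PDF file:

```python
def looks_like_pdf(content_type, body_start):
    """Heuristically decide whether an HTTP response carries a PDF,
    using the Content-Type header and the file's leading bytes."""
    if content_type and 'application/pdf' in content_type.lower():
        return True
    # Every valid PDF begins with the magic bytes '%PDF'
    return body_start.startswith(b'%PDF')
```

Checking the magic bytes catches PDFs served with a generic content type such as `application/octet-stream`.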

Bonus One-Liner Method 5: Browser-less Download with Requests and BeautifulSoup

For a fast, headless operation, Python’s requests library combined with BeautifulSoup can identify and download PDF files without the overhead of a full browser instance.

Here’s an example:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'http://example.com/pdfpage.html'
response = requests.get(base_url)
soup = BeautifulSoup(response.text, 'html.parser')
pdf_links = [urljoin(base_url, a['href'])  # Resolve relative links against the page URL
             for a in soup.find_all('a', href=True)
             if a['href'].endswith('.pdf')]

for link in pdf_links:
    r = requests.get(link)
    with open(link.split('/')[-1], 'wb') as f:
        f.write(r.content)

The output will be the PDF files saved in the current working directory.

The code retrieves the HTML content, parses out the PDF links, and downloads each file directly. It is fast and light on memory, but unlike Selenium it cannot execute JavaScript, so it misses links that are generated dynamically.
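One detail worth handling when naming files by URL: query strings and trailing slashes produce awkward or empty filenames. A small helper (a hypothetical sketch, not part of the snippet above) derives a safe local name from a PDF URL:

```python
import os
from urllib.parse import urlparse

def filename_from_url(url, default='download.pdf'):
    """Derive a local filename from a PDF URL, stripping any query
    string and falling back to a default when the path has no name."""
    path = urlparse(url).path  # Drops the '?query' and '#fragment' parts
    name = os.path.basename(path)
    return name if name else default
```

For example, `filename_from_url('http://example.com/reports/a.pdf?v=2')` yields `'a.pdf'`.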

Summary/Discussion

  • Method 1: Chrome WebDriver. Easy to implement. Limited to Chrome.
  • Method 2: HTTP Headers with requests. More control over downloads without using a browser. Requires additional libraries.
  • Method 3: Firefox Profile. Similar to Method 1 but for Firefox users. Requires setup of Firefox profile.
  • Method 4: Proxy Server. High control over web traffic. Complex setup and configuration.
  • Method 5: Requests and BeautifulSoup. Fast and lightweight. Not suitable for JavaScript-heavy websites where links are dynamically generated.