5 Best Ways to Automatically Download a PDF with Selenium WebDriver in Python

💡 Problem Formulation: In many web automation tasks, one common requirement is to automatically download PDF files from a website. Whether for data analysis, record keeping, or archiving, automating this process saves time and effort. This article assumes that you need to download a PDF file from a specific URL using Selenium WebDriver with Python, and you want the file to be saved directly to a designated local directory without any manual intervention.

Method 1: Customize Browser Preferences

This method involves setting the browser preferences to facilitate automatic downloading of PDF files. Specifically, we change the settings in the browser profile to disable the PDF viewer plugin and set a default download directory. This ensures the PDF downloads without user interaction and gets saved directly to a specified path.

Here's an example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": "/path/to/download/directory",  # Default directory for downloads
    "download.prompt_for_download": False,  # Download without showing a "Save as" prompt
    "plugins.always_open_pdf_externally": True,  # Download PDFs instead of opening them in Chrome's viewer
})

# Selenium 4 no longer accepts executable_path; pass the driver path through a Service object
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)
driver.get('http://example.com/somefile.pdf')

Output: The PDF file is downloaded to the specified directory without any prompts.

This code snippet configures Chrome to automatically download PDF files to a given directory without asking for a download location each time. The Chrome WebDriver is initiated with pre-defined options that include disabling the PDF viewer and setting up automatic downloads.
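
Because the download runs in the background, a script that quits the browser right after `driver.get()` can cut the file off. The following is a minimal sketch of a polling helper, assuming Chrome's convention of giving in-progress downloads a `.crdownload` suffix. The name `wait_for_download` and its parameters are hypothetical and not part of Selenium.

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the paths of finished PDF files once no partial downloads remain.

    Assumes in-progress Chrome downloads end in ".crdownload".
    Raises TimeoutError if no completed PDF appears before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        in_progress = [n for n in names if n.endswith(".crdownload")]
        finished = sorted(n for n in names if n.lower().endswith(".pdf"))
        if finished and not in_progress:
            return [os.path.join(directory, n) for n in finished]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No completed PDF in {directory!r} after {timeout}s")
        time.sleep(poll_interval)
```

A typical use would be to call `wait_for_download("/path/to/download/directory")` after `driver.get(...)` and before `driver.quit()`.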

Method 2: Using Firefox and Custom Profile

Similar to Method 1, we can download the PDF automatically by setting up a custom Firefox profile. The profile's preferences are set so that downloading a PDF won't bring up any dialog boxes or prompts, making the process hands-free and seamless.

Here's an example:

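from selenium import webdriver
from selenium.webdriver.firefox.options import Options

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)  # 2 = use the custom directory set below
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", "/path/to/download/directory")
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
profile.set_preference("pdfjs.disabled", True)  # Turn off the built-in PDF viewer so the file is saved

# Selenium 4 no longer accepts firefox_profile=; attach the profile to an Options object
options = Options()
options.profile = profile

driver = webdriver.Firefox(options=options)
driver.get('http://example.com/somefile.pdf')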

Output: The PDF file is downloaded to the specified directory without any prompts.

By creating a custom Firefox profile and applying certain preferences, we ensure that the browser saves the PDF directly to the given path without any interruptions. We set the MIME type for PDF to tell Firefox not to prompt for action once a download is initiated.

Method 3: Headless Mode Download

Headless mode can be beneficial for PDF downloading on servers or environments without a UI. While the browser does not visibly open, it operates in the background, and with the correct setup, it can also download files as required.

Here's an example:

# Using Chrome in headless mode
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_experimental_option("prefs", {"download.default_directory": "/path/to/download/directory"})

driver = webdriver.Chrome(options=chrome_options)  # The chrome_options= keyword is deprecated; use options=
driver.get('http://example.com/somefile.pdf')

Output: The PDF is downloaded without any browser UI interaction in the specified directory.

By running Chrome in headless mode with the relevant options set, we are able to download files without having the browser UI interrupt the process. This method is crucial for automated workflows on headless servers or in continuous integration systems.
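
Some older Chrome versions block downloads in headless mode even when the preferences are set. A workaround that is often used is to enable downloads through the Chrome DevTools Protocol. The sketch below assumes a Selenium Chrome or Chromium driver, which exposes `execute_cdp_cmd()`. The helper name `allow_headless_downloads` is hypothetical, and `Page.setDownloadBehavior` is marked experimental in the protocol, so its behavior may differ between Chrome versions.

```python
import os


def allow_headless_downloads(driver, download_dir):
    """Ask Chrome, via the DevTools protocol, to save downloads in download_dir.

    `driver` is assumed to be a Selenium Chrome/Chromium driver that
    provides execute_cdp_cmd(); the parameters sent are returned for inspection.
    """
    params = {
        "behavior": "allow",
        "downloadPath": os.path.abspath(download_dir),
    }
    driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
    return params
```

It would be called once, right after the driver is created and before `driver.get(...)`, for example `allow_headless_downloads(driver, "/path/to/download/directory")`.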

Method 4: Handling Download Pop-up with WebDriverWait

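If the automatic download methods above aren't suitable, we can also programmatically accept an in-page download prompt using explicit waits. This isn't as seamless as disabling the dialog, but it's a necessary technique in some situations. Note that Selenium can only interact with elements rendered inside the page, not with the native save dialogs of the browser or operating system.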

Here's an example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get('http://example.com')
# find_element_by_link_text was removed in Selenium 4; use find_element with a By locator
download_link = driver.find_element(By.LINK_TEXT, "Download PDF")
download_link.click()

# Wait until the download button in the in-page popup is clickable, then click it
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "download"))).click()

Output: The PDF is downloaded after the script clicks on the download button from the popup.

This approach requires navigating through the UI elements with Selenium, waiting for the necessary elements to be interactable, and simulating the user's actions to approve the file download.
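
When a click triggers the download, the file name is chosen by the server, so the script may not know which file to look for. One possible approach, sketched here with a hypothetical helper `new_files`, is to take a snapshot of the download directory before clicking and compare it afterwards. The sketch assumes the browser's temporary files end in `.crdownload` (Chrome) or `.part` (Firefox).

```python
import os


def new_files(directory, before):
    """Return names now in `directory` that were not in the `before` snapshot.

    Browser temp files (.crdownload for Chrome, .part for Firefox) are skipped.
    """
    temp_suffixes = (".crdownload", ".part")
    current = set(os.listdir(directory))
    return sorted(n for n in current - set(before) if not n.endswith(temp_suffixes))
```

In practice you would record `before = os.listdir(download_dir)` ahead of the click, then poll `new_files(download_dir, before)` until it returns a non-empty list.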

Bonus One-Liner Method 5: Quick Download with Requests

If setting up the browser is not an option or overkill for your needs, using the Python Requests library to quickly download the file might be the best solution. Please note that this approach bypasses Selenium.

Here's an example:

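import requests

url = 'http://example.com/somefile.pdf'
r = requests.get(url, allow_redirects=True, timeout=30)
r.raise_for_status()  # Fail early on HTTP errors instead of saving an error page
# The context manager closes the file even if the write fails
with open('/path/to/download/directory/somefile.pdf', 'wb') as f:
    f.write(r.content)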

Output: The PDF file is saved to the local directory.

This short snippet requests the PDF from the URL and writes the response body to a file in binary mode. This method is efficient and straightforward but does not make use of the Selenium WebDriver.
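
A server can answer with an HTML error or login page even when the request itself succeeds, and that page would then be saved under a `.pdf` name. As a rough safeguard, the sketch below checks for the `%PDF-` signature that PDF files normally begin with before writing anything. The helpers `looks_like_pdf` and `save_pdf` are hypothetical names, and the check is a heuristic rather than full validation.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF file signature "%PDF-"."""
    return data[:5] == b"%PDF-"


def save_pdf(data, path):
    """Write data to path only if it looks like a PDF; return the path."""
    if not looks_like_pdf(data):
        raise ValueError("Response body does not start with the %PDF- signature")
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With the Requests example, this could be used as `save_pdf(r.content, '/path/to/download/directory/somefile.pdf')`.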

Summary/Discussion

  • Method 1: Customize Browser Preferences. Strengths: Directly saves PDF files to the specified location. No need to handle any popups. Weaknesses: Limited to Chrome. May require additional configuration for different browsers.
  • Method 2: Using Firefox and Custom Profile. Strengths: Offers more control over browser behavior. Good for situations where Chrome is not preferred. Weaknesses: Similar to Method 1, it may not handle all popups and could be browser-specific.
  • Method 3: Headless Mode Download. Strengths: Ideal for server-side automation where UI is not available. Weaknesses: Older Chrome versions block downloads in headless mode unless they are explicitly allowed, and failures are harder to debug without a visible browser.
  • Method 4: Handling Download Pop-up with WebDriverWait. Strengths: Useful when automatic download isn't possible. Simulates natural user interaction. Weaknesses: Depends on the UI, which could change and break the script. Works only for in-page prompts, not native browser dialogs.
  • Method 5: Quick Download with Requests. Strengths: Simplest method. Does not depend on Selenium or a web browser. Weaknesses: Cannot execute JavaScript, and pages behind a login need cookies or credentials supplied explicitly.