💡 Problem Formulation: You want to automate downloading multiple PDF files from a webpage using Python and Selenium WebDriver. Imagine navigating to a resource page full of report links and wanting to download each report as a PDF file without manually clicking every link. This article demonstrates how to achieve this task programmatically.
Method 1: Using Selenium WebDriver with Chrome
This method leverages the Chrome WebDriver to navigate through the site, find PDF file links, and instruct the browser to download them. It takes advantage of Chrome’s ability to set preferences for automatic downloads, enabling the download of all PDFs without user intervention.
Here’s an example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_experimental_option('prefs', {
    "download.default_directory": "/path/to/download/directory",  # change default directory for downloads
    "download.prompt_for_download": False,  # download files without asking
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # don't open PDFs in Chrome's built-in viewer
})

driver = webdriver.Chrome(options=options)
driver.get('http://example.com/pdfpage.html')

pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
for link in pdf_links:
    href = link.get_attribute('href')
    driver.get(href)
The output will be the PDF files downloaded to the specified directory.
This code initializes the Chrome WebDriver with options that make Chrome download PDFs automatically instead of opening them in its built-in viewer. The script finds every link whose href contains ‘.pdf’ and navigates to each one, triggering the downloads.
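One gap in the script above: driver.get() returns before Chrome has finished writing the file, so quitting the driver too early can truncate downloads. A small polling helper (a sketch; pass it whatever directory you configured in download.default_directory) can wait until no partial downloads remain — Chrome marks in-progress downloads with a .crdownload suffix:

```python
import os
import time

def wait_for_downloads(directory, timeout=60):
    """Block until no Chrome partial-download (.crdownload) files remain.

    Returns True once the directory is free of partial files,
    False if the timeout elapses first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not any(name.endswith(".crdownload") for name in os.listdir(directory)):
            return True
        time.sleep(0.5)  # poll twice per second
    return False
```

Call it after the download loop and before driver.quit().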
Method 2: Modifying HTTP Request Headers
In this method, we take the download out of the browser entirely: Selenium locates the PDF links, and the requests library fetches each file with request headers of our choosing. Setting headers such as User-Agent lets us mimic a normal browser so the server sends the PDF directly, and requests then saves it to disk.
Here’s an example:
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://example.com/pdfpage.html')

headers = {"User-Agent": "Mozilla/5.0"}  # present ourselves as a regular browser

pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
for link in pdf_links:
    href = link.get_attribute('href')
    filename = href.split('/')[-1]  # name each file after the URL so downloads don't overwrite each other
    response = requests.get(href, headers=headers, stream=True)
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
The output is the PDF files saved to the local disk.
This snippet combines Selenium, which finds the PDF links, with the requests library, which downloads the files. Streaming the response in chunks keeps memory use low even for large files.
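If the page requires a login, the requests call will fail where the browser succeeded, because requests knows nothing about the browser's session. One common remedy is to copy the WebDriver's cookies into a requests.Session — a sketch, assuming the cookies alone carry the authentication:

```python
import requests

def session_from_driver(driver):
    """Copy the browser's cookies into a requests.Session so that
    downloads made outside the browser reuse the logged-in session."""
    session = requests.Session()
    for cookie in driver.get_cookies():  # Selenium returns a list of dicts
        session.cookies.set(cookie["name"], cookie["value"],
                            domain=cookie.get("domain"))
    return session
```

In the download loop, `session.get(href, stream=True)` then replaces `requests.get(href, stream=True)`.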
Method 3: Utilizing the Firefox Profile with Selenium
This approach is similar to method 1 but uses Firefox instead of Chrome. It demonstrates setting up a Firefox profile that automatically downloads files without asking, specifically tailored for PDF file downloads.
Here’s an example:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.common.by import By

profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)  # 2 = use the custom directory below
profile.set_preference("browser.download.dir", "/path/to/download/directory")  # path to save the files
profile.set_preference("browser.download.useDownloadDir", True)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")  # don't prompt for PDFs

options = Options()
options.profile = profile  # recent Selenium versions attach the profile via Options

driver = webdriver.Firefox(options=options)
driver.get('http://example.com/pdfpage.html')

pdf_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
for link in pdf_links:
    href = link.get_attribute('href')
    driver.get(href)
The output will be the PDF files downloaded in the specified directory.
By setting the Firefox profile preferences, Selenium instructs Firefox to download PDF files directly without prompting. The script then proceeds as before: finding the PDF links and navigating to each one.
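One caveat: Firefox matches the Content-Type header the server sends, not the file extension. If a server labels its PDFs as, say, application/octet-stream, the single application/pdf entry will not suppress the prompt. A sketch of building a broader preference value (the extra MIME types listed are common for PDFs but not exhaustive):

```python
# MIME types under which servers commonly deliver PDF files.
PDF_MIME_TYPES = [
    "application/pdf",
    "application/x-pdf",
    "application/octet-stream",
]

# Firefox expects a single comma-separated string for this preference.
never_ask = ",".join(PDF_MIME_TYPES)
# profile.set_preference("browser.helperApps.neverAsk.saveToDisk", never_ask)
```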
Method 4: Intercepting Downloads Using a Proxy Server
Intercepting and handling downloads through a proxy server allows for manipulation or monitoring of web traffic. This could be utilized to download PDF files by capturing their requests and redirecting them to a local file.
Here’s an example:
# This example assumes that you have a working proxy setup that can intercept
# and handle file downloads. Implementing a proxy server is beyond the scope
# of this article, but tools such as mitmproxy can fill that role.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.proxy import Proxy, ProxyType

PROXY = 'your_proxy:port'

proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = PROXY
proxy.ssl_proxy = PROXY

options = Options()
options.proxy = proxy

driver = webdriver.Firefox(options=options)
driver.get('http://example.com/pdfpage.html')
# Normal link finding and navigation operations would follow
The output is contingent on the proxy server’s configuration to handle and save the files.
The code configures Selenium to use a proxy server. Once the browser requests a download, the proxy server catches this and can then handle it in a manner specified by its configuration, potentially saving the PDF locally.
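As a rough sketch of what the proxy side could look like, here is a minimal addon script for mitmproxy (run with mitmdump -s save_pdfs.py; the script name and SAVE_DIR are assumptions) that saves every PDF response passing through:

```python
import os

SAVE_DIR = "/path/to/download/directory"  # assumed target directory

def is_pdf(content_type):
    """Return True when a Content-Type header value denotes a PDF."""
    if content_type is None:
        return False
    # Strip parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip() == "application/pdf"

def response(flow):
    """mitmproxy hook: called once per completed HTTP response."""
    if is_pdf(flow.response.headers.get("content-type")):
        # Derive a filename from the request path, ignoring any query string.
        name = flow.request.path.split("?")[0].rstrip("/").split("/")[-1] or "download.pdf"
        with open(os.path.join(SAVE_DIR, name), "wb") as f:
            f.write(flow.response.content)
```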
Bonus One-Liner Method 5: Browser-less Download with Requests and BeautifulSoup
For a fast, headless operation, Python’s requests library combined with BeautifulSoup can identify and download PDF files without the overhead of a full browser instance.
Here’s an example:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'http://example.com/pdfpage.html'
response = requests.get(page_url)
soup = BeautifulSoup(response.text, 'html.parser')

pdf_links = [urljoin(page_url, a['href'])  # resolve relative hrefs against the page URL
             for a in soup.find_all('a', href=True)
             if a['href'].endswith('.pdf')]

for link in pdf_links:
    r = requests.get(link)
    with open(link.split('/')[-1], 'wb') as f:
        f.write(r.content)
The output will be the PDF files saved in the current working directory.
The code retrieves the HTML content, parses it for PDF links, and downloads each file directly. It is efficient in both memory and processing, though unlike Selenium it cannot execute JavaScript, so it only sees links present in the raw HTML.
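The link-extraction step can be factored into a small reusable helper (a sketch; extract_pdf_links is a name chosen here, not a library function). The same function works on response.text from requests or, for JavaScript-rendered pages, on driver.page_source from Selenium:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_pdf_links(html, base_url):
    """Return absolute URLs of all anchors whose href ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"])  # make relative hrefs absolute
            for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".pdf")]
```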
Summary/Discussion
- Method 1: Chrome WebDriver. Easy to implement. Limited to Chrome.
- Method 2: HTTP Headers with requests. More control over downloads without using a browser. Requires additional libraries.
- Method 3: Firefox Profile. Similar to Method 1 but for Firefox users. Requires setup of Firefox profile.
- Method 4: Proxy Server. High control over web traffic. Complex setup and configuration.
- Method 5: Requests and BeautifulSoup. Fast and lightweight. Not suitable for JavaScript-heavy websites where links are dynamically generated.