5 Best Ways to Save a Web Page with Python Selenium

💡 Problem Formulation: Web scrapers and automation engineers often face the requirement to save a complete web page for further analysis or archival. Python’s Selenium WebDriver provides several methods to accomplish this, facilitating tasks ranging from testing to data scraping. This article explains how to save a web page’s entire content, including HTML, CSS, and JavaScript-generated data, as you would manually from a browser.

Method 1: Saving Page Source to a File

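This method extracts the page source (the HTML of the current DOM) and saves it to a local file, using the page_source attribute of the Selenium WebDriver instance. Because the attribute is read after the browser has loaded the page, it includes markup that JavaScript has added up to that point; content that loads later is missed unless you wait for it. Images, stylesheets, and scripts are referenced in the saved HTML but are not downloaded.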

Here’s an example:

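from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')

# Write with an explicit encoding so non-ASCII characters survive on every platform
with open('page.html', 'w', encoding='utf-8') as file:
    file.write(driver.page_source)

driver.quit()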

The output will be the HTML content of ‘https://example.com’ saved as ‘page.html’.

The code launches the Chrome browser, navigates to the URL, writes the page source to a file named ‘page.html’, and then closes the browser. It’s a simple way to capture the HTML of a page as the browser sees it at that moment.
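Because page_source reflects the DOM at the moment it is read, it can help to wait until the document has finished loading before writing the file. The following is a minimal sketch rather than part of the original example: save_page_source and its timeout and poll parameters are hypothetical names, and the helper assumes only that the driver exposes execute_script and page_source, as Selenium WebDriver instances do.

```python
import time


def save_page_source(driver, path, timeout=10.0, poll=0.25):
    """Wait for document.readyState == 'complete', then save the HTML.

    `driver` is assumed to be a Selenium WebDriver (any object exposing
    `execute_script` and `page_source` works). Returns the number of
    characters written.
    """
    deadline = time.monotonic() + timeout
    while driver.execute_script("return document.readyState") != "complete":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page not fully loaded after {timeout} s")
        time.sleep(poll)

    html = driver.page_source
    # An explicit encoding avoids platform-dependent defaults such as cp1252.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return len(html)
```

With a real browser this would be called as save_page_source(driver, 'page.html') after driver.get(...). Note that readyState only signals that the initial load has finished; content fetched later by scripts may still need an explicit wait for a specific element.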

Method 2: Taking a Screenshot

Taking a screenshot of a webpage can be particularly useful for capturing the rendered state, which includes images, CSS-styled elements, and in some cases even dynamic content. Selenium’s get_screenshot_as_file method is utilized for this purpose.

Here’s an example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')

driver.get_screenshot_as_file('screenshot.png')
driver.quit()

The output is a ‘screenshot.png’ file showing the part of the page that is visible in the browser window.

This code starts Chrome, navigates to the specified URL, and takes a screenshot of the current viewport, saving it as ‘screenshot.png’. It closes the browser afterward. The method captures the page as users see it, including all visual elements, but anything below the visible area is not included unless the window is tall enough to show it.
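Since get_screenshot_as_file only covers the visible viewport, one common workaround is to resize the window to the document's scroll dimensions before capturing. The sketch below illustrates that idea under stated assumptions: save_full_page_screenshot and its min_width and min_height parameters are hypothetical names, and the approach generally needs a headless browser, because a visible window cannot grow beyond the screen.

```python
def save_full_page_screenshot(driver, path, min_width=800, min_height=600):
    """Resize the window to the page's scroll size, then take a screenshot.

    `driver` is assumed to be a Selenium WebDriver running headless.
    Returns the (width, height) the window was resized to.
    """
    width = driver.execute_script(
        "return Math.max(document.documentElement.scrollWidth,"
        " document.body.scrollWidth);"
    )
    height = driver.execute_script(
        "return Math.max(document.documentElement.scrollHeight,"
        " document.body.scrollHeight);"
    )
    # Never shrink below a sensible minimum, e.g. for nearly empty pages.
    width = max(int(width), min_width)
    height = max(int(height), min_height)

    driver.set_window_size(width, height)
    driver.get_screenshot_as_file(path)
    return width, height
```

set_window_size sets the outer window size, so the captured area is an approximation of the full page. With Firefox, Selenium 4 also offers get_full_page_screenshot_as_file on the Firefox driver, which avoids the resize step.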

Method 3: Downloading Resources

In some scenarios, you may want to download the resources referenced by the web page, such as images, CSS, and JavaScript files. Selenium can be combined with the Requests and BeautifulSoup libraries to accomplish this: BeautifulSoup parses the HTML, and Requests downloads each resource.

Here’s an example:

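from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin, urlparse

driver = webdriver.Chrome()
driver.get('https://example.com')

soup = BeautifulSoup(driver.page_source, 'html.parser')
base_url = driver.current_url  # used to resolve relative links
driver.quit()

for resource in soup.find_all(['img', 'link', 'script']):
    # img and script tags use src; link tags use href
    ref = resource.get('src') or resource.get('href')
    if not ref:
        continue

    url = urljoin(base_url, ref)  # turn relative references into absolute URLs
    filename = os.path.basename(urlparse(url).path)
    if not url.startswith(('http://', 'https://')) or not filename:
        continue  # skip data: URIs and URLs without a file name

    response = requests.get(url, timeout=10)
    if response.ok:
        with open(filename, 'wb') as file:
            file.write(response.content)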

The output is a collection of resource files saved locally.

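This code uses BeautifulSoup to parse the page source and find the img, link, and script tags. It then iterates through them, downloads each referenced file, and saves it under its original filename. Files that share a name overwrite each other, and assets referenced only from CSS or loaded by JavaScript are not found, so treat the result as a starting point for an offline copy rather than a complete one.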
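Resource references in HTML are often relative ('img/logo.png'), protocol-relative ('//cdn.example.net/lib.css'), or inline data: URIs, and none of these can be passed to requests.get as they are. The following sketch shows one way to normalize them with the standard library; resolve_resource is a hypothetical helper name, and the 'index.html' fallback is an assumption made for illustration.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlparse


def resolve_resource(page_url, reference):
    """Return (absolute_url, local_filename) for a src/href value.

    Returns None for references that cannot be downloaded over HTTP(S),
    such as data: or javascript: URIs.
    """
    absolute = urljoin(page_url, reference)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None

    # Use only the path for the file name so query strings are dropped.
    name = posixpath.basename(unquote(parsed.path)) or "index.html"
    return absolute, name


print(resolve_resource("https://example.com/blog/post", "img/logo.png"))
print(resolve_resource("https://example.com/blog/post", "/static/app.js?v=3"))
```

In the download loop, the returned URL would be passed to requests.get and the returned name to open.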

Method 4: Using Browser’s Save Function

Emulating a browser’s ‘Save As’ functionality can be done by sending the appropriate key commands to the browser via Selenium. This method takes advantage of browser capabilities but may be platform-specific.

Here’s an example:

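from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://example.com')

# find_element_by_tag_name was removed in Selenium 4; use find_element with By.
# On macOS the shortcut is Keys.COMMAND + 's'.
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 's')
time.sleep(2)  # Wait for the Save As dialog to open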

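The intended result is the browser’s ‘Save As’ dialog being triggered. Whether it appears depends on the browser and driver: WebDriver delivers the keys to the page content, and some combinations do not pass browser-level shortcuts on to the browser itself.

This code opens a page and sends the Ctrl+S keyboard command, normally used to save web pages manually, to the page’s body element. It includes a delay to give the dialog time to appear. The dialog is a native operating-system window, so automating further interaction with it is beyond Selenium’s capabilities and requires additional tools or user interaction.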
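Where the native dialog cannot be automated, a different technique can produce a similar single-file result on Chromium-based browsers: the Chrome DevTools Protocol command Page.captureSnapshot returns the page and its resources as MHTML. This is a substitute for the Save As approach, not an automation of it. In the sketch below, save_as_mhtml is a hypothetical helper name, and the driver is assumed to be Chrome or Edge, since execute_cdp_cmd is only available on Chromium drivers.

```python
def save_as_mhtml(driver, path):
    """Save the current page, with its resources, as one MHTML file.

    Assumes a Chromium-based Selenium driver (Chrome or Edge).
    Returns the number of characters written.
    """
    snapshot = driver.execute_cdp_cmd(
        "Page.captureSnapshot", {"format": "mhtml"}
    )
    data = snapshot["data"]
    # newline="" keeps the CRLF line endings of the MHTML data untouched.
    with open(path, "w", encoding="utf-8", newline="") as fh:
        fh.write(data)
    return len(data)
```

A call such as save_as_mhtml(driver, 'page.mhtml') after driver.get(...) would write a file that Chromium-based browsers can open offline. Page.captureSnapshot is marked experimental in the protocol, so its behavior may change between browser versions.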

Bonus One-Liner Method 5: Save with Page Source and Requests

A quick one-liner can save a page’s HTML without Selenium WebDriver by using the Python Requests library. This is fast, but it returns only the HTML the server sends, so JavaScript-rendered content is missing.

Here’s an example:

import requests, pathlib; pathlib.Path('page.html').write_text(requests.get('https://example.com', timeout=10).text, encoding='utf-8')

The output is the HTML content of ‘https://example.com’ saved as ‘page.html’.

This one-liner sends a GET request to the provided URL and writes the response’s text content to ‘page.html’. It’s the most straightforward method but doesn’t render JavaScript content as Selenium does.
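The one-liner saves whatever the server returns, including error pages, and re-encodes the decoded text. For anything beyond a quick experiment, a slightly longer version may be preferable. The sketch below checks the HTTP status and writes the raw response bytes; download_html is a hypothetical helper name, and its session argument is assumed to be a requests.Session or the requests module itself, both of which provide get(url, timeout=...).

```python
def download_html(session, url, path, timeout=10):
    """Fetch `url` and save the raw response body to `path`.

    `session` is assumed to be a requests.Session (or the `requests`
    module). Raises the HTTP error from `raise_for_status` instead of
    saving an error page. Returns the number of bytes written.
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()

    # Writing bytes preserves the server's original encoding exactly.
    with open(path, "wb") as fh:
        fh.write(response.content)
    return len(response.content)
```

With Requests installed, download_html(requests, 'https://example.com', 'page.html') would behave like the one-liner for successful responses and raise an exception for 4xx and 5xx status codes.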

Summary/Discussion

  • Method 1: Saving Page Source. Strengths: Simple; captures the DOM as it exists when page_source is read. Weaknesses: Misses content that loads later unless you wait for it, and does not save images, CSS, or scripts.
  • Method 2: Taking a Screenshot. Strengths: Captures visual representation, including dynamic content. Weaknesses: Results in an image file without selectable/searchable text, and covers only the visible viewport by default.
  • Method 3: Downloading Resources. Strengths: Downloads all web page assets for offline use. Weaknesses: Can be complex and resource-heavy.
  • Method 4: Using Browser’s Save Function. Strengths: Mimics manual save process. Weaknesses: Limited by browser and OS, may require additional user interaction or tools.
  • Bonus Method 5: Save with Page Source and Requests. Strengths: Quick and easy for simple pages. Weaknesses: Ineffective for JavaScript-heavy pages, lacks Selenium’s advanced capabilities.