5 Best Ways to Access HTML Source Code Using Python Selenium

💡 Problem Formulation: Python developers often need to retrieve the HTML source of a web page for scraping, testing, or automation. Selenium’s Python bindings make this straightforward. Suppose a developer wants to extract the HTML source of a web page: given a URL, the output should be the HTML that the browser rendered.

Method 1: Using page_source Attribute

One of the simplest methods to access the source code of a page in Selenium is via the page_source attribute of the WebDriver object. This attribute contains the entire source code of the current page as a string, exactly as seen by the user in the browser.

Here’s an example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com')
source_code = driver.page_source
print(source_code)
driver.quit()

HTML source code of ‘https://www.example.com’

The snippet first instantiates a Chrome WebDriver object. The get() method navigates to the specified web page, the page_source attribute returns its source as a string, and the script prints that string before quit() closes the browser.
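If get() raises partway through, the plain script above never reaches quit() and the browser process can linger. Below is a minimal sketch of one way to guard against that, assuming Chrome and a recent Selenium 4 are installed. The helper names fetch_page_source and make_headless_chrome are hypothetical, chosen only for illustration, and the "--headless=new" flag assumes a reasonably current Chrome build.

```python
def make_headless_chrome():
    """Build a headless Chrome driver (hypothetical helper; needs selenium and Chrome)."""
    # Imported lazily so fetch_page_source() can be exercised without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Assumption: a recent Chrome; older builds may expect "--headless" instead.
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def fetch_page_source(url, make_driver=make_headless_chrome):
    """Open url in a fresh driver and return its page source as a string."""
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs even if get() raises, so no browser process is left behind.
        driver.quit()
```

Calling fetch_page_source('https://www.example.com') would then return the same string as the script above, with the browser closed either way. Passing the driver factory as a parameter is only a design convenience: it lets you swap in Firefox, or a stub object in tests.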

Method 2: Using execute_script()

Selenium can run JavaScript directly in the context of the browser through the execute_script() method. This is particularly useful when you need HTML that JavaScript generated after the page loaded, or the markup of one specific node, because the script reads from the live DOM.

Here’s an example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com')
source_code = driver.execute_script('return document.documentElement.outerHTML;')
print(source_code)
driver.quit()

Entire outer HTML of ‘https://www.example.com’

The provided snippet executes JavaScript to return the outer HTML of the <html> element, effectively grabbing the full page’s HTML. The returned result is printed to the console before safely closing the web driver session.
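Because the script reads the DOM at the instant it runs, it can help to wait until the document has finished loading before grabbing the markup. The following is a rough sketch of that idea. The function name rendered_html and its timeout and poll parameters are assumptions made for this example, and the hand-written polling loop stands in for Selenium’s own WebDriverWait so the logic is visible.

```python
import time


def rendered_html(driver, timeout=10.0, poll=0.25):
    """Wait for document.readyState to be 'complete', then return the <html> outerHTML."""
    deadline = time.monotonic() + timeout
    while driver.execute_script("return document.readyState;") != "complete":
        if time.monotonic() >= deadline:
            raise TimeoutError("document did not finish loading in time")
        time.sleep(poll)
    # outerHTML of the root element covers the page but omits the doctype line.
    return driver.execute_script("return document.documentElement.outerHTML;")
```

Note that a readyState of 'complete' only says the initial load has finished. Content fetched later by the page’s own scripts may still need an explicit wait on a specific element.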

Method 3: Using get_attribute() Method

To get the source of a specific element rather than the whole page, locate the element and call its get_attribute() method with 'outerHTML'. Use this when you’re interested in one part of the page instead of the entire page source.

Here’s an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.example.com')
# Selenium 4.3 removed find_element_by_tag_name(); pass a By locator instead
elem = driver.find_element(By.TAG_NAME, 'body')
body_html = elem.get_attribute('outerHTML')
print(body_html)
driver.quit()

Outer HTML of the <body> tag from ‘https://www.example.com’

Here, the WebDriver locates a web element with find_element(By.TAG_NAME, 'body'), pinpointing the body of the document. Calling get_attribute('outerHTML') then returns the HTML of the body element, including its own opening and closing tags.
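The same idea extends to several elements at once. As a small sketch, the helper below collects the outerHTML of every element that matches a CSS selector. The name all_outer_html is hypothetical, and the literal "css selector" is used in place of importing By.CSS_SELECTOR, which holds that same string.

```python
def all_outer_html(driver, css_selector):
    """Return the outerHTML of every element matching css_selector, in document order."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", css_selector)
    return [element.get_attribute("outerHTML") for element in elements]
```

With a live driver, all_outer_html(driver, 'p') would return one string per paragraph on the page, and an empty list when nothing matches.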

Method 4: Reading InnerHTML Property

The innerHTML property of a DOM element gives the HTML or XML markup inside the element. By accessing an element’s innerHTML in Selenium, users can scrape contents that are nested within DOM elements.

Here’s an example:

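from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.example.com')
# 'content' is a placeholder id; use one that exists on the page you load
elem = driver.find_element(By.ID, 'content')
content_html = elem.get_attribute('innerHTML')
print(content_html)
driver.quit()

InnerHTML of the element with id ‘content’ from ‘https://www.example.com’

This code retrieves the innerHTML of an element identified by its ID. After navigating to the URL, it locates the element with find_element(By.ID, 'content') and reads its inner content with get_attribute('innerHTML'), which returns the markup between the element’s tags but not the tags themselves. If no element has that ID, find_element() raises NoSuchElementException.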
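When the target element may be absent, one option is to look it up with find_elements(), which returns an empty list instead of raising. Here is a minimal sketch under that assumption. The helper name inner_html_or_none is made up for this example, and the literal "id" is the string value of By.ID.

```python
def inner_html_or_none(driver, element_id):
    """Return the innerHTML of the element with this id, or None if it is absent."""
    # "id" is the string value of selenium's By.ID constant.
    matches = driver.find_elements("id", element_id)
    if not matches:
        return None
    return matches[0].get_attribute("innerHTML")
```

This keeps the scraping loop free of try/except blocks at the cost of a None check at the call site.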

Bonus One-Liner Method 5: Quick Page Source Access

For fast retrieval of a page’s source with minimal setup, the whole flow can be squeezed into a single with statement. The WebDriver still has to be instantiated, but the context manager closes the browser for you.

Here’s an example:

from selenium import webdriver
with webdriver.Chrome() as driver: driver.get('https://www.example.com'); print(driver.page_source)

HTML source code of ‘https://www.example.com’

This one-liner starts Chrome, navigates to the web page, and prints the page source; leaving the with block quits the browser. Chaining .page_source directly onto .get() does not work, because get() returns None. The browser is not headless here: that requires passing an options object, which no longer fits comfortably on one line.

Summary/Discussion

  • Method 1: page_source Attribute. Straightforward and direct. Returns a snapshot taken at the moment of the call, so content that scripts add later is missed unless you wait for it first.
  • Method 2: execute_script(). Runs JavaScript against the live DOM, so it can return dynamically generated HTML. Requires some knowledge of JavaScript and may be more complex.
  • Method 3: get_attribute() for Specific Elements. Useful for targeting specific page segments. Not suited for full page source acquisition.
  • Method 4: Reading InnerHTML Property. Ideal for scraping nested content within an element. Limited to the chosen element’s inner contents.
  • Bonus Method 5: Quick Page Source Access. Extremely concise. Not recommended for complex use cases, since a one-liner leaves no room for driver options, waits, or error handling.