5 Best Ways to Get HTML Source of WebElement in Selenium WebDriver Using Python

💡 Problem Formulation: When working with Selenium WebDriver in Python, developers may need to retrieve the HTML source of a particular WebElement. This could be crucial for tasks like web scraping, testing, or dynamic content analysis. For example, given a WebElement representing a section on a webpage, the desired output is the HTML markup that defines that section.

Method 1: Using the get_attribute Method

Retrieving the HTML source of a WebElement in Selenium can be done with the get_attribute method. In the Python bindings, this method returns the value of the named property or attribute, so passing “outerHTML” returns the element’s full markup, including its own opening and closing tags. This approach is self-contained, straightforward, and widely used.

Here’s an example:

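from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")
# The find_element_by_* helpers are gone from current Selenium 4 releases;
# use find_element with a By locator instead.
element = driver.find_element(By.ID, "example-id")
source_code = element.get_attribute("outerHTML")

print(source_code)
driver.quit()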

Output:

<div id="example-id">...</div>

This code snippet starts with importing the necessary modules and creating a WebDriver instance. After navigating to a webpage, it locates a WebElement using its ID. It retrieves the HTML source by calling get_attribute("outerHTML") on the WebElement and then gracefully closes the driver.
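When several elements are needed, the same call can be applied to each result of find_elements. The following is a minimal sketch; the helper name collect_outer_html is hypothetical, and it assumes only that each item exposes get_attribute, as Selenium WebElement objects do.

```python
def collect_outer_html(elements):
    """Return the outerHTML string of every element in the iterable.

    `elements` is assumed to be a list of Selenium WebElement objects,
    for example the result of driver.find_elements(By.CSS_SELECTOR, "p").
    """
    return [element.get_attribute("outerHTML") for element in elements]
```

Each get_attribute call is a separate request to the browser, so this is best suited to a small number of elements.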

Method 2: Using JavaScript Execution

Another technique to extract the HTML of a WebElement is by executing JavaScript code directly in the browser through Selenium. The JavaScript outerHTML property of the DOM element is used here. This method is especially useful if you need to run scripts or interact with page elements that are not directly accessible through standard Selenium methods.

Here’s an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")
element = driver.find_element(By.ID, "example-id")
# The WebElement is passed to the script and is available there as arguments[0]
source_code = driver.execute_script("return arguments[0].outerHTML;", element)

print(source_code)
driver.quit()

Output:

<div id="example-id">...</div>

In this code, the WebDriver executes a small piece of JavaScript code that returns the outerHTML of the targeted DOM element, referenced by arguments[0], which in this case is the WebElement found by its ID.
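One practical use of the JavaScript route is reading the markup of many elements in a single call. The sketch below assumes that a list of WebElement objects passed to execute_script arrives in the browser as a JavaScript array of DOM nodes; the helper name outer_html_batch is hypothetical.

```python
def outer_html_batch(driver, elements):
    """Fetch outerHTML for many elements with one execute_script call.

    `driver` is assumed to be a Selenium WebDriver and `elements` an
    iterable of WebElement objects that belong to the current page.
    """
    # arguments[0] is expected to be the array of DOM nodes on the browser side
    script = "return arguments[0].map(function (el) { return el.outerHTML; });"
    return driver.execute_script(script, list(elements))
```

A call such as outer_html_batch(driver, driver.find_elements(By.TAG_NAME, "p")) would then return one string per paragraph, in document order.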

Method 3: Inner HTML Retrieval

Sometimes, you might only be interested in the inner HTML content of an element, without its container tag. In such scenarios, calling get_attribute("innerHTML") comes in handy. This is a straightforward way to get the markup wrapped by an element.

Here’s an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")
element = driver.find_element(By.ID, "example-id")
# innerHTML excludes the element's own opening and closing tags
inner_html = element.get_attribute("innerHTML")

print(inner_html)
driver.quit()

Output:

<p>Some content inside the div element.</p>

This example retrieves only the content inside the selected WebElement, excluding the element’s own tags, and prints it to the console. This is useful for analyzing the content without the wrapping tag.
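To see the difference between the two values side by side, both can be read from the same element. This is a small sketch with a hypothetical helper name, html_views; it assumes the element exposes get_attribute as a Selenium WebElement does.

```python
def html_views(element):
    """Return (outer, inner) HTML for a WebElement-like object."""
    outer = element.get_attribute("outerHTML")
    inner = element.get_attribute("innerHTML")
    # innerHTML often carries the source indentation; strip it for display
    return outer, inner.strip()
```

The outer value includes the element's own tags, while the inner value contains only its children.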

Method 4: WebElement Screenshot

While not directly returning the HTML source, capturing a screenshot of a WebElement can be useful for visual inspection or testing. Selenium WebDriver provides a simple method to take a screenshot of a specific element rather than the whole page.

Here’s an example:

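from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://example.com")
element = driver.find_element(By.ID, "example-id")
# Saves a PNG of this element only, not the whole page
element.screenshot('element.png')

driver.quit()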

Output:

The file 'element.png' is created with the screenshot of the WebElement.

This snippet demonstrates how to capture and save a screenshot of a WebElement. The captured image can then be used for various purposes, including visual verification or keeping a record of the web page at a certain state.
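A screenshot is often more useful when stored next to the markup it depicts. The sketch below writes both to disk; the helper name save_element_snapshot and the file naming scheme are assumptions, and it uses the WebElement property screenshot_as_png, which returns the image as bytes.

```python
from pathlib import Path


def save_element_snapshot(element, directory, name):
    """Write <name>.png and <name>.html for one element; return both paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    png_path = directory / f"{name}.png"
    html_path = directory / f"{name}.html"
    # screenshot_as_png returns the image as bytes instead of writing a file
    png_path.write_bytes(element.screenshot_as_png)
    html_path.write_text(element.get_attribute("outerHTML"), encoding="utf-8")
    return png_path, html_path
```

Keeping the two files together makes it easier to compare what was rendered with the HTML that produced it.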

Bonus One-Liner Method 5: Using page_source and String Filtering

Obtain the entire page source and then filter for the component related to the WebElement. This is not as elegant or reliable as other methods, but it can be a quick one-liner if you do not need the precision the other methods provide.

Here’s an example:

source_code = driver.page_source[driver.page_source.find('<div id="example-id">'):driver.page_source.find('</div>', driver.page_source.find('<div id="example-id">'))+6]

Output:

<div id="example-id">...</div>

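The one-liner gets the entire page source and slices it from the opening tag to the first closing </div> that follows. It therefore requires that you know the exact markup of the opening tag, and it truncates the result if the element contains nested div elements. It also reads driver.page_source three times, and each read is a separate request to the browser.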
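If the page source is all you have, a parser copes with nested tags better than string slicing. The following sketch uses Python's standard html.parser module as a swapped-in technique, not the one-liner above; the names ElementByIdExtractor and extract_element_html are hypothetical, and page_source is assumed to be the string returned by driver.page_source.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementByIdExtractor(HTMLParser):
    """Collect the markup of the first element whose id matches."""

    def __init__(self, element_id):
        # Keep entities such as &amp; unconverted so the output mirrors the input.
        super().__init__(convert_charrefs=False)
        self.element_id = element_id
        self.depth = 0    # number of open, non-void tags captured so far
        self.parts = []   # captured markup fragments
        self.done = False

    def _capturing(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0 and dict(attrs).get("id") != self.element_id:
            return
        self.parts.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            self.done = self.depth == 0  # a void target is complete at once
        else:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record it without changing depth.
        if self.done:
            return
        if self.depth > 0 or dict(attrs).get("id") == self.element_id:
            self.parts.append(self.get_starttag_text())
            self.done = self.depth == 0

    def handle_endtag(self, tag):
        if not self._capturing() or tag in VOID_TAGS:
            return
        self.parts.append(f"</{tag}>")
        self.depth -= 1
        self.done = self.depth == 0

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self._capturing():
            self.parts.append(f"<!--{data}-->")


def extract_element_html(page_source, element_id):
    """Return the markup of the element with this id, or None if absent."""
    parser = ElementByIdExtractor(element_id)
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts) if parser.done else None
```

Called as extract_element_html(driver.page_source, "example-id"), this reads the page source once and keeps counting nested tags until the matching closing tag is reached. It is still a sketch: malformed markup with unbalanced tags may produce an incomplete result or None.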

Summary/Discussion

  • Method 1: get_attribute("outerHTML"). Strength: Simple and reliable way to get the full HTML of an element. Weakness: Returns the browser’s serialization of the live DOM, which can differ from the markup the server originally sent.
  • Method 2: JavaScript Execution. Strengths: Very powerful and flexible, it can handle complex scenarios. Weaknesses: Slightly more complex, and it inserts JavaScript execution into the automation script.
  • Method 3: Inner HTML Retrieval. Strength: Quick and easy way to get the content within an element. Weaknesses: Does not provide the element’s own tags, just the content inside.
  • Method 4: WebElement Screenshot. Strength: Provides a visual representation of the element. Weaknesses: Does not give you the actual HTML, and the file handling adds extra complexity.
  • Method 5: page_source and String Filtering. Strength: A quick one-liner if you’re after a non-precise extraction. Weakness: Very fragile and depends highly on the consistent structure of the HTML source.