HTML Parsing using Python and LXML

In this article, you’ll learn the basics of parsing an HTML document using Python and the LXML library.

Introduction

Data is the most important ingredient in programming. It comes in all shapes and forms. Sometimes it is placed inside documents such as CSV or JSON, but sometimes it is stored on the internet or in databases. Some of it is stored/transferred or processed through the XML format, which is in many ways similar to HTML format, yet its purpose is to transfer and store data, unlike HTML, whose main purpose is to display the data. On top of that, the way of writing HTML and XML is similar. Despite the differences and similarities, they supplement each other very well.

Both Xpath and XML are engineered by the same company W3C, which imposes that Xpath is the most compatible Python module to be used for parsing the XML documents. Since one of the programing principals which would push you towards the programming success is to “not reinvent the wheel”, we are going to refer to the W3C (https://www.w3.org/) consortium document and sources in regarding the syntax and operators on our examples to bring the concept of XPath closer to the people wishing to understand it better and use it on real-life problems. 

The IT industry has accepted the XML way of transferring data as one of its principles. Imagine if one of your tasks was to gather information from the internet? Copying and pasting are one of the simplest tools to use (as it is regularly used by programmers as well); it might only lead us to gather some simple data from the web, although the process might get painfully repetitive. Yet, in case if we have more robust data, or more web pages to gather the data from, we might be inclined to use more advanced Python packages to automate our data gathering.

Before we start looking into scraping tools and strategies, it is good to know that scraping might not be legal in all cases, therefore it is highly suggested that we look at the terms of service of a particular web site, or copyright law regarding the region in which the web site operates.

For purposes of harvesting the web data, we will be using several Python libraries that allow us to do just that. The first of them is the requests module. What it does is that it sends the HTTP requests, which returns us the response object. It only used if urge to scrape the content from the internet. If we try to parse the static XML file it would not be necessary.

There are many parsing modules. LXML, Scrapy and BeautifulSoup are some of them. To tell which one is better is often neglected since their size and functionality differs from one another. For example, BeautifulSoup is more complex and serves you with more functionality, but LXML and Scrapy comes lightweight and can help you traversing through the documents using XPath and CSS selectors.

There are certain pitfalls when trying to travel through the document using XPath. Common mistake when trying to parse the XML by using XPath notation is that many people try to use the BeautifulSoup library. In fact that is not possible since it does not contain the XPath traversing methods. For those purposes we shall use the LXML library.

The requests library is used in case we want to download a HTML mark-up from the particular web site.

The first step would be to install the necessary packages. Trough pip install notation all of the modules above could be installed rather easily.

Necessary steps:

  1. pip install lxml (xpath module is a part of lxml library)
  2. pip install requests (in case the content is on a web page)

The best way to explain the XML parsing is to picture it through the examples.

The first step would be to install the necessary modules. Trough pip install notation all of the modules above could be installed rather easily.

What is the XPath?

The structure of XML and HTML documents is structurally composed of the nodes (or knots of some sort), which is a broader picture that represents the family tree-like structure. The roof instance, or the original ancestor in each tree, is called the root node, and it has no superior nodes to itself. Subordinate nodes are in that sense respectively called children or siblings, which are the elements at the same level as the children. The other terms used in navigating and traversing trough the tree are the ancestors and descendants, which in essence reflect the node relationship the same way we reflect it in real-world family tree examples. 

XPath is a query language that helps us navigate and select the node elements within a node tree. In essence, it is a step map that we need to make to reach certain elements in the tree.  The single parts of this step map are called the location steps, and each of these steps would lead us to a certain part of the document.

The terminology used for orientation along the axis (with regards to the current node) is very intuitive since it uses regular English expressions related to real-life family tree relationships.

XPath Selector

XPath selector is the condition using which we could navigate through an XML document. It describes relationships as a hierarchical order of the instances included in our path.  By combining different segment of XML syntax it helps us traverse through to the desired parts of the document. The selector is a part of the XPath query language. By simply adding different criteria, the XPath selector would lead us to different elements in the document tree.  The best way to learn the XPath selector syntax and operators is to implement it on an example. In order to know how to configure the XPath selector, it is essential to know the XPath syntax. XPath selector is compiled using an etree or HTML module which is included within the LXML package. The difference is only if we are parsing the XML document or HTML.

The selector works similarly as a find method with where it allows you to select a relative path of the element rather than the absolute one, which makes the whole traversing less prone to errors in case the absolute path gets too complicated.

XPath Syntax

XPath syntax could be divided into several groups. To have an exact grasp of the material presented we are going to apply further listed expressions and functions on our sample document, which would be listed below. In this learning session, we are going to use a web site dedicated to scraping exercises.

Node selection:

ExpressionDescription
nodenameSelects all nodes with the name “nodename
/Selects from the root node
//Selects nodes in the document from the current node that match the selection no matter where they are.
.Selects the current node
..Selects the parent of the current node
@Selects attributes

Using “..”  and “.” we can direct and switch levels as we desire. Two dot notations would lead us from wherever we are to our parent element, whereas the one dot notations would point us to the current node. 

The way that we travel from the “context node” (our reference node), which is the milestone of our search, is called “axes”, and it is noted with double slash //. What it does is that it starts traversing from the first instance of the given node. This way of path selection is called the “relative path selection”. To be certain that the // (empty tag) expression would work, it must precede an asterisk (*) or the name tag. Trough inspecting the element and copying its XPath value we are getting the absolute path.

XPath Functions and Operators

here are 6 common operators which are used inside the XPath query. Operators are noted the same way as in plain Python and serve the same purpose. The functions are meant to aid the search of desired elements or their content.

Path ExpressionResult
=Equal to
!=Not equal to
Is greater than
Is less than
=>Is greater or equal to
=<Is less or equal to

To add more functionality to our XPath expression we can use some of LXML library functions. Everything that is written in-between the “[]” is called a predicate and it is used to closer describe the search path. Most frequently used functions are contains() and starts-with(). Those functions and their results would be displayed in the table below.

Going Up and Down the Axis

The conventional syntax used to traverse up and down the XPath axes is ElementName::axis.

To reach the elements placed above or below our current axes, we might use some of the following axes.

Up the axesExamples 
ancestor//ul/li/a[contains(@href, 'categ')]/ancestor::node() 
parent//ul/li/a[contains(@href, 'categ')]/parent::node() 
preceding//ul/li/a[contains(@href, 'categ')]/preceding::div 
preceding-sibling//a[contains(@href, 'categ')]/preceding-sibling::* 
Down the axesExamples 
descendant//a[starts-with(@href, 'catalogue')]/descendant::* 
following/html/body/div/div/ul/li[1]/a 
following-sibling/html/body/div/div/ul/li[1]/a/following::li 
child//div/div/section/div[2]/ol/li[1]/article/h3/child::* 

A Simple Example

The goal of this scraping exercise is to scrape all the book genres placed at the left-hand side of the web site. It almost necessary to see the page source and to inspect some of the elements which are we aiming to scrape.

from lxml import html
import requests

url = 'http://books.toscrape.com/'

# downloading the web page by making a request objec
res = requests.get(url)

# making a tree object
tree = html.fromstring(res.text)

# navingating the tree object using the XPath
book_genres = tree.xpath("//ul/li/a[contains(@href, 'categ')]/text()")[0:60]

# since the result is the list object, we iterate the elements,
# of the list by making a simple for loop
for book_genre in book_genres:
    print (book_genre.strip())

Resources:

  1. https://lxml.de/
  2. https://scrapinghub.github.io/xpath-playground/
  3. https://2.python-requests.org/en/master/
  4. ‘http://books.toscrape.com/