Parsing XML Using BeautifulSoup In Python

Introduction

XML (eXtensible Markup Language) is a markup language used to store and transport data. It is structurally quite similar to HTML, but the two were designed to accomplish different goals.

  • XML is designed to transport data, while HTML is designed to display data. Many systems use incompatible data formats, which makes exchanging data between them a time-consuming task for web developers because large amounts of data have to be converted. There is also a chance that incompatible data is lost along the way. XML stores data in plain text, which provides a software- and hardware-independent way of storing and sharing data.
  • Another major difference is that HTML tags are predefined, whereas XML tags are not.

Example of XML:

<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Harry Potter</to>
  <from>Albus Dumbledore</from>
  <heading>Reminder</heading>
  <body>It does not do to dwell on dreams and forget to live!</body>
</note>

As mentioned earlier, XML tags are not pre-defined so we need to find the tag that holds the information that we want to extract. Thus there are two major aspects governing the parsing of XML files:

  1. Finding the required Tags.
  2. Extracting data from the Tags once they have been identified.

BeautifulSoup and LXML Installation

When it comes to web scraping with Python, BeautifulSoup is one of the most commonly used libraries. The recommended way of parsing XML files using BeautifulSoup is to use the lxml parser.

You can install both libraries using the pip installation tool (the packages are named beautifulsoup4 and lxml). Please have a look at our blog tutorial to learn how to install them if you want to scrape data from an XML file using BeautifulSoup.

Note: Before we proceed with our discussion, please have a look at the following XML file that we will be using throughout this article. (Please create a file with the name sample.xml and copy-paste the code given below to practice further.)

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Marsh Marigold</COMMON>
    <BOTANICAL>Caltha palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Sunny</LIGHT>
    <PRICE>$6.81</PRICE>
    <AVAILABILITY>051799</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Cowslip</COMMON>
    <BOTANICAL>Caltha palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.90</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>
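
The snippets in the following sections work on a soup object built from this data. As a minimal sketch of how such an object might be created, the code below parses an inline copy of the first record instead of reading the file from disk; the variable name xml_text is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the first <PLANT> record (illustrative; normally read from the file)
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
</CATALOG>"""

# "lxml" is the parser used throughout this article
soup = BeautifulSoup(xml_text, "lxml")

# With this parser, tag names are lowercased, so <COMMON> is reached as soup.common
print(soup.common.text)
```

Note that the tags are accessed in lowercase (soup.common rather than soup.COMMON), which matches the outputs shown later in the article.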

Searching The Required Tags in The XML Document

Since the tags are not pre-defined in XML, we must identify the tags we need and search for them using the methods provided by the BeautifulSoup library.

Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:

  1.  find()
  2.  find_all()

We have an entire blog tutorial on the two methods. Please have a look at the following tutorial to understand how these search methods work.

If you have read the above-mentioned article, then you can easily use the find and find_all methods to search for tags anywhere in the XML document.
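
To make the difference between the two methods concrete, here is a minimal sketch on a shortened inline copy of the catalog (xml_text and the looked-up tag name colour are illustrative assumptions). find() returns the first match or None, while find_all() returns every match.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the catalog, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <PRICE>$2.44</PRICE>
  </PLANT>
  <PLANT>
    <COMMON>Marsh Marigold</COMMON>
    <PRICE>$6.81</PRICE>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# find() returns only the first matching tag
first_common = soup.find("common")
print(first_common.text)

# find_all() returns a list-like collection of every matching tag
all_common = soup.find_all("common")
print([tag.text for tag in all_common])

# find() returns None when nothing matches, so check before using the result
missing = soup.find("colour")
print(missing)
```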

Relationship Between Tags

It is extremely important to understand the relationship between tags, especially while scraping data from XML documents.

The three key relationships in the XML parse tree are:

  • Parent: The tag that directly contains another tag; it serves as the reference tag for navigating to its child tags.
  • Children: The tags contained within the parent tag.
  • Siblings: As the name suggests these are the tags that exist on the same level of the parse tree.

Let us have a look at how we can navigate the XML parse tree using the above relationships.

Finding Parents

❖ The parent attribute allows us to find the parent/reference tag as shown in the example below.

Example: In the following code we will find the parent of the first common tag.

print(soup.common.parent.name)

Output:

plant

Note: The name attribute allows us to extract the name of the tag instead of extracting the entire content.
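
Beyond the immediate parent, Beautiful Soup also exposes a parents attribute that walks up through every ancestor. The following is a small sketch on a shortened inline record (xml_text and ancestors are illustrative names).

```python
from bs4 import BeautifulSoup

# Shortened inline record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <ZONE>4</ZONE>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# .parent is the immediate parent; .parents yields every ancestor in turn
ancestors = [tag.name for tag in soup.common.parents]
print(ancestors)
```

The list starts with plant and catalog. Any further entries come from the wrapper tags that the lxml parser adds around the document, followed by the soup object itself.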

Finding Children

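❖ The children attribute allows us to iterate over the direct children of a tag, as shown in the example below. The whitespace between tags is also returned as a child, and such strings have no name, which is why the code skips every child whose name is None.

Example: In the following code we will find the children of the first plant tag.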

for child in soup.plant.children:
    # Skip the whitespace strings between tags; only tags have a name
    if child.name is not None:
        print(child.name)

Output:

common
botanical
zone
light
price
availability
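
As an alternative sketch, the name check can be avoided by calling find_all() with recursive=False, which returns only the direct child tags and never the whitespace strings. The inline record and the names xml_text and child_names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the first record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# recursive=False limits the search to direct children of <plant>
child_tags = soup.plant.find_all(recursive=False)
child_names = [tag.name for tag in child_tags]
print(child_names)
```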

Finding Siblings

A tag can have siblings before and after it.

❖ The previous_siblings attribute returns the siblings before the referenced tag, and the next_siblings attribute returns the siblings after it. Both start with the nearest sibling, so the previous siblings are listed in reverse document order.

Example: The following code finds the previous and next sibling tags of the light tag of the XML document.

print("***Previous Siblings***")
for sibling in soup.light.previous_siblings:
    # Skip the whitespace strings between tags; only tags have a name
    if sibling.name is not None:
        print(sibling.name)

print("\n***Next Siblings***")
for sibling in soup.light.next_siblings:
    if sibling.name is not None:
        print(sibling.name)

Output:

***Previous Siblings***
zone
botanical
common

***Next Siblings***
price
availability
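
When only tags are of interest, the find_previous_sibling(), find_next_sibling() and find_next_siblings() methods skip the whitespace strings automatically. A minimal sketch on an inline copy of the first record follows (xml_text, light and later_names are illustrative names).

```python
from bs4 import BeautifulSoup

# Inline copy of the first record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")
light = soup.light

# Nearest sibling tag on each side of <light>
print(light.find_previous_sibling().name)
print(light.find_next_sibling().name)

# All sibling tags that follow <light>, in document order
later_names = [tag.name for tag in light.find_next_siblings()]
print(later_names)
```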

Extracting Data From Tags

By now, we know how to navigate and find data within tags. Let us have a look at the attributes that help us to extract data from the tags.

Text And String Attributes

To access the text values within tags, you can use the text or string attribute.

Example: Let us extract the text from every common tag using the text attribute, and from every botanical tag using the string attribute.

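# Collect every common and botanical tag in the document
plant_name = soup.find_all('common')
scientific_name = soup.find_all('botanical')

print('***PLANT NAME***')
for tag in plant_name:
    print(tag.text)
print('\n***BOTANICAL NAME***')
for tag in scientific_name:
    print(tag.string)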

Output:

***PLANT NAME***
Bloodroot
Marsh Marigold
Cowslip

***BOTANICAL NAME***
Sanguinaria canadensis
Caltha palustris
Caltha palustris
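
The two attributes give the same result for a tag that holds a single piece of text, but they differ for a tag with several children: string is None in that case, while text joins all the nested text. The sketch below illustrates this on a shortened inline record (xml_text and the other variable names are assumptions made for illustration).

```python
from bs4 import BeautifulSoup

# Shortened inline record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <PRICE>$2.44</PRICE>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# A tag with a single text child: string and text agree
price_string = soup.price.string
price_text = soup.price.text
print(price_string, price_text)

# A tag with several children: string is None, text concatenates everything
plant_string = soup.plant.string
print(plant_string)

# get_text() can strip whitespace and join the pieces with a separator
plant_text = soup.plant.get_text(" ", strip=True)
print(plant_text)
```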

The Contents Attribute

The contents attribute returns the direct children of a tag, that is, the child tags together with their data, as well as the whitespace strings between them. Because it returns a list, we can access its elements using their index.

Example:

print(soup.plant.contents)
# Accessing content using index
print()
print(soup.plant.contents[1])

Output:

['\n', <common>Bloodroot</common>, '\n', <botanical>Sanguinaria canadensis</botanical>, '\n', <zone>4</zone>, '\n', <light>Mostly Shady</light>, '\n', <price>$2.44</price>, '\n', <availability>031599</availability>, '\n']

<common>Bloodroot</common>
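
Since the list mixes tags with whitespace strings, one possible way to turn it into usable data is to keep only the Tag objects and build a dictionary from them. This is a sketch on a shortened inline record; xml_text and record are illustrative names.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <ZONE>4</ZONE>
    <PRICE>$2.44</PRICE>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# Keep only Tag objects, dropping the newline strings in contents
record = {
    child.name: child.text
    for child in soup.plant.contents
    if isinstance(child, Tag)
}
print(record)
```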

Pretty Printing The Beautiful Soup Object

If you observe closely, the tags have a rather messy appearance when we print them on the screen. While this does not directly affect productivity, a better-structured print style helps us to read and parse the document more effectively.

The following code shows how the output looks when we print the BeautifulSoup object normally:

print(soup)

Output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><html><body><catalog>
<plant>
<common>Bloodroot</common>
<botanical>Sanguinaria canadensis</botanical>
<zone>4</zone>
<light>Mostly Shady</light>
<price>$2.44</price>
<availability>031599</availability>
</plant>
<plant>
<common>Marsh Marigold</common>
<botanical>Caltha palustris</botanical>
<zone>4</zone>
<light>Mostly Sunny</light>
<price>$6.81</price>
<availability>051799</availability>
</plant>
<plant>
<common>Cowslip</common>
<botanical>Caltha palustris</botanical>
<zone>4</zone>
<light>Mostly Shady</light>
<price>$9.90</price>
<availability>030699</availability>
</plant>
</catalog>
</body></html>

Now let us use the prettify method to improve the appearance of our output.

print(soup.prettify())

Output:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
 <body>
  <catalog>
   <plant>
    <common>
     Bloodroot
    </common>
    <botanical>
     Sanguinaria canadensis
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Shady
    </light>
    <price>
     $2.44
    </price>
    <availability>
     031599
    </availability>
   </plant>
   <plant>
    <common>
     Marsh Marigold
    </common>
    <botanical>
     Caltha palustris
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Sunny
    </light>
    <price>
     $6.81
    </price>
    <availability>
     051799
    </availability>
   </plant>
   <plant>
    <common>
     Cowslip
    </common>
    <botanical>
     Caltha palustris
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Shady
    </light>
    <price>
     $9.90
    </price>
    <availability>
     030699
    </availability>
   </plant>
  </catalog>
 </body>
</html>
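
The html and body wrappers and the lowercase tag names in the output above appear because "lxml" is an HTML-oriented parser. As a hedged aside, Beautiful Soup also accepts the "xml" feature string, which selects lxml's XML parser. The sketch below assumes lxml is installed and uses a shortened inline copy of the data (xml_text and xml_soup are illustrative names). With this parser, tag names keep their original case and no HTML wrapper is added, so searches must use the uppercase names.

```python
from bs4 import BeautifulSoup

# Shortened inline record, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <PRICE>$2.44</PRICE>
  </PLANT>
</CATALOG>"""

# "xml" selects lxml's XML parser instead of its HTML parser
xml_soup = BeautifulSoup(xml_text, "xml")

# Tag names are case-sensitive and keep their original spelling
print(xml_soup.find("COMMON").text)
print(xml_soup.find("common"))

# No <html> or <body> wrapper is added around the document
print(xml_soup.find("html"))
```

The rest of this article keeps using the "lxml" parser, so its examples continue to use lowercase tag names.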

The Final Solution

We are now well versed in all the concepts required to extract data from a given XML document. It is now time to have a look at the final code, where we extract the Name, Botanical Name, and Price of each plant in our example XML document (sample.xml).

Please follow the comments along with the code given below to get an understanding of the logic used in the solution.

from bs4 import BeautifulSoup

# Open and read the XML file; the with block closes the file automatically
with open("sample.xml", "r", encoding="utf-8") as file:
    contents = file.read()

# Create the BeautifulSoup object using the lxml parser
soup = BeautifulSoup(contents, 'lxml')

# Extract the common, botanical and price tags
plant_name = soup.find_all('common')  # names of the plants
scientific_name = soup.find_all('botanical')  # scientific names of the plants
price = soup.find_all('price')  # prices of the plants

# enumerate keeps count of each iteration; the counter n is used to
# access the matching entries in the other two lists
for n, title in enumerate(plant_name):
    print("Plant Name:", title.text)
    print("Botanical Name: ", scientific_name[n].text)
    print("Price: ", price[n].text)
    print()

Output:

Plant Name: Bloodroot
Botanical Name:  Sanguinaria canadensis
Price:  $2.44

Plant Name: Marsh Marigold
Botanical Name:  Caltha palustris
Price:  $6.81

Plant Name: Cowslip
Botanical Name:  Caltha palustris
Price:  $9.90
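
The solution above relies on the three lists staying aligned, which holds only if every plant has all three tags. As an alternative sketch, not the approach used above, we can loop over each plant tag and search inside it, which keeps the values of one record together. The inline data and the names xml_text and records are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the catalog, used only for this sketch
xml_text = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <PRICE>$2.44</PRICE>
  </PLANT>
  <PLANT>
    <COMMON>Marsh Marigold</COMMON>
    <BOTANICAL>Caltha palustris</BOTANICAL>
    <PRICE>$6.81</PRICE>
  </PLANT>
</CATALOG>"""

soup = BeautifulSoup(xml_text, "lxml")

# Build one dictionary per <plant>, searching only inside that record
records = []
for plant in soup.find_all("plant"):
    records.append({
        "name": plant.find("common").text,
        "botanical": plant.find("botanical").text,
        "price": plant.find("price").text,
    })

for record in records:
    print(record["name"], "|", record["botanical"], "|", record["price"])
```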

Conclusion

XML documents are an important means of transporting data, and hopefully after reading this article you are well equipped to extract the data you want from them. You may also want to have a look at this video series, where you can learn how to scrape webpages.
