Introduction
XML (eXtensible Markup Language) is a markup language used to store and transport data. XML looks quite similar to HTML and the two share a similar structure, but they were designed to accomplish different goals.
- XML is designed to transport data, while HTML is designed to display data. Many systems store data in incompatible formats, which makes exchanging data between incompatible systems a time-consuming task for web developers, as large amounts of data have to be converted and there is always a risk that incompatible data is lost. XML, on the other hand, stores data in plain text, providing a software- and hardware-independent method of storing and sharing data.
- Another major difference is that HTML tags are predefined whereas XML tags are not.
❖ Example of XML:
<?xml version="1.0" encoding="UTF-8"?>
<note>
  <to>Harry Potter</to>
  <from>Albus Dumbledore</from>
  <heading>Reminder</heading>
  <body>It does not do to dwell on dreams and forget to live!</body>
</note>
As mentioned earlier, XML tags are not pre-defined so we need to find the tag that holds the information that we want to extract. Thus there are two major aspects governing the parsing of XML files:
- Finding the required tags.
- Extracting data after identifying the tags.
BeautifulSoup and LXML Installation
When it comes to web scraping with Python, BeautifulSoup is the most commonly used library. The recommended way of parsing XML files with BeautifulSoup is to use Python’s lxml parser.
You can install both libraries using the pip installation tool. Please have a look at our blog tutorial to learn how to install them if you want to scrape data from an XML file using BeautifulSoup.
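If you just need the commands, both libraries can usually be installed with pip (assuming pip is set up for your Python environment):

pip install beautifulsoup4
pip install lxml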
Note: Before we proceed with our discussion, please have a look at the following XML file that we will be using throughout the course of this article. (Please create a file with the name sample.xml and copy-paste the code given below to practice further.)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Marsh Marigold</COMMON>
    <BOTANICAL>Caltha palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Sunny</LIGHT>
    <PRICE>$6.81</PRICE>
    <AVAILABILITY>051799</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Cowslip</COMMON>
    <BOTANICAL>Caltha palustris</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.90</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>
Searching The Required Tags in The XML Document
Since the tags are not pre-defined in XML, we must identify them and search for them using the different methods provided by the BeautifulSoup library. Now, how do we find the right tags? We can do so with the help of BeautifulSoup's search methods.
Beautiful Soup has numerous methods for searching a parse tree. The two most popular and commonly used methods are:
- find()
- find_all()
We have an entire blog tutorial on the two methods. Please have a look at the following tutorial to understand how these search methods work.
If you have read the above-mentioned article, then you can easily use the find and find_all methods to search for tags anywhere in the XML document.
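As a quick, minimal sketch of how the two methods behave on our sample.xml file (using the same BeautifulSoup and lxml setup as in the final solution below):

from bs4 import BeautifulSoup

# Parse sample.xml with the lxml parser (same setup as in the final solution)
with open("sample.xml", "r") as file:
    soup = BeautifulSoup(file.read(), 'lxml')

# find() returns only the first matching tag
print(soup.find('common'))
# <common>Bloodroot</common>

# find_all() returns all matching tags as a list-like ResultSet
print(soup.find_all('common'))
# [<common>Bloodroot</common>, <common>Marsh Marigold</common>, <common>Cowslip</common>]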
Relationship Between Tags
It is extremely important to understand the relationship between tags, especially while scraping data from XML documents.
The three key relationships in the XML parse tree are:
- Parent: The tag that encloses the current tag; it is used as the reference tag for navigating to child tags.
- Children: The tags contained within the parent tag.
- Siblings: As the name suggests, these are the tags that exist on the same level of the parse tree.
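To make these relationships concrete, here is one plant record from sample.xml with the relationships marked in XML comments (the comments are for illustration only and are not part of the file):

<PLANT>                                          <!-- parent of the tags below -->
  <COMMON>Bloodroot</COMMON>                     <!-- child of PLANT -->
  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>  <!-- child of PLANT -->
  <ZONE>4</ZONE>                                 <!-- sibling of LIGHT, PRICE, ... -->
  <LIGHT>Mostly Shady</LIGHT>
  <PRICE>$2.44</PRICE>
  <AVAILABILITY>031599</AVAILABILITY>
</PLANT>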
Let us have a look at how we can navigate the XML parse tree using the above relationships.
Finding Parents
❖ The parent attribute allows us to find the parent/reference tag as shown in the example below.
Example: In the following code we will find the parent of the common tag.
print(soup.common.parent.name)
Output:
plant
Note: The name attribute allows us to extract the name of the tag instead of extracting the entire content.
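For example (a small sketch, assuming the same soup object as above), printing the parent without name returns the whole tag, while name returns just its name:

# Without .name, we get the entire parent tag along with all of its children
print(soup.common.parent)

# With .name, we only get the tag's name
print(soup.common.parent.name)   # plant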
Finding Children
❖ The children attribute allows us to find the child tags, as shown in the example below.
Example: In the following code we will find the children of the plant tag.
for child in soup.plant.children:
    if child.name is not None:
        print(child.name)
Output:
common
botanical
zone
light
price
availability
Finding Siblings
A tag can have siblings before and after it.
❖ The previous_siblings attribute returns the siblings before the referenced tag, and the next_siblings attribute returns the siblings after it.
Example: The following code finds the previous and next sibling tags of the light tag of the XML document.
print("***Previous Siblings***") for sibling in soup.light.previous_siblings: if sibling.name == None: pass else: print(sibling.name) print("\n***Next Siblings***") for sibling in soup.light.next_siblings: if sibling.name == None: pass else: print(sibling.name)
Output:
***Previous Siblings***
zone
botanical
common

***Next Siblings***
price
availability
Extracting Data From Tags
By now, we know how to navigate and find data within tags. Let us have a look at the attributes that help us to extract data from the tags.
Text And String Attributes
To access the text values within tags, you can use the text or string attributes.
Example: Let us extract the text from the common and botanical tags using the text and string attributes.
# Collect the tags first (as in the final solution below)
plant_name = soup.find_all('common')
scientific_name = soup.find_all('botanical')

print('***PLANT NAME***')
for tag in plant_name:
    print(tag.text)

print('\n***BOTANICAL NAME***')
for tag in scientific_name:
    print(tag.string)
Output:
***PLANT NAME***
Bloodroot
Marsh Marigold
Cowslip

***BOTANICAL NAME***
Sanguinaria canadensis
Caltha palustris
Caltha palustris
The Contents Attribute
The contents attribute allows us to extract the entire content of a tag, that is, the child tags along with their data. The contents attribute returns a list, therefore we can access its elements using their index.
Example:
print(soup.plant.contents)

# Accessing content using index
print()
print(soup.plant.contents[1])
Output:
['\n', <common>Bloodroot</common>, '\n', <botanical>Sanguinaria canadensis</botanical>, '\n', <zone>4</zone>, '\n', <light>Mostly Shady</light>, '\n', <price>$2.44</price>, '\n', <availability>031599</availability>, '\n']

<common>Bloodroot</common>
Pretty Printing The Beautiful Soup Object
If you observe closely, when we print the tags on the screen they have a rather messy appearance. While this may not directly affect productivity, a better structured print style helps us inspect the document more effectively.
The following code shows how the output looks when we print the BeautifulSoup object normally:
print(soup)
Output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><html><body><catalog> <plant> <common>Bloodroot</common> <botanical>Sanguinaria canadensis</botanical> <zone>4</zone> <light>Mostly Shady</light> <price>$2.44</price> <availability>031599</availability> </plant> <plant> <common>Marsh Marigold</common> <botanical>Caltha palustris</botanical> <zone>4</zone> <light>Mostly Sunny</light> <price>$6.81</price> <availability>051799</availability> </plant> <plant> <common>Cowslip</common> <botanical>Caltha palustris</botanical> <zone>4</zone> <light>Mostly Shady</light> <price>$9.90</price> <availability>030699</availability> </plant> </catalog> </body></html>
Now let us use the prettify method to improve the appearance of our output.
print(soup.prettify())
Output:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
 <body>
  <catalog>
   <plant>
    <common>
     Bloodroot
    </common>
    <botanical>
     Sanguinaria canadensis
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Shady
    </light>
    <price>
     $2.44
    </price>
    <availability>
     031599
    </availability>
   </plant>
   <plant>
    <common>
     Marsh Marigold
    </common>
    <botanical>
     Caltha palustris
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Sunny
    </light>
    <price>
     $6.81
    </price>
    <availability>
     051799
    </availability>
   </plant>
   <plant>
    <common>
     Cowslip
    </common>
    <botanical>
     Caltha palustris
    </botanical>
    <zone>
     4
    </zone>
    <light>
     Mostly Shady
    </light>
    <price>
     $9.90
    </price>
    <availability>
     030699
    </availability>
   </plant>
  </catalog>
 </body>
</html>
The Final Solution
We are now well versed with all the concepts required to extract data from a given XML document. It is now time to have a look at the final code where we shall be extracting the Name, Botanical Name, and Price of each plant in our example XML document (sample.xml).
Please follow the comments along with the code given below to get an understanding of the logic used in the solution.
from bs4 import BeautifulSoup

# Open and read the XML file
file = open("sample.xml", "r")
contents = file.read()

# Create the BeautifulSoup object and use the lxml parser
soup = BeautifulSoup(contents, 'lxml')

# Extract the contents of the common, botanical and price tags
plant_name = soup.find_all('common')          # store the names of the plants
scientific_name = soup.find_all('botanical')  # store the scientific names of the plants
price = soup.find_all('price')                # store the prices of the plants

# Use a for loop along with the enumerate function that keeps count of each iteration
for n, title in enumerate(plant_name):
    # Print the name of the plant using the text attribute
    print("Plant Name:", title.text)
    # Use the counter to access the matching index of the list that stores the scientific names
    print("Botanical Name: ", scientific_name[n].text)
    # Use the counter to access the matching index of the list that stores the prices
    print("Price: ", price[n].text)
    print()
Output:
Plant Name: Bloodroot
Botanical Name: Sanguinaria canadensis
Price: $2.44

Plant Name: Marsh Marigold
Botanical Name: Caltha palustris
Price: $6.81

Plant Name: Cowslip
Botanical Name: Caltha palustris
Price: $9.90
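As a small variation (a sketch, not part of the solution above), you can also iterate over each plant tag directly and read its fields from within that record, which avoids depending on the three lists staying aligned:

from bs4 import BeautifulSoup

# Parse sample.xml with the lxml parser, as before
with open("sample.xml", "r") as file:
    soup = BeautifulSoup(file.read(), 'lxml')

# Iterate over each plant record and read its fields directly from that record
for plant in soup.find_all('plant'):
    print("Plant Name:", plant.find('common').text)
    print("Botanical Name:", plant.find('botanical').text)
    print("Price:", plant.find('price').text)
    print()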
Conclusion
XML documents are an important means of transporting data, and hopefully, after reading this article, you are well equipped to extract the data you want from them. You might also want to have a look at this video series, where you can learn how to scrape webpages.
Please subscribe and stay tuned for more interesting articles in the future.
Where to Go From Here?
Enough theory. Let’s get some practice!
Coders get paid six figures and more because they can solve problems more effectively using machine intelligence and automation.
To become more successful in coding, solve more real problems for real people. That’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?
You build high-value coding skills by working on practical coding projects!
Do you want to stop learning with toy projects and focus on practical code projects that earn you money and solve real problems for people?
🚀 If your answer is YES!, consider becoming a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.
If you just want to learn about the freelancing opportunity, feel free to watch my free webinar “How to Build Your High-Income Skill Python” and learn how I grew my coding business online and how you can, too—from the comfort of your own home.