Parsing XML Files in Python – 4 Simple Ways

Problem Formulation and Solution Overview

This article will show you various ways to work with an XML file.

ℹ️ XML is an acronym for Extensible Markup Language. This file type is similar to HTML. However, XML does not have pre-defined tags like HTML. Instead, a coder can define their own tags to meet specific requirements. XML is a great way to transmit and share data, either locally or via the internet. A well-formed XML file can be parsed with any standard XML parser.
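To illustrate the point about structure, here is a minimal sketch that parses one well-formed and one malformed snippet with the standard library's xml.etree.ElementTree module. The helper name is_well_formed and the two sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Custom tags are fine as long as every opening tag is closed properly.
good = '<bookstore><book><title>Surrender</title></book></bookstore>'
# Here the <title> tag is never closed, so the parser rejects the document.
bad = '<bookstore><book><title>Surrender</book></bookstore>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser does not care what the tags are called, only that they are nested and closed consistently.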

To make it more interesting, we have the following running scenario:

Jan, a Bookstore Owner, wants to know the top three (3) best-selling Books in her store. This data is currently saved in an XML format.


💬 Question: How would we write code to read in and extract data from an XML file in a Python script?

We can accomplish this with one of the following methods:

  • Method 1: xmltodict
  • Method 2: minidom
  • Method 3: etree
  • Method 4: untangle


Method 1: Use xmltodict.parse()

This method uses the xmltodict.parse() function to read an XML file, convert it to a Dictionary and extract the data.

In the current working directory, create an XML file called books.xml. Copy and paste the code snippet below into this file and save it.

<bookstore>
    <book>
        <title>Surrender</title>
        <author>Bono</author>
        <sales>21987</sales>
    </book>
    <book>
        <title>Going Rogue</title>
        <author>Janet Evanovich</author>
        <sales>15986</sales>
    </book>
    <book>
        <title>Triple Cross</title>
        <author>James Patterson</author>
        <sales>11311</sales>
    </book>
</bookstore>

In the current working directory, create a Python file called books.py. Copy and paste the code snippet below into this file and save it. This code reads in and parses the above XML file. If necessary, install the xmltodict library (for example, with pip install xmltodict).

import xmltodict

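# The with statement closes the file automatically when the block ends.
with open('books.xml', 'r') as fp:
    books_dict = xmltodict.parse(fp.read())

# books_dict['bookstore']['book'] is a list with one dictionary per <book>.
for book in books_dict['bookstore']['book']:
    print(f'Title: {book["title"]} \t Sales: {book["sales"]}')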

The first line in the above code snippet imports the xmltodict library. This library is needed to access and parse the XML file.

The with block opens books.xml in read mode (r) and saves it as a File Object, fp. If fp were output to the terminal, an object similar to the one below would display (the encoding shown depends on your platform).

<_io.TextIOWrapper name='books.xml' mode='r' encoding='cp1252'>

Next, the xmltodict.parse() function is called and passed one (1) argument, fp.read(), which reads in and parses the contents of the XML file. The results save to books_dict as a Dictionary, and the file is closed. The contents of books_dict are shown below.

{'bookstore': 
  {'book': [{'title': 'Surrender', 'author': 'Bono', 'sales': '21987'}, 
              {'title': 'Going Rogue', 'author': 'Janet Evanovich', 'sales': '15986'}, 
              {'title': 'Triple Cross', 'author': 'James Patterson', 'sales': '11311'}]}}

The final section loops through the above Dictionary and extracts each book's Title and Sales.

Title: Surrender         Sales: 21987
Title: Going Rogue    Sales: 15986
Title: Triple Cross      Sales: 11311
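Jan's actual question is which books sell best. The sales figures come back as strings, so they need converting before a numeric sort. The sketch below assumes the dictionary has exactly the shape shown above and writes it out as a literal so that it runs without xmltodict installed. The names top_three and rank are introduced here for illustration.

```python
# The dictionary produced in Method 1, written out as a literal so that
# this sketch runs without the xmltodict library installed.
books_dict = {
    'bookstore': {
        'book': [
            {'title': 'Surrender', 'author': 'Bono', 'sales': '21987'},
            {'title': 'Going Rogue', 'author': 'Janet Evanovich', 'sales': '15986'},
            {'title': 'Triple Cross', 'author': 'James Patterson', 'sales': '11311'},
        ]
    }
}

books = books_dict['bookstore']['book']

# Sales arrive as strings, so convert to int before sorting numerically.
top_three = sorted(books, key=lambda b: int(b['sales']), reverse=True)[:3]

for rank, book in enumerate(top_three, start=1):
    print(f"{rank}. {book['title']} ({int(book['sales']):,} copies)")
```

Sorting the raw strings instead would compare them character by character, which gives wrong rankings as soon as the figures differ in length.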

💡 Note: The \t escape sequence inserts a tab character into the output.


Method 2: Use minidom.parse()

This method uses the minidom.parse() function to read and parse an XML file. This example extracts the ID, Title and Sales for each book.

This example differs from Method 1: the XML file starts with an XML declaration (<?xml version="1.0"?>), contains a <storename> tag, and each <book> tag now has an id attribute assigned to it.

In the current working directory, create an XML file called books2.xml. Copy and paste the code snippet below into this file and save it.

<?xml version="1.0"?>
<bookstore>
    <storename>Jan's Best Sellers List</storename>
    <book id="21237">
        <title>Surrender</title>
        <author>Bono</author>
        <sales>21987</sales>
    </book>
    <book id="21946">
        <title>Going Rogue</title>
        <author>Janet Evanovich</author>
        <sales>15986</sales>
    </book>
    <book id="18241">
        <title>Triple Cross</title>
        <author>James Patterson</author>
        <sales>11311</sales>
    </book>
</bookstore>

In the current working directory, create a Python file called books2.py. Copy and paste the code snippet below into this file and save it.

from xml.dom import minidom

doc = minidom.parse('books2.xml')
name = doc.getElementsByTagName('storename')[0]
books = doc.getElementsByTagName('book')

for b in books:
    bid = b.getAttribute('id')
    title = b.getElementsByTagName('title')[0]
    sales = b.getElementsByTagName('sales')[0]
    print(f'{bid} {title.firstChild.data} {sales.firstChild.data}')

The first line in the above code snippet imports the minidom module from the standard library's xml.dom package. This allows access to various functions to parse the XML file and retrieve tags and attributes.

The next three lines perform the following:

  • Reads and parses the books2.xml file and saves the results to doc. This action creates the Object shown as (1) below.
  • Retrieves the <storename> tag and saves the results to name. This action creates an Object shown as (2) below.
  • Retrieves the <book> tag for each book and saves the results to books. This action creates a List of three (3) Objects: one for each book shown as (3) below.

(1) <xml.dom.minidom.Document object at 0x0000022D764AFEE0> 
(2) <DOM Element: storename at 0x22d764f0ee0> 
(3) [<DOM Element: book at 0x22d764f3a30>, 
<DOM Element: book at 0x22d764f3c70>, 
<DOM Element: book at 0x22d764f3eb0>]

The for loop iterates through the books List, reads each book's id attribute, title and sales, and outputs the results to the terminal.

21237 Surrender 21987
21946 Going Rogue 15986
18241 Triple Cross 11311
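The script retrieves the <storename> tag but never prints it. As a rough sketch, its text can be read the same way as a title. To stay self-contained, this version parses a shortened inline copy of books2.xml with minidom.parseString() instead of reading the file. The variable names xml_text and store_name are assumptions made for this example.

```python
from xml.dom import minidom

# A shortened, inline copy of books2.xml so the sketch is self-contained.
xml_text = """<?xml version="1.0"?>
<bookstore>
    <storename>Jan's Best Sellers List</storename>
    <book id="21237">
        <title>Surrender</title>
        <sales>21987</sales>
    </book>
</bookstore>"""

doc = minidom.parseString(xml_text)

# getElementsByTagName() returns a list-like NodeList, hence the [0].
name = doc.getElementsByTagName('storename')[0]

# The element's text lives in a child text node, reached via firstChild.
store_name = name.firstChild.data
print(store_name)
```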

Method 3: Use etree

This method uses etree to read in and parse an XML file. This example extracts the Title, Author and Sales data for each book.

ℹ️ The etree module treats the XML file as a tree structure. Each element represents a node of that tree, and data is accessed element by element.
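To make the tree idea concrete, the following sketch builds a small tree from an inline string and walks one level at a time. The sample XML is a cut-down stand-in for the bookstore files in this article, and first_book is a name chosen for the example.

```python
import xml.etree.ElementTree as ET

# Inline sample with the same shape as the bookstore files in this article.
root = ET.fromstring(
    '<bookstore>'
    '<book id="21237"><title>Surrender</title><sales>21987</sales></book>'
    '<book id="21946"><title>Going Rogue</title><sales>15986</sales></book>'
    '</bookstore>'
)

print(root.tag)                          # the root node
print([child.tag for child in root])     # its direct children

# Iterating over an element yields its child elements, one level down.
first_book = root[0]
for child in first_book:
    print(child.tag, child.text)
```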

This example reads in and parses the books2.xml file created earlier.

import xml.etree.ElementTree as ET

xml_data = ET.parse('books2.xml')
root = xml_data.getroot()

for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    sales = book.find('sales').text
    print(title, author, sales)

The first line in the above code snippet imports the ElementTree module under the alias ET. This allows access to all nodes of the XML <tag> structure.

The next two lines read in and parse books2.xml, save the result to xml_data, and retrieve the root element with getroot(). If root is output to the terminal, an Object similar to the one below displays.

<Element 'bookstore' at 0x000001E45E9442C0>

The for loop iterates through each <book> tag, extracts the text of the <title>, <author> and <sales> tags for each book and outputs them to the terminal.

Surrender Bono 21987
Going Rogue Janet Evanovich 15986
Triple Cross James Patterson 11311
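One thing to watch: find() returns None when a tag is absent, so calling .text on the result raises an AttributeError. A possible way around this, sketched below on made-up inline data where one <book> lacks a <sales> tag, is findtext() with a default value. The placeholder defaults '(untitled)' and '0' are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Inline sample; the second <book> deliberately has no <sales> tag.
root = ET.fromstring(
    '<bookstore>'
    '<book><title>Surrender</title><sales>21987</sales></book>'
    '<book><title>Going Rogue</title></book>'
    '</bookstore>'
)

rows = []
for book in root.findall('book'):
    # findtext() returns the default instead of failing on a missing tag.
    title = book.findtext('title', default='(untitled)')
    sales = int(book.findtext('sales', default='0'))
    rows.append((title, sales))

print(rows)
```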

To retrieve the attribute of the <book> tag, run the following code.
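A minimal sketch of this step, assuming root is the root element of books2.xml. It is recreated here from a trimmed inline string so the snippet runs on its own; with the real file, root from the earlier code can be used directly.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for books2.xml, trimmed to the <book> tags and their ids.
root = ET.fromstring(
    '<bookstore>'
    '<book id="21237"><title>Surrender</title></book>'
    '<book id="21946"><title>Going Rogue</title></book>'
    '<book id="18241"><title>Triple Cross</title></book>'
    '</bookstore>'
)

# Element.attrib is a dictionary holding the attributes of a tag.
for book in root.iter('book'):
    print(book.attrib)
```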

This code extracts the id attribute from each <book> tag and outputs it to the terminal.

{'id': '21237'}
{'id': '21946'}
{'id': '18241'}

To extract the values, run the following code.

for book in root.iter('book'):
    for v in book.attrib.values():
        print(v)

21237
21946
18241

Method 4: Use untangle.parse()

This method uses untangle.parse() to parse an XML file.

This example reads in and parses the books3.xml file shown below. If necessary, install the untangle library (for example, with pip install untangle).

ℹ️ The untangle library converts an XML file to a Python object. This is a good option when you have a group of items, such as book names.

In the current working directory, create an XML file called books3.xml. Copy and paste the code snippet below into this file and save it.

<?xml version="1.0"?>
<root>
    <book name="Surrender"/>
    <book name="Going Rogue"/>
    <book name="Triple Cross"/>
</root>

In the current working directory, create a Python file called books3.py. Copy and paste the code snippet below into this file and save it.

import untangle

book_obj = untangle.parse('books3.xml')
books = ','.join([book['name'] for book in book_obj.root.book])

for b in books.split(','):
    print(b)

The first line in the above code snippet imports the untangle library allowing access to the XML file structure.

The following line reads in and parses the books3.xml file. The results save to book_obj.

The next line calls the join() method on a comma and passes it one (1) argument: a List Comprehension. This code iterates through the <book> tags, retrieves the name attribute of each, and saves the results to books as a single comma-separated string. If output to the terminal, the following displays:

Surrender,Going Rogue,Triple Cross

The final for loop splits this string on the commas, iterates through each book name, and sends it to the terminal.

Surrender
Going Rogue
Triple Cross

Summary

This article has shown four (4) ways to work with XML files so you can select the best fit for your coding requirements.

Good Luck & Happy Coding!

