Problem Formulation and Solution Overview
ℹ️ XML is an acronym for Extensible Markup Language. This file type is similar to HTML. However, XML does not have pre-defined tags like HTML. Instead, a coder can define their own tags to meet specific requirements. XML is a great way to transmit and share data, either locally or via the internet. If structured correctly, an XML file can be parsed with any standard XML parser.
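To illustrate what "structured correctly" means in practice, here is a minimal sketch using the standard-library xml.etree.ElementTree module. The helper name is_well_formed and the two sample strings are illustrative assumptions and do not appear elsewhere in this article.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Custom tags are fine as long as every opening tag is closed.
good = '<bookstore><book><title>Surrender</title></book></bookstore>'
# The <title> tag is never closed, so the parser rejects this string.
bad = '<bookstore><book><title>Surrender</book></bookstore>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser does not care which tag names are used, only that the tags nest and close properly.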
This article covers the following four methods for reading and parsing an XML file in Python:
- Method 1: Use xmltodict.parse()
- Method 2: Use minidom.parse()
- Method 3: Use etree
- Method 4: Use untangle.parse()
Method 1: Use xmltodict.parse()
This method uses the xmltodict.parse() function to read an XML file, convert it to a Dictionary, and extract the data.
In the current working directory, create an XML file called books.xml. Copy and paste the code snippet below into this file and save it.
<bookstore>
<book>
<title>Surrender</title>
<author>Bono</author>
<sales>21987</sales>
</book>
<book>
<title>Going Rogue</title>
<author>Janet Evanovich</author>
<sales>15986</sales>
</book>
<book>
<title>Triple Cross</title>
<author>James Patterson</author>
<sales>11311</sales>
</book>
</bookstore>
In the current working directory, create a Python file called books.py. Copy and paste the code snippet below into this file and save it. This code reads in and parses the above XML file. If necessary, install the xmltodict library.
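import xmltodict

# Read the XML file and convert it to a dictionary.
# The with statement closes the file automatically.
with open('books.xml', 'r') as fp:
    books_dict = xmltodict.parse(fp.read())

# Walk down the levels: bookstore, then book, then each book entry.
for i in books_dict:
    for j in books_dict[i]:
        for k in books_dict[i][j]:
            print(f'Title: {k["title"]} \t Sales: {k["sales"]}')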
The first line in the above code snippet imports the xmltodict library. This library is needed to access and parse the XML file.
The with statement opens books.xml in read mode (r) and saves it as a File Object, fp. If fp were output to the terminal, an object similar to the one below would display.
<_io.TextIOWrapper name='books.xml' mode='r' encoding='cp1252'>
Next, the xmltodict.parse() function is called and passed one (1) argument, fp.read(), which reads in the contents of the XML file. The parsed results save to books_dict as a Dictionary, and the file is closed when the with block ends. The contents of books_dict are shown below.
{'bookstore':
{'book': [{'title': 'Surrender', 'author': 'Bono', 'sales': '21987'},
{'title': 'Going Rogue', 'author': 'Janet Evanovich', 'sales': '15986'},
{'title': 'Triple Cross', 'author': 'James Patterson', 'sales': '11311'}]}}
The final section loops through the above Dictionary and extracts each book's Title and Sales.
Title: Surrender Sales: 21987
Title: Going Rogue Sales: 15986
Title: Triple Cross Sales: 11311
💡 Note: The \t escape sequence inserts a tab character, the same character the <Tab> key produces.
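Because the parsed result is an ordinary Dictionary, the books can also be reached by key instead of through three nested loops. The sketch below starts from the Dictionary shown earlier, written out as a literal so that it runs without xmltodict installed. The variable names books and total_sales are illustrative, and the single-element check reflects how xmltodict is understood to behave when a tag appears only once.

```python
# The dictionary produced by xmltodict.parse() for books.xml,
# written out as a literal so this sketch runs on its own.
books_dict = {
    'bookstore': {
        'book': [
            {'title': 'Surrender', 'author': 'Bono', 'sales': '21987'},
            {'title': 'Going Rogue', 'author': 'Janet Evanovich', 'sales': '15986'},
            {'title': 'Triple Cross', 'author': 'James Patterson', 'sales': '11311'},
        ]
    }
}

books = books_dict['bookstore']['book']

# With a single <book> element the value would be one dictionary
# rather than a list, so normalise before looping.
if isinstance(books, dict):
    books = [books]

for book in books:
    print(f'Title: {book["title"]} \t Sales: {book["sales"]}')

# Element text arrives as strings; convert before doing arithmetic.
total_sales = sum(int(book['sales']) for book in books)
print(f'Total sales: {total_sales}')
```

Addressing the keys by name also makes the code easier to follow than the generic i, j, k loops.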
Method 2: Use minidom.parse()
This method uses the minidom.parse()
function to read and parse an XML file. This example extracts the ID, Title and Sales for each book.
This example differs from Method 1 as this XML file contains an XML declaration at the top (<?xml version="1.0"?>), a <storename> tag, and each <book> tag now has an id attribute assigned to it.
In the current working directory, create an XML file called books2.xml. Copy and paste the code snippet below into this file and save it.
<?xml version="1.0"?>
<bookstore>
<storename>Jan's Best Sellers List</storename>
<book id="21237">
<title>Surrender</title>
<author>Bono</author>
<sales>21987</sales>
</book>
<book id="21946">
<title>Going Rogue</title>
<author>Janet Evanovich</author>
<sales>15986</sales>
</book>
<book id="18241">
<title>Triple Cross</title>
<author>James Patterson</author>
<sales>11311</sales>
</book>
</bookstore>
In the current working directory, create a Python file called books2.py. Copy and paste the code snippet below into this file and save it.
from xml.dom import minidom

doc = minidom.parse('books2.xml')
name = doc.getElementsByTagName('storename')[0]
books = doc.getElementsByTagName('book')

for b in books:
    bid = b.getAttribute('id')
    title = b.getElementsByTagName('title')[0]
    sales = b.getElementsByTagName('sales')[0]
    print(f'{bid} {title.firstChild.data} {sales.firstChild.data}')
The first line in the above code snippet imports the minidom
library. This allows access to various functions to parse the XML
file and retrieve tags and attributes.
The next three statements perform the following:
- Reads and parses the books2.xml file and saves the results to doc. This action creates the Object shown as (1) below.
- Retrieves the <storename> tag and saves the results to name. This action creates an Object shown as (2) below.
- Retrieves the <book> tag for each book and saves the results to books. This action creates a List of three (3) Objects: one for each book shown as (3) below.
(1) <xml.dom.minidom.Document object at 0x0000022D764AFEE0>
(2) <DOM Element: storename at 0x22d764f0ee0>
(3) [<DOM Element: book at 0x22d764f3a30>,
<DOM Element: book at 0x22d764f3c70>,
<DOM Element: book at 0x22d764f3eb0>]
The final section loops through the books List and outputs the ID, Title and Sales for each book to the terminal.
21237 Surrender 21987
21946 Going Rogue 15986
18241 Triple Cross 11311
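The name variable above is retrieved but never printed. As a minimal sketch, the code below shows how the text of <storename> can be read the same way as the title and sales values. It parses a shortened copy of books2.xml from a string with minidom.parseString() so that it runs without the file on disk; the variable names xml_text and store and the isbn attribute are assumptions made for illustration.

```python
from xml.dom import minidom

# A shortened copy of books2.xml, parsed from a string so the
# sketch does not depend on a file in the working directory.
xml_text = """<?xml version="1.0"?>
<bookstore>
<storename>Jan's Best Sellers List</storename>
<book id="21237">
<title>Surrender</title>
<sales>21987</sales>
</book>
</bookstore>"""

doc = minidom.parseString(xml_text)

# firstChild is the text node inside the element; data is its string.
name = doc.getElementsByTagName('storename')[0]
store = name.firstChild.data
print(store)

book = doc.getElementsByTagName('book')[0]
# getAttribute() returns an empty string when the attribute is absent.
print(book.getAttribute('id'), repr(book.getAttribute('isbn')))
```

Checking for the empty string is a simple way to detect a <book> tag that has no id assigned.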
Method 3: Use etree
This method uses etree to read in and parse an XML file. This example extracts the Title, Author and Sales data for each book.
ℹ️ The etree module treats the XML file as a tree structure. Each element represents a node of said tree. Accessing elements is done on an element level.
This example reads in and parses the books2.xml
file created earlier.
import xml.etree.ElementTree as ET

xml_data = ET.parse('books2.xml')
root = xml_data.getroot()

for books in root.findall('book'):
    title = books.find('title').text
    author = books.find('author').text
    sales = books.find('sales').text
    print(title, author, sales)
The first line in the above code snippet imports the etree
library. This allows access to all nodes of the XML <tag>
structure.
The following line reads in and parses books2.xml. The results save as an ElementTree Object to xml_data. The getroot() method then returns the root element, which saves to root. If root is output to the terminal, an Object similar to the one below displays.
<Element 'bookstore' at 0x000001E45E9442C0>
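The difference between the parsed tree and its root element can be sketched as follows. This example builds the tree from a shortened, in-memory copy of the XML (an assumption, so that it runs without books2.xml on disk); the names xml_text, tree and child_tags are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<bookstore>
<storename>Jan's Best Sellers List</storename>
<book id="21237"><title>Surrender</title></book>
<book id="21946"><title>Going Rogue</title></book>
</bookstore>"""

# ET.fromstring() returns the root Element directly, whereas
# ET.parse() returns an ElementTree object that wraps it.
root = ET.fromstring(xml_text)
tree = ET.ElementTree(root)

print(type(tree).__name__)
print(tree.getroot() is root)
print(root.tag)

# Iterating over an element yields its direct children.
child_tags = [child.tag for child in root]
print(child_tags)
```

Each child is itself an Element, which is why find() and findall() can be called again on every <book> node.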
The for loop iterates through each <book> tag, extracting the <title>, <author> and <sales> tags for each book and outputting them to the terminal.
Surrender Bono 21987
Going Rogue Janet Evanovich 15986
Triple Cross James Patterson 11311
To retrieve the attribute of the <book>
tag, run the following code.
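A minimal sketch of this step, assuming root is the <bookstore> element from the snippet above. So that it runs on its own, the sketch parses a shortened copy of books2.xml from a string (an assumption); with the file on disk, ET.parse('books2.xml').getroot() gives the same starting point.

```python
import xml.etree.ElementTree as ET

xml_text = """<bookstore>
<book id="21237"><title>Surrender</title></book>
<book id="21946"><title>Going Rogue</title></book>
<book id="18241"><title>Triple Cross</title></book>
</bookstore>"""
root = ET.fromstring(xml_text)

# attrib is a dictionary holding the attributes of each <book> tag.
attribs = [book.attrib for book in root.iter('book')]
for a in attribs:
    print(a)
```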
This code extracts the id
attribute from each <book>
tag and outputs it to the terminal.
{'id': '21237'}
{'id': '21946'}
{'id': '18241'}
To extract the values, run the following code.
for book in root.iter('book'):
    vals = book.attrib.values()
    for v in vals:
        # Print each attribute value, not the whole collection.
        print(v)
21237
21946
18241
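When only one attribute is needed, the get() method of an element reads it by name and avoids the inner loop. The sketch below again uses a shortened, in-memory copy of the XML so that it runs on its own; the isbn attribute and the 'n/a' default are hypothetical and only demonstrate the fallback.

```python
import xml.etree.ElementTree as ET

xml_text = """<bookstore>
<book id="21237"><title>Surrender</title></book>
<book id="21946"><title>Going Rogue</title></book>
<book id="18241"><title>Triple Cross</title></book>
</bookstore>"""
root = ET.fromstring(xml_text)

# get() looks up a single attribute by name.
ids = [book.get('id') for book in root.iter('book')]
for book_id in ids:
    print(book_id)

# A second argument supplies a default for a missing attribute;
# without it, get() returns None.
missing = root.find('book').get('isbn', 'n/a')
print(missing)
```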
Method 4: Use untangle.parse()
This method uses untangle.parse() to read and parse an XML file.
This example reads in and parses the books3.xml
file shown below. If necessary, install the untangle
library.
ℹ️ The untangle library converts an XML file to a Python object. This is a good option when you have a group of items, such as book names.
In the current working directory, create an XML file called books3.xml. Copy and paste the code snippet below into this file and save it.
<?xml version="1.0"?>
<root>
<book name="Surrender"/>
<book name="Going Rogue"/>
<book name="Triple Cross"/>
</root>
In the current working directory, create a Python file called books3.py. Copy and paste the code snippet below into this file and save it.
import untangle

book_obj = untangle.parse('books3.xml')
books = ','.join([book['name'] for book in book_obj.root.book])

for b in books.split(','):
    print(b)
The first line in the above code snippet imports the untangle
library allowing access to the XML
file structure.
The following line reads in and parses the books3.xml
file. The results save to book_obj
.
The next line calls the join()
function and passes it one (1) argument: List Comprehension. This code iterates through and retrieves the name of each book and saves the results to books
. If output to the terminal, the following displays:
Surrender,Going Rogue,Triple Cross
The next line starts a for loop that iterates through each book name and sends it to the terminal.
Surrender
Going Rogue
Triple Cross
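The join-then-split round trip can be illustrated without the library. In this sketch, plain dictionaries stand in for the untangle elements (an assumption made so that the code runs on its own); the List Comprehension and the join() and split() calls mirror those in books3.py. The names book_elements and names are illustrative.

```python
# Stand-ins for book_obj.root.book: each untangle element supports
# book['name'], which a plain dictionary mimics here.
book_elements = [
    {'name': 'Surrender'},
    {'name': 'Going Rogue'},
    {'name': 'Triple Cross'},
]

# The list comprehension collects the names; join() glues them together.
names = [book['name'] for book in book_elements]
books = ','.join(names)
print(books)

# split(',') recovers the same list, so looping over names directly
# produces identical output.
print(books.split(',') == names)
for b in names:
    print(b)
```

Looping over the list directly also copes with a title that contains a comma, which the split(',') call would break into two entries.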
Summary
This article has shown four (4) ways to read and parse XML files so that you can select the best fit for your coding requirements.
Good Luck & Happy Coding!
