Problem Formulation and Solution Overview
Arman, a Music Appreciation student at the Royal Conservatory of Music, has been given course materials in an XML file format. Arman needs to purchase these books immediately. He’s into music, not computers. He needs you to format the output into a readable format.
Navigate to the Appendix Data section and download the XML file to follow along. Then, move this file to the current working directory.
💬 Question: How would we read in ax XML file and format the output?
We can accomplish this task by one of the following options:
- Method 1: Use Beautiful Soup
- Method 2: Use XML eTree
- Method 3: Use Minidom
- Method 4: Use Pandas
- The Pandas library enables access to/from a DataFrame.
- The Beautiful Soup library enables the parsing of XML and HTML files.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install bs4
Hit the <Enter>
key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd from bs4 import BeautifulSoup import xml.etree.ElementTree as ET import base64, xml.dom.minidom from xml.dom.minidom import Node
💡 Note: The additional libraries indicated above do not require installation as they come built-in to Python.
Method 1: Use Beautiful Soup
A clean, compact way to read an XML file is to use Python’s Beautiful Soup library. A “go-to” tool for web scraping and XML data extraction.
all_books = BeautifulSoup(open('books.xml'), 'xml') pretty_xml = all_books.prettify() print(pretty_xml)
- The books.xml file is read and parsed using Beautiful Soup’s XML parser. The results are saved to
all_books
. - Next, Beautiful Soup’s
prettify()
method is used to improve the appearance of the output. - Finally, the formatted output is sent to the terminal.
Output (snippet)
<?xml version="1.0"?> |
Method 2: Use XML eTree
The ElementTree library is built-in to Python and contains functions to read and parse XML and XML-like data structures. The hierarchical data format is based on a tree structure: a root representing the tree and elements representing the nodes.
all_books = ET.parse('books.xml').getroot() for b in all_books.findall('book'): print ("{:<20} {:<30} {:<30}".format(b.find('isbn').text, b.find('title').text, b.find('price').text))
- The books.xml file is read in and parsed using the
eTree parse()
function. The results are saved toall_books
. - Next, a For loop is instantiated. It traverses through each book in
all_books
.- Each book’s details are formatted into columns and output to the terminal.
Output (snippet)
978-0393050714 | Johann Sebastian Bach | $5.99 |
978-1721260522 | Ludwig van Beethoven | $9.99 |
978-0679745822 | Johannes Brahms | $7.99 |
979-8653086533 | Frederic Chopin | $7.99 |
978-1580469036 | Claude Debussy | $13.99 |
Method 3: Use minidom
Minidom is a smaller version of DOM and comes with an API similar to other programming languages. However, feedback indicates this method is slow and a memory hogger.
with open("books.xml",'r') as fp: data = fp.read() i = 0 all_books = xml.dom.minidom.parseString(data) for book in all_books.getElementsByTagName('book'): isbn = all_books.getElementsByTagName('isbn')[i].firstChild.nodeValue title = all_books.getElementsByTagName('title')[i].firstChild.nodeValue price = all_books.getElementsByTagName('price')[i].firstChild.nodeValue print ("{:<20} {:<25} {:<20}".format(isbn, title, price)) i +=1
- The books.xml file is opened, and a file object,
fp
is created.- The contents of this file are read in and saved to
data
.
- The contents of this file are read in and saved to
- A counter variable
i
is created to loop throughall_books
and is assigned the value 0. - Then
data
is read and parsed. The results save toall_books
. - A For loop is instantiated. It traverses through each book in
all_books
.- Four (4) variables are used to locate and save the appropriate values.
- They are formatted and output to the terminal in columns.
- The counter variable is increased by one (1).
Output (snippet)
978-0393050714 | Johann Sebastian Bach | $5.99 |
978-1721260522 | Ludwig van Beethoven | $9.99 |
978-0679745822 | Johannes Brahms | $7.99 |
979-8653086533 | Frederic Chopin | $7.99 |
978-1580469036 | Claude Debussy | $13.99 |
Method 4: Use Pandas read_xml()
The Pandas library has an option to read in an XML file and convert it to a DataFrame in one easy step.
df = pd.read_xml('books.xml') print(df)
- The books.xml file is read in and saved to the DataFrame
df
. - The output automatically formats into columns (including a header row) and is output to the terminal.
Output (snippet)
isbn | title | price | |
0 | 978-0393050714 | Johann Sebastian Bach | $5.99 |
1 | 978-1721260522 | Ludwig van Beethoven | $9.99 |
2 | 978-0679745822 | Johannes Brahms | $7.99 |
3 | 979-8653086533 | Frederic Chopin | $7.99 |
4 | 978-1580469036 | Claude Debussy | $13.99 |
Appendix Data
<?xml version="1.0"?> <catalog> <book> <isbn>978-0393050714</isbn> <title>Johann Sebastian Bach</title> <price>$5.99</price> </book> <book> <isbn>978-1721260522</isbn> <title>Ludwig van Beethoven</title> <price>$9.99</price> </book> <book> <isbn>978-0679745822</isbn> <title>Johannes Brahms</title> <price>$7.99</price> </book> <book> <isbn>979-8653086533</isbn> <title>Frederic Chopin</title> <price>$7.99</price> </book> <book> <isbn>978-1580469036</isbn> <title>Claude Debussy</title> <price>$13.99</price> </book> <book> <isbn>978-0520043176</isbn> <title>Joseph Haydn</title> <price>$25.99</price> </book> <book> <isbn>978-1981659968</isbn> <title>Wolfgang Amadeus Mozart</title> <price>$8.99</price> </book> <book> <isbn>978-1482379990</isbn> <title>Franz Schubert</title> <price>$26.99</price> </book> <book> <isbn>978-0486257488</isbn> <title>Robert Schumann</title> <price>$14.99</price> </book> <book> <isbn>978-0486442723</isbn> <title>Peter Tchaikovsky</title> <price>$12.95</price> </book> </catalog>
Summary
After reviewing the above methods in conjunction with Arman’s requirements, we decide that Method 4 best meets his needs.
Problem Solved! Happy Coding!