How to Read an XLS File in Python?

5/5 - (3 votes)

Problem Formulation and Solution Overview

In this article, you’ll learn how to read an XML file and format the output in Python.

To make it more fun, we have the following running scenario:

Arman, a Music Appreciation student at the Royal Conservatory of Music, has been given course materials in an XML file format. Arman needs to purchase these books immediately. He’s into music, not computers. He needs you to format the output into a readable format.

Navigate to the Appendix Data section and download the XML file to follow along. Then, move this file to the current working directory.

💬 Question: How would we read in ax XML file and format the output?

We can accomplish this task by one of the following options:


Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The Beautiful Soup library enables the parsing of XML and HTML files.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install bs4

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd 
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import base64, xml.dom.minidom
from xml.dom.minidom import Node

💡 Note: The additional libraries indicated above do not require installation as they come built-in to Python.


Method 1: Use Beautiful Soup

A clean, compact way to read an XML file is to use Python’s Beautiful Soup library. A “go-to” tool for web scraping and XML data extraction.

all_books = BeautifulSoup(open('books.xml'), 'xml')
pretty_xml = all_books.prettify()
print(pretty_xml)
  • The books.xml file is read and parsed using Beautiful Soup’s XML parser. The results are saved to all_books.
  • Next, Beautiful Soup’s prettify() method is used to improve the appearance of the output.
  • Finally, the formatted output is sent to the terminal.

Output (snippet)

<?xml version="1.0"?>
<catalog>   
<book>     
<isbn>978-0393050714</isbn>
      <title>Johann Sebastian Bach</title>
      <price>$5.99</price>
   </book>
......
</catalog>

Method 2: Use XML eTree

The ElementTree library is built-in to Python and contains functions to read and parse XML and XML-like data structures. The hierarchical data format is based on a tree structure: a root representing the tree and elements representing the nodes.

all_books = ET.parse('books.xml').getroot()

for b in all_books.findall('book'):
    print ("{:<20} {:<30} {:<30}".format(b.find('isbn').text, 
                                         b.find('title').text, 
                                         b.find('price').text))
  • The books.xml file is read in and parsed using the eTree parse() function. The results are saved to all_books.
  • Next, a For loop is instantiated. It traverses through each book in all_books.
    • Each book’s details are formatted into columns and output to the terminal.

Output (snippet)

978-0393050714Johann Sebastian Bach$5.99
978-1721260522Ludwig van Beethoven$9.99
978-0679745822Johannes Brahms$7.99
979-8653086533Frederic Chopin$7.99
978-1580469036Claude Debussy$13.99

Method 3: Use minidom

Minidom is a smaller version of DOM and comes with an API similar to other programming languages. However, feedback indicates this method is slow and a memory hogger.

with open("books.xml",'r') as fp:
    data = fp.read()

i = 0
all_books = xml.dom.minidom.parseString(data)
for book in all_books.getElementsByTagName('book'):
   isbn   = all_books.getElementsByTagName('isbn')[i].firstChild.nodeValue
   title  = all_books.getElementsByTagName('title')[i].firstChild.nodeValue
   price  = all_books.getElementsByTagName('price')[i].firstChild.nodeValue
   print ("{:<20} {:<25} {:<20}".format(isbn, title, price))
   i +=1
  • The books.xml file is opened, and a file object, fp is created.
    • The contents of this file are read in and saved to data.
  • A counter variable i is created to loop through all_books and is assigned the value 0.
  • Then data is read and parsed. The results save to all_books.
  • A For loop is instantiated. It traverses through each book in all_books.
    • Four (4) variables are used to locate and save the appropriate values.
    • They are formatted and output to the terminal in columns.
    • The counter variable is increased by one (1).

Output (snippet)

978-0393050714Johann Sebastian Bach$5.99
978-1721260522Ludwig van Beethoven$9.99
978-0679745822Johannes Brahms$7.99
979-8653086533Frederic Chopin$7.99
978-1580469036Claude Debussy$13.99

Method 4: Use Pandas read_xml()

The Pandas library has an option to read in an XML file and convert it to a DataFrame in one easy step.

df = pd.read_xml('books.xml')
print(df)
Reading and writing XML with Pandas
  • The books.xml file is read in and saved to the DataFrame df.
  • The output automatically formats into columns (including a header row) and is output to the terminal.

Output (snippet)

isbntitleprice
0978-0393050714Johann Sebastian Bach$5.99
1978-1721260522Ludwig van Beethoven$9.99
2978-0679745822Johannes Brahms$7.99
3979-8653086533Frederic Chopin$7.99
4978-1580469036Claude Debussy$13.99

Appendix Data

<?xml version="1.0"?>
<catalog>
   <book>
      <isbn>978-0393050714</isbn>
      <title>Johann Sebastian Bach</title>
      <price>$5.99</price>
   </book>
   <book>
      <isbn>978-1721260522</isbn>
      <title>Ludwig van Beethoven</title>
      <price>$9.99</price>
   </book>
   <book>
      <isbn>978-0679745822</isbn>
      <title>Johannes Brahms</title>
      <price>$7.99</price>
   </book>
   <book>
      <isbn>979-8653086533</isbn>
      <title>Frederic Chopin</title>
      <price>$7.99</price>
   </book>
   <book>
      <isbn>978-1580469036</isbn>
      <title>Claude Debussy</title>
      <price>$13.99</price>
   </book>
   <book>
      <isbn>978-0520043176</isbn>
      <title>Joseph Haydn</title>
      <price>$25.99</price>
   </book>
   <book>
      <isbn>978-1981659968</isbn>
      <title>Wolfgang Amadeus Mozart</title>
      <price>$8.99</price>
   </book>
   <book>
      <isbn>978-1482379990</isbn>
      <title>Franz Schubert</title>
      <price>$26.99</price>
   </book>
   <book>
      <isbn>978-0486257488</isbn>
      <title>Robert Schumann</title>
      <price>$14.99</price>
   </book>
   <book>
      <isbn>978-0486442723</isbn>
      <title>Peter Tchaikovsky</title>
      <price>$12.95</price>
   </book>
</catalog>

Summary

After reviewing the above methods in conjunction with Arman’s requirements, we decide that Method 4 best meets his needs.

Problem Solved! Happy Coding!