A Guide to XML with Pandas - Be on the Right Side of Change

XML stands for Extensible Mark-Up Language. This language is another way to store data. Python can parse, modify and store this file with the appropriate libraries installed.

This article covers the commonly used parameters for each function listed above. For a complete list of all parameters and their use, click here.

Preparation

Before any data manipulation can occur, three new libraries will need installation.

The Pandas library enables access to/from a DataFrame.
The Pandas_read_xml library enables reading XML files.
The lxml library enables handling of XML and HTML files and can also be used for web scraping.

To install the libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install pandas_read_xml

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install lxml

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.

Feel free to view the PyCharm installation guide for the required libraries.

How to install Pandas on PyCharm
How to install Pandas Read XML on PyCharm
How to install lxml on PyCharm

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd 
import pandas_read_xml as pdx
from pandas_read_xml import flatten
from lxml import etree as et

Create XML File

For this example, we took three songs from the Rolling Stone website – The Greatest Songs of All Time. To create an XML file from this data, perform the following steps.

Highlight the XML code below.
Press CTRL+C to copy the highlighted area to the system Clipboard.
Open Notepad (or another plain text editor).
Press CTRL+V to paste the contents of the system Clipboard to Notepad.
Save the file as songs.xml and place this file in the current working directory.

<?xml version="1.0" encoding="UTF-8"?>
<top_three>
	<song>
	<title>Stronger</title>
	<artist>Kanye West</artist>
	<year>2007</year>
	</song>
	<song>
	<title>Truth Hurts</title>
	<artist>Lizzo</artist>
	<year>2017</year>
	</song>
	<song>
	<title>You're So Vain</title>
	<artist>Carly Simon</artist>
	<year>1972</year>
	</song>
</top_three>

Read XML File

This code reads in songs.xml from the current working directory. Then converts this file to a DataFrame.

# ... Lines [1]-[4] see "Required Starter Code"
df = pdx.read_xml('songs.xml', ['top_three'])
df = df.pipe(flatten)
df = df.pipe(flatten)
print(df)

Line [5] reads in the XML file created earlier and sets the root factor to top_three – the main tag in our XML file.

Line [6-7] flattens the DataFrame.
Run the pipe function twice to retrieve a well-formed DataFrame.
Line [8] outputs the DataFrame to the terminal.

Output

	song\|title	song\|artist	song\|year
0	Stronger	Kanye West	2007
1	Truth Hurts	Lizzo	2017
2	You’re So Vain	Carly Simon	1972

DataFrame to XML File

For this example, we set the data from above to a Dictionary of Lists.

# ... Lines [1]-[4] see "Required Starter Code"
df = pd.DataFrame({'title':    ['Stronger', 'Truth Hurts', 'You\'re So Vain'], 
                  		 'artist': ['Kanye West', 'Lizzo', 'Carly Simon'],
                   		'year':   [2007, 2017, 1972]})

root = et.Element('top_three')

for row in df.iterrows():
    report       = et.SubElement(root,   'song')
    stitle       = et.SubElement(report, 'title')
    sartist      = et.SubElement(report, 'artist')
    syear        = et.SubElement(report, 'year')

    stitle.text  = str(row[1]['title'])
    sartist.text = str(row[1]['artist'])
    syear.text   = str(row[1]['year'])

tmp = '<?xml version="1.0" encoding="UTF-8"?>\n' + et.tostring(root, pretty_print=True).decode('utf-8')

with open('songs2.xml', 'w') as fp:
    fp.write(tmp)

Line [5] creates a DataFrame from a Dictionary of Lists.
Line [6] sets the top tag (root tag) for the XML file. The et reference is from the lxml library imported in the Required Starter Code.

Line [7] instantiates a for loop for each row in the DataFrame using iterrows().
Line [8-11] loops through one record at a time and assigns variables to each tag in the DataFrame for that record.

If you print out these variables, you will get the following output:

Output

<Element song at 0x1fe39609380>
<Element title at 0x1fe399b7540>
<Element artist at 0x1fe399b7500>
<Element year at 0x1fe399b7580>

Line [12-14] retrieves the text for each tag inside the <song> tag.

Output

Stronger
Kanye West
2007
Truth Hurts
Lizzo
2017
You're So Vain
Carly Simon
1972

Line [15] does the following:
- Adds a header line at the top of the string.
- The data converts to a string.
- Nicely formats the XML file using the data saved in the for loop.
- Uses utf-8 to decode the string.
Line [16] opens a file for writing and assigns a filename.
Line [17] writes the appropriate data and closes the file.

A new XML file songs2.xml should now reside in the current working directly if successful. If you compare both files, you will find they are identical.