XML stands for Extensible Mark-Up Language. This language is another way to store data. Python can parse, modify and store this file with the appropriate libraries installed.
This article covers the commonly used parameters for each function listed above. For a complete list of all parameters and their use, click here.
Preparation
Before any data manipulation can occur, three new libraries will need installation.
- The Pandas
- The Pandas_read_xml library enables reading XML files.
- The lxml library enables handling of XML and HTML files and can also be used for web scraping.
To install the libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install pandas_read_xml
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install lxml
Hit the <Enter>
key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
- How to install Pandas on PyCharm
- How to install Pandas Read XML on PyCharm
- How to install lxml on PyCharm
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd import pandas_read_xml as pdx from pandas_read_xml import flatten from lxml import etree as et
Create XML File
For this example, we took three songs from the Rolling Stone website – The Greatest Songs of All Time. To create an XML file from this data, perform the following steps.
- Highlight the XML code below.
- Press
CTRL+C
to copy the highlighted area to the system Clipboard. - Open
Notepad
(or another plain text editor). - Press
CTRL+V
to paste the contents of the system Clipboard to Notepad. - Save the file as
songs.xml
and place this file in the current working directory.
<?xml version="1.0" encoding="UTF-8"?> <top_three> <song> <title>Stronger</title> <artist>Kanye West</artist> <year>2007</year> </song> <song> <title>Truth Hurts</title> <artist>Lizzo</artist> <year>2017</year> </song> <song> <title>You're So Vain</title> <artist>Carly Simon</artist> <year>1972</year> </song> </top_three>
Read XML File
This code reads in songs.xml
from the current working directory. Then converts this file to a DataFrame.
# ... Lines [1]-[4] see "Required Starter Code" df = pdx.read_xml('songs.xml', ['top_three']) df = df.pipe(flatten) df = df.pipe(flatten) print(df)
- Line [5] reads in the XML file created earlier and sets the root factor to
top_three
– the main tag in our XML file.
- Line [6-7] flattens the DataFrame.
Run thepipe
function twice to retrieve a well-formed DataFrame. - Line [8] outputs the DataFrame to the terminal.
Output
song|title | song|artist | song|year | |
0 | Stronger | Kanye West | 2007 |
1 | Truth Hurts | Lizzo | 2017 |
2 | You’re So Vain | Carly Simon | 1972 |
DataFrame to XML File
For this example, we set the data from above to a Dictionary of Lists.
# ... Lines [1]-[4] see "Required Starter Code" df = pd.DataFrame({'title': ['Stronger', 'Truth Hurts', 'You\'re So Vain'], 'artist': ['Kanye West', 'Lizzo', 'Carly Simon'], 'year': [2007, 2017, 1972]}) root = et.Element('top_three') for row in df.iterrows(): report = et.SubElement(root, 'song') stitle = et.SubElement(report, 'title') sartist = et.SubElement(report, 'artist') syear = et.SubElement(report, 'year') stitle.text = str(row[1]['title']) sartist.text = str(row[1]['artist']) syear.text = str(row[1]['year']) tmp = '<?xml version="1.0" encoding="UTF-8"?>\n' + et.tostring(root, pretty_print=True).decode('utf-8') with open('songs2.xml', 'w') as fp: fp.write(tmp)
- Line [5] creates a DataFrame from a Dictionary of Lists.
- Line [6] sets the top tag (root tag) for the XML file. The
et
reference is from thelxml
library imported in the Required Starter Code.
- Line [7] instantiates a for loop for each row in the DataFrame using
iterrows()
. - Line [8-11] loops through one record at a time and assigns variables to each tag in the DataFrame for that record.
If you print out these variables, you will get the following output:
Output
<Element song at 0x1fe39609380> <Element title at 0x1fe399b7540> <Element artist at 0x1fe399b7500> <Element year at 0x1fe399b7580>
- Line [12-14] retrieves the text for each tag inside the
<song>
tag.
Output
Stronger Kanye West 2007 Truth Hurts Lizzo 2017 You're So Vain Carly Simon 1972
- Line [15] does the following:
- Adds a header line at the top of the string.
- The data converts to a string.
- Nicely formats the XML file using the data saved in the
for
loop. - Uses utf-8 to decode the string.
- Line [16] opens a file for writing and assigns a filename.
- Line [17] writes the appropriate data and closes the file.
A new XML file songs2.xml
should now reside in the current working directly if successful. If you compare both files, you will find they are identical.