XML stands for Extensible Mark-Up Language. This language is another way to store data. Python can parse, modify and store this file with the appropriate libraries installed.
This article covers the commonly used parameters for each function listed above. For a complete list of all parameters and their use, click here.
Preparation
Before any data manipulation can occur, three new libraries will need installation.
- The Pandas- The Pandas_read_xml library enables reading XML files.
- The lxml library enables handling of XML and HTML files and can also be used for web scraping.
To install the libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install pandas_read_xml
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install lxml
 Hit the <Enter> key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
- How to install Pandas on PyCharm
- How to install Pandas Read XML on PyCharm
- How to install lxml on PyCharm
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd import pandas_read_xml as pdx from pandas_read_xml import flatten from lxml import etree as et
Create XML File
For this example, we took three songs from the Rolling Stone website – The Greatest Songs of All Time. To create an XML file from this data, perform the following steps.
- Highlight the XML code below.
- Press CTRL+Cto copy the highlighted area to the system Clipboard.
- Open Notepad(or another plain text editor).
- Press CTRL+Vto paste the contents of the system Clipboard to Notepad.
- Save the file as songs.xmland place this file in the current working directory.
<?xml version="1.0" encoding="UTF-8"?> <top_three> <song> <title>Stronger</title> <artist>Kanye West</artist> <year>2007</year> </song> <song> <title>Truth Hurts</title> <artist>Lizzo</artist> <year>2017</year> </song> <song> <title>You're So Vain</title> <artist>Carly Simon</artist> <year>1972</year> </song> </top_three>
Read XML File
This code reads in songs.xml from the current working directory. Then converts this file to a DataFrame.
# ... Lines [1]-[4] see "Required Starter Code"
df = pdx.read_xml('songs.xml', ['top_three'])
df = df.pipe(flatten)
df = df.pipe(flatten)
print(df)
- Line [5] reads in the XML file created earlier and sets the root factor to top_three– the main tag in our XML file.

- Line [6-7] flattens the DataFrame. 
 Run thepipefunction twice to retrieve a well-formed DataFrame.
- Line [8] outputs the DataFrame to the terminal.
Output
| song|title | song|artist | song|year | |
| 0 | Stronger | Kanye West | 2007 | 
| 1 | Truth Hurts | Lizzo | 2017 | 
| 2 | You’re So Vain | Carly Simon | 1972 | 
DataFrame to XML File
For this example, we set the data from above to a Dictionary of Lists.
# ... Lines [1]-[4] see "Required Starter Code"
df = pd.DataFrame({'title':    ['Stronger', 'Truth Hurts', 'You\'re So Vain'], 
                  		 'artist': ['Kanye West', 'Lizzo', 'Carly Simon'],
                   		'year':   [2007, 2017, 1972]})
root = et.Element('top_three')
for row in df.iterrows():
    report       = et.SubElement(root,   'song')
    stitle       = et.SubElement(report, 'title')
    sartist      = et.SubElement(report, 'artist')
    syear        = et.SubElement(report, 'year')
    stitle.text  = str(row[1]['title'])
    sartist.text = str(row[1]['artist'])
    syear.text   = str(row[1]['year'])
tmp = '<?xml version="1.0" encoding="UTF-8"?>\n' + et.tostring(root, pretty_print=True).decode('utf-8')
with open('songs2.xml', 'w') as fp:
    fp.write(tmp)
- Line [5] creates a DataFrame from a Dictionary of Lists.
- Line [6] sets the top tag (root tag) for the XML file. The etreference is from thelxmllibrary imported in the Required Starter Code.

- Line [7] instantiates a for loop for each row in the DataFrame using iterrows().
- Line [8-11] loops through one record at a time and assigns variables to each tag in the DataFrame for that record.

If you print out these variables, you will get the following output:
Output
<Element song at 0x1fe39609380> <Element title at 0x1fe399b7540> <Element artist at 0x1fe399b7500> <Element year at 0x1fe399b7580>
- Line [12-14] retrieves the text for each tag inside the <song>tag.

Output
Stronger Kanye West 2007 Truth Hurts Lizzo 2017 You're So Vain Carly Simon 1972
- Line [15] does the following:- Adds a header line at the top of the string.
- The data converts to a string.
- Nicely formats the XML file using the data saved in the forloop.
- Uses utf-8 to decode the string.
 
- Line [16] opens a file for writing and assigns a filename.
- Line [17] writes the appropriate data and closes the file.
A new XML file songs2.xml should now reside in the current working directly if successful. If you compare both files, you will find they are identical.
