Reading and Writing XML with Pandas

In this tutorial, we will learn how to read XML documents into a Pandas data frame using the read_xml() function and how to render a data frame into an XML object with the to_xml() function. Being able to work with XML documents in Pandas is very useful since we often find data stored in the XML format, especially when working with web data.

What is XML?

Before we start working with XML documents, let’s first clarify what XML is. “XML” stands for “Extensible Markup Language”, so it is a markup language, just like HTML. The key difference is that XML was designed to store and transport data, whereas HTML was designed to display it. Furthermore, unlike HTML tags, XML tags are not predefined.

Let’s have a look at an XML document:

<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>

This document contains hierarchical information about students. The first line is the XML prolog, which defines the XML version and the character encoding. After that comes the “data” tag, which is the root element of the document and wraps the information about the students. The “student” tags are the children of the “data” tag. Each student has a “name”, a “major”, and an “age” tag. Note that the tag names here are defined by the author of the document; they are not prescribed by the XML standard.
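
To make the hierarchy concrete, here is a minimal sketch that walks the same document with Python's standard-library xml.etree.ElementTree module. That module is not part of Pandas and is used here only for illustration. The variable names are our own, and the XML prolog is left out for brevity.

```python
import xml.etree.ElementTree as ET

# The example document from above, without the XML prolog.
xml_text = """<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>"""

root = ET.fromstring(xml_text)

# The root element wraps everything else in the document.
root_tag = root.tag

# Each direct child of the root is one "student" element;
# the children of a student hold the actual values as text.
students = [{child.tag: child.text for child in student} for student in root]

print(root_tag)
for student in students:
    print(student)
```

Note that every value, including the age, is plain text at this level. Turning the values into typed columns is what Pandas does for us later.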

Converting an XML document into a Pandas data frame

In this section, we will learn how to read in XML documents using the read_xml() function and how to convert these XML documents into Pandas data frames. You can find the parameters for the read_xml() function in the official documentation.

We will start with the example XML document from the last section which is contained in a separate file:

import pandas as pd
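# A raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences
df = pd.read_xml(r"C:\Projects\Finxter articles example code\example.xml")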
print(df)

	name	major	age
0	Alice	Computer Science	20
1	Bob	Philosophy	22
2	Mary	Biology	21

First, we import the Pandas library. Then, we create a Pandas data frame and assign it to the variable “df”. We do this by calling the read_xml() function and passing it the path of the XML file as a string. Finally, we print “df” and get a typical Pandas data frame.

By default, the read_xml() function treats each child of the root element as one row (the default value of its “xpath” parameter is “./*”) and turns the child elements and attributes of each row into columns. That is why the resulting data frame contains neither the “data” tag nor any “student” tag, although the content of the XML file is wrapped in a “data” tag and each student’s information is wrapped in a “student” tag. Only the tags that hold the actual values, namely “name”, “major”, and “age”, become columns.
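
The default selection can also be spelled out explicitly. The sketch below assumes the data is available inline rather than in a file and uses a shortened two-student document of our own. It passes xpath="./*", the documented default of read_xml(), and compares the result with a call that leaves the parameter out. The parser="etree" argument is our choice so that the example runs without the optional lxml package.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative version of the example document.
xml_text = """<data>
    <student><name>Alice</name><major>Computer Science</major><age>20</age></student>
    <student><name>Bob</name><major>Philosophy</major><age>22</age></student>
</data>"""

# Without xpath, read_xml() falls back to its default "./*".
df_default = pd.read_xml(StringIO(xml_text), parser="etree")

# "./*" selects every direct child of the root (<student>) as one row;
# the children of each row (<name>, <major>, <age>) become the columns.
df_explicit = pd.read_xml(StringIO(xml_text), xpath="./*", parser="etree")

print(list(df_explicit.columns))
print(df_default.equals(df_explicit))
```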

The XML document we imported here came from a file on our computer. We could also put in a URL here to import an XML file from the web.
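
Since the file path above exists only on the author's machine, here is a self-contained sketch of the same idea. It first writes the example document to a temporary directory and then reads it back. The temporary location and the variable names are assumptions made for illustration, and parser="etree" is chosen only so that the code runs without lxml installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Illustrative location; any XML file on disk is read the same way.
    xml_path = Path(tmp_dir) / "example.xml"
    xml_path.write_text(xml_text, encoding="utf-8")

    # read_xml() accepts path objects as well as path strings.
    df_from_file = pd.read_xml(xml_path, parser="etree")

print(df_from_file)
```

Building the path with pathlib also sidesteps the backslash issues that hand-written Windows paths can cause.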

Apart from a separate file, our XML data might also be assigned to a string in the same file as our code:

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>"""

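Here, we have the same XML data as before, but this time it is contained in a string assigned to the variable “xml”. To read in this XML data, we wrap the string in a StringIO object, which gives read_xml() a file-like object to read from. Passing the raw string directly is deprecated since pandas 2.1:

from io import StringIO

df = pd.read_xml(StringIO(xml))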
print(df)

	name	major	age
0	Alice	Computer Science	20
1	Bob	Philosophy	22
2	Mary	Biology	21

Instead of a path, we pass the XML data held in the variable “xml” to the read_xml() function.
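
XML stores every value as text, but read_xml() infers column types while building the data frame. The following sketch checks this on the same data. The variable names are our own, and parser="etree" is an assumption made so that the code runs without lxml.

```python
from io import StringIO

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>"""

df_students = pd.read_xml(StringIO(xml_text), parser="etree")

# The "age" column is inferred as an integer column, so numeric
# operations work without any manual conversion.
print(df_students.dtypes)
mean_age = df_students["age"].mean()
print(mean_age)
```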

Alternative Structure of an XML Object

Not every XML document is suitable for conversion into a Pandas data frame, and those that are suitable are not all structured in the same way. In this section, we will look at an alternative XML structure and convert it into a Pandas data frame using the “xpath” parameter of the read_xml() function.

Let’s have a look at the following XML data assigned as a string to the variable “xml”:

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student name = "Alice" major = "Computer Science" age = "20"/>
    <student name = "Bob" major = "Philosophy" age = "22"/>
    <student name = "Mary" major = "Biology" age = "21"/>
</data>"""

This XML data contains the same information as the document we have seen above, but in a more compact form. Like before, we have the “data” tag that wraps around our actual information. But unlike before, each student’s information is combined in a single tag. “student” is the name of the element here, whereas “name”, “major”, and “age” are the element’s attributes.

To read this XML data in properly, we do the following:

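from io import StringIO

# Wrap the string in StringIO; passing a raw XML string is deprecated since pandas 2.1
df = pd.read_xml(StringIO(xml), xpath=".//student")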
print(df)

	name	major	age
0	Alice	Computer Science	20
1	Bob	Philosophy	22
2	Mary	Biology	21

This time, we use the “xpath” parameter and assign it the string “.//student”. The “xpath” parameter takes an XPath expression that selects the elements to use as rows; “.//student” selects every “student” element in the document. The resulting data frame uses the attribute names as column names and the attribute values as the cell values.
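
The “xpath” parameter becomes important when the row elements are not direct children of the root. The sketch below uses a made-up document in which the students are nested one level deeper, next to an unrelated “info” element. The document, its tag names, and the variable names are hypothetical, and parser="etree" is used only to avoid the lxml dependency.

```python
from io import StringIO

import pandas as pd

# Hypothetical document: the rows we want sit inside <students>,
# not directly below the root element.
xml_text = """<school>
    <info city="Springfield"/>
    <students>
        <student name="Alice" major="Computer Science" age="20"/>
        <student name="Bob" major="Philosophy" age="22"/>
        <student name="Mary" major="Biology" age="21"/>
    </students>
</school>"""

# ".//student" matches every <student> element at any depth,
# so the <info> element is ignored.
df_nested = pd.read_xml(StringIO(xml_text), xpath=".//student", parser="etree")

print(df_nested)
```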

Rendering a Pandas data frame to an XML object

Now that we have seen how to read in an XML object and create a Pandas data frame from it, we will learn how to go the other way: converting a Pandas data frame into an XML object using the Pandas function to_xml(). You can find the parameters for the to_xml() function in the official documentation.

To achieve that, we will use the data frame that we have created in the sections before:

print(df)

	name	major	age
0	Alice	Computer Science	20
1	Bob	Philosophy	22
2	Mary	Biology	21

The approach to transform this data frame into an XML object is straightforward:

>>> df.to_xml()
"<?xml version='1.0' encoding='utf-8'?>\n<data>\n  <row>\n    <index>0</index>\n    <name>Alice</name>\n    <major>Computer Science</major>\n    <age>20</age>\n  </row>\n  <row>\n    <index>1</index>\n    <name>Bob</name>\n    <major>Philosophy</major>\n    <age>22</age>\n  </row>\n  <row>\n    <index>2</index>\n    <name>Mary</name>\n    <major>Biology</major>\n    <age>21</age>\n  </row>\n</data>"

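All we do is call to_xml() on our data frame “df”. Without a file path, it returns the XML as a string, and the interactive shell displays that string with escaped newlines (“\n”), which is hard to read. We can fix this by passing the result to print():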

print(df.to_xml())

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </row>
  <row>
    <index>1</index>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </row>
  <row>
    <index>2</index>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </row>
</data>

This way, we get a clear output. The XML data looks almost like the initial XML document. There are a few differences though:

Firstly, we do not have “student” tags as we had before. That’s because the data frame does not contain the word “student”. Instead, Pandas gives each row a “row” tag. Secondly, compared to the initial XML document, each student gets an “index” tag because to_xml() writes the data frame’s index by default.

We can remove these differences with two parameters that the to_xml() function provides. The “row_name” parameter determines the tag name of each row. As we have seen, its default value is “row”. Furthermore, we set the “index” parameter to “False”, so that the index is not written to the XML object:

print(df.to_xml(row_name="student", index=False))

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>

This way, the XML object looks like the initial one.
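
As a quick check, the following sketch writes a data frame to XML and reads it back. It also sets the “root_name” parameter of to_xml(), whose default is “data”, to show that the root tag can be renamed as well. The data frame is rebuilt by hand here so that the example is self-contained, the variable names are our own, and parser="etree" is our choice to avoid the lxml dependency.

```python
from io import StringIO

import pandas as pd

df_students = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

# Rename both the root element and the row elements, and drop the index.
xml_out = df_students.to_xml(
    root_name="students", row_name="student", index=False, parser="etree"
)
print(xml_out)

# Reading the generated XML back should give the same records.
df_back = pd.read_xml(StringIO(xml_out), parser="etree")
print(df_back.to_dict("records") == df_students.to_dict("records"))
```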

Using the to_xml() function, we can also create the compressed XML structure that we have seen in the previous section:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>

To do so, we use the “attr_cols” parameter, which expects a list of columns to write as attributes of the row element:

print(df.to_xml(attr_cols=["name", "major", "age"],
                index=False, row_name="student"))

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>

We pass “name”, “major”, and “age” as a list to the “attr_cols” parameter. As before, we set “index” to “False” and “row_name” to “student”.

As we can see in the outputted XML data, “name”, “major”, and “age” are the attributes for the respective “student” tags.
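
Attributes and child elements can also be mixed. The sketch below writes “name” as an attribute and the remaining columns as child elements by combining “attr_cols” with the “elem_cols” parameter of to_xml(). The data frame is rebuilt by hand, the split between attributes and elements is an arbitrary choice for illustration, and parser="etree" is used only to avoid the lxml dependency.

```python
from io import StringIO

import pandas as pd

df_students = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

# "name" becomes an attribute of <student>; "major" and "age"
# are written as child elements.
xml_mixed = df_students.to_xml(
    attr_cols=["name"],
    elem_cols=["major", "age"],
    row_name="student",
    index=False,
    parser="etree",
)
print(xml_mixed)

# read_xml() collects attributes and child elements of each row by default.
df_mixed_back = pd.read_xml(StringIO(xml_mixed), parser="etree")
print(df_mixed_back)
```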

Writing an XML object to an XML file

In the last section, we learned how to convert a Pandas data frame into an XML object. Next, we will see how to write this XML object to its own, separate file:

data = df.to_xml(row_name="student", index=False)

# Write with UTF-8 so the file matches the encoding declared in the XML prolog
with open("new_xml.xml", "w", encoding="utf-8") as file:
    file.write(data)

First, we render the data frame to an XML object, just like we did before. But this time, we do not print it out, but assign it to the variable “data”.

Then, we use the “with” statement together with open() to create the XML file, which we call “new_xml.xml”. The “.xml” file extension is a convention that tells other programs and readers that the file contains XML; Python itself writes whatever string we give it. We write the XML string stored in the “data” variable into this newly created file. This code does not produce any output. Instead, a new file is created in the current working directory.

The new file looks like this:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>
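
As an alternative to opening the file ourselves, to_xml() also accepts a file path as its first argument and then writes the file directly. The sketch below shows this under a few assumptions of our own: the data frame is rebuilt by hand, the file goes into a temporary directory so the example cleans up after itself, and parser="etree" is used to avoid the lxml dependency.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_students = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "new_xml.xml"

    # With a path as the first argument, to_xml() writes the file itself
    # and returns None instead of a string.
    result = df_students.to_xml(
        out_path, row_name="student", index=False, parser="etree"
    )

    # Read the file back as text to inspect what was written.
    written = out_path.read_text(encoding="utf-8")

print(result)
print(written)
```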

Summary

In this tutorial, we have learned how to work with XML documents in Pandas. We have learned how to read in differently structured XML documents and how to transform them into Pandas data frames. Moreover, we have seen how to convert data frames into XML documents and how to write them to separate files.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!
