Reading and Writing XML with Pandas

In this tutorial, we will learn how to read XML documents into a Pandas data frame using the read_xml() function and how to render a data frame into an XML object with the to_xml() function. Being able to work with XML documents in Pandas is very useful since we often find data stored in the XML format, especially when working with web data.

What is XML?

Before we get started working with XML documents, let’s first clarify what XML is. The term β€œXML” stands for β€œextensible markup language”, so it is a markup language, just like HTML. It was designed to store data and transport it. The differences to HTML are that XML was designed to carry data, whereas HTML was designed to display the data. Furthermore, unlike HTML tags, XML tags are not predefined.

Let’s have a look at an XML document:

<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>

This document contains hierarchical information about student data. In the first line, we have the XML prolog which defines the XML version and the character encoding. After that comes the β€œdata” tag which is the root element of the document and wraps the information about the students. The β€œstudent” tags are the children of the β€œdata” tag. For each student, we get a β€œname”, β€œmajor”, and β€œage” tag respectively. Note that the tag names here are defined by the author of the document. These names are not any XML standard names.

Converting an XML document into a Pandas data frame

In this section, we will learn how to read in XML documents using the read_xml() function and how to convert these XML documents into Pandas data frames. You can find the parameters for the read_xml() function in the official documentation.

We will start with the example XML document from the last section which is contained in a separate file:

import pandas as pd
df = pd.read_xml("C:\Projects\Finxter articles example code\example.xml")
print(df)
namemajorage
0AliceComputer Science20
1BobPhilosophy22
2MaryBiology21

First, we import the Pandas library. Then, we create a Pandas data frame and assign it to the variable β€œdf”. We do this by applying the read_xml() function in which we put in the path of the XML file as a string. Finally, we output β€œdf” and get a typical Pandas data frame.

By default, the read_xml() function detects which tags to include in the data frame. Although the content in the XML file is wrapped in a β€œdata” tag and each student’s information is wrapped in a respective β€œstudent” tag, the outputted data frame neither contains the β€œdata” tag, nor any β€œstudent” tag. That’s because the read_xml() function only applies the tags that contain actual information, namely the β€œname”, β€œmajor”, and β€œage” tags.

The XML document we imported here came from a file on our computer. We could also put in a URL here to import an XML file from the web.

Apart from a separate file, we might also find our XML data assigned to a string in the same folder as our code:

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student>
        <name>Alice</name>
        <major>Computer Science</major>
        <age>20</age>
    </student>
    <student>
        <name>Bob</name>
        <major>Philosophy</major>
        <age>22</age>
    </student>
    <student>
        <name>Mary</name>
        <major>Biology</major>
        <age>21</age>
    </student>
</data>"""

Here, we have the same XML data as before but this time it is contained inside a string and is assigned to the variable β€œxml”. To read in this XML data, we simply do the following:

df = pd.read_xml(xml)
print(df)
namemajorage
0AliceComputer Science20
1BobPhilosophy22
2MaryBiology21

Instead of a path, we put in the variable β€œxml” inside the read_xml() function because it contains the XML data as a string.

Alternative Structure of an XML Object

Not every XML document is suitable to be transformed into a Pandas data frame. And the ones that are suitable, are not all structured in the same way. In this section, we will have a look at an alternative structure of an XML object that we want to convert into a Pandas data frame applying the β€œxpath” parameter that the read_xml() function provides us with.

Let’s have a look at the following XML data assigned as a string to the variable β€œxml”:

xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
    <student name = "Alice" major = "Computer Science" age = "20"/>
    <student name = "Bob" major = "Philosophy" age = "22"/>
    <student name = "Mary" major = "Biology" age = "21"/>
</data>"""

This XML data contains the same information as the one we have seen above but in a more compressed way. Like before, we have the β€œdata” tag that wraps around our actual information. But unlike before, every student’s information is combined in one tag respectively. β€œstudent” is the name of the element here, whereas β€œname”, β€œmajor”, and β€œage” are the element’s attributes.

To read this XML data in properly, we do the following:

df = pd.read_xml(xml, xpath=".//student")
print(df)
namemajorage
0AliceComputer Science20
1BobPhilosophy22
2MaryBiology21

This time, we use the β€œxpath” parameter and assign it the string β€œ.//student”. In this file structure, the β€œxpath” parameter expects the name of the element which is β€œstudent” in this case. The outputted data frame shows the attribute labels as the column names and the respective attribute’s values as the values of the data frame.

Rendering a Pandas data frame to an XML object

Now that we have seen how to read in an XML object and create a Pandas data frame from it, we will now learn how to perform the other way around: Converting a Pandas data frame into an XML object using the Pandas function to_xml(). You can find the parameters for the to_xml() function in the official documentation.

To achieve that, we will use the data frame that we have created in the sections before:

print(df)
namemajorage
0AliceComputer Science20
1BobPhilosophy22
2MaryBiology21

The approach to transform this data frame into an XML object is straightforward:

>>> df.to_xml()
"<?xml version='1.0' encoding='utf-8'?>\n<data>\n <row>\n <index>0</index>\n <name>Alice</name>\n <major>Computer Science</major>\n <age>20</age>\n </row>\n <row>\n <index>1</index>\n <name>Bob</name>\n <major>Philosophy</major>\n <age>22</age>\n </row>\n <row>\n <index>2</index>\n <name>Mary</name>\n <major>Biology</major>\n <age>21</age>\n </row>\n</data>"

All we do is apply the to_xml() function to our data frame β€œdf”. However, the output is a bit messy. We can fix this by adding a print() statement:

print(df.to_xml())

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </row>
  <row>
    <index>1</index>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </row>
  <row>
    <index>2</index>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </row>
</data>

This way, we get a clear output. The XML data looks almost like the initial XML document. There are a few differences though:

Firstly, we do not have β€œstudent” tags as we had before. That’s because the data frame does not contain the word β€œstudent”. Instead, Pandas gives each row a β€œrow” tag. Secondly, compared to the initial XML document, each student gets an β€œindex” tag because the data frame contains indexes.

We can change these differences by applying two parameters that the to_xml() function provides us with. The β€œrow_name” parameter determines how to call each row. As we have seen, the default value here is β€œrow”. Furthermore, we apply the β€œindex” parameter and set it to β€œFalse”, so we do not get the indexes inside our XML object:

print(df.to_xml(row_name = "student", index=False))

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>

This way, the XML object looks like the initial one.

Using the to_xml() function, we can also create the compressed XML structure that we have seen in the previous section:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>

Therefore, we apply the β€œattr_cols” parameter that expects a list of columns to write as attributes in the row element.

print(df.to_xml(attr_cols=["name", "major", "age"], 
                index=False, row_name = "student"))

Output:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>

We apply β€œname”, β€œmajor”, and β€œage” as the attributes to the β€œattr_cols” parameter. And as before, we set β€œindex” to β€œFalse” and apply β€œstudent” to the β€œrow_name” parameter.

As we can see in the outputted XML data, β€œname”, β€œmajor”, and β€œage” are the attributes for the respective β€œstudent” tags.

Writing an XML object to an XML file

In the last section, we have learned how to convert a Pandas data frame into an XML object. In the next step, we will see how to write this XML object to its own, separate file:

data = df.to_xml(row_name = "student", index=False)

with open("new_xml.xml", "w") as file:
    file.write(data)

First, we render the data frame to an XML object, just like we did before. But this time, we do not print it out, but assign it to the variable β€œdata”.

Then, we use the β€œwith” statement to create the XML file. The new file gets called β€œnew_xml.xml”. The file extension β€œ.xml” is essential here to state that we want to create an XML file. We write the XML object into this newly created file using the β€œdata” variable containing the XML data. This code does not produce an output. Instead, a new file gets created in the current working directory.

The new file looks like this:

<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>

Summary

In this tutorial, we have learned how to work with XML documents in Pandas. We have learned how to read in different structured XML documents and how to transform them into Pandas data frames. Moreover, we have seen how to convert data frames into XML documents and how to write them into separate files.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!

Programmer Humor

There are only 10 kinds of people in this world: those who know binary and those who don’t.
πŸ‘©πŸ§”β€β™‚οΈ
~~~

There are 10 types of people in the world. Those who understand trinary, those who don’t, and those who mistake it for binary.

πŸ‘©πŸ§”β€β™‚οΈπŸ‘±β€β™€οΈ