In this tutorial, we will learn how to read XML documents into a Pandas data frame using the read_xml() function and how to render a data frame into an XML object with the to_xml() function. Being able to work with XML documents in Pandas is very useful since we often find data stored in the XML format, especially when working with web data.
What is XML?
Before we get started working with XML documents, letβs first clarify what XML is. The term βXMLβ stands for βextensible markup languageβ, so it is a markup language, just like HTML. It was designed to store data and transport it. The differences to HTML are that XML was designed to carry data, whereas HTML was designed to display the data. Furthermore, unlike HTML tags, XML tags are not predefined.
β₯οΈ Info: Are you AI curious but you still have to create real impactful projects? Join our official AI builder club on Skool (only $5): SHIP! - One Project Per Month
Letβs have a look at an XML document:
<?xml version='1.0' encoding='utf-8'?>
<data>
<student>
<name>Alice</name>
<major>Computer Science</major>
<age>20</age>
</student>
<student>
<name>Bob</name>
<major>Philosophy</major>
<age>22</age>
</student>
<student>
<name>Mary</name>
<major>Biology</major>
<age>21</age>
</student>
</data>
This document contains hierarchical information about student data. In the first line, we have the XML prolog which defines the XML version and the character encoding. After that comes the βdataβ tag which is the root element of the document and wraps the information about the students. The βstudentβ tags are the children of the βdataβ tag. For each student, we get a βnameβ, βmajorβ, and βageβ tag respectively. Note that the tag names here are defined by the author of the document. These names are not any XML standard names.
Converting an XML document into a Pandas data frame
In this section, we will learn how to read in XML documents using the read_xml() function and how to convert these XML documents into Pandas data frames. You can find the parameters for the read_xml() function in the official documentation.
We will start with the example XML document from the last section which is contained in a separate file:
import pandas as pd
df = pd.read_xml("C:\Projects\Finxter articles example code\example.xml")
print(df)| name | major | age | |
| 0 | Alice | Computer Science | 20 |
| 1 | Bob | Philosophy | 22 |
| 2 | Mary | Biology | 21 |
First, we import the Pandas library. Then, we create a Pandas data frame and assign it to the variable βdfβ. We do this by applying the read_xml() function in which we put in the path of the XML file as a string. Finally, we output βdfβ and get a typical Pandas data frame.
By default, the read_xml() function detects which tags to include in the data frame. Although the content in the XML file is wrapped in a βdataβ tag and each studentβs information is wrapped in a respective βstudentβ tag, the outputted data frame neither contains the βdataβ tag, nor any βstudentβ tag. Thatβs because the read_xml() function only applies the tags that contain actual information, namely the βnameβ, βmajorβ, and βageβ tags.
The XML document we imported here came from a file on our computer. We could also put in a URL here to import an XML file from the web.
Apart from a separate file, we might also find our XML data assigned to a string in the same folder as our code:
xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
<student>
<name>Alice</name>
<major>Computer Science</major>
<age>20</age>
</student>
<student>
<name>Bob</name>
<major>Philosophy</major>
<age>22</age>
</student>
<student>
<name>Mary</name>
<major>Biology</major>
<age>21</age>
</student>
</data>"""
Here, we have the same XML data as before but this time it is contained inside a string and is assigned to the variable βxmlβ. To read in this XML data, we simply do the following:
df = pd.read_xml(xml) print(df)
| name | major | age | |
| 0 | Alice | Computer Science | 20 |
| 1 | Bob | Philosophy | 22 |
| 2 | Mary | Biology | 21 |
Instead of a path, we put in the variable βxmlβ inside the read_xml() function because it contains the XML data as a string.
Alternative Structure of an XML Object
Not every XML document is suitable to be transformed into a Pandas data frame. And the ones that are suitable, are not all structured in the same way. In this section, we will have a look at an alternative structure of an XML object that we want to convert into a Pandas data frame applying the βxpathβ parameter that the read_xml() function provides us with.
Letβs have a look at the following XML data assigned as a string to the variable βxmlβ:
xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
<student name = "Alice" major = "Computer Science" age = "20"/>
<student name = "Bob" major = "Philosophy" age = "22"/>
<student name = "Mary" major = "Biology" age = "21"/>
</data>"""
This XML data contains the same information as the one we have seen above but in a more compressed way. Like before, we have the βdataβ tag that wraps around our actual information. But unlike before, every studentβs information is combined in one tag respectively. βstudentβ is the name of the element here, whereas βnameβ, βmajorβ, and βageβ are the elementβs attributes.
To read this XML data in properly, we do the following:
df = pd.read_xml(xml, xpath=".//student") print(df)
| name | major | age | |
| 0 | Alice | Computer Science | 20 |
| 1 | Bob | Philosophy | 22 |
| 2 | Mary | Biology | 21 |
This time, we use the βxpathβ parameter and assign it the string β.//studentβ. In this file structure, the βxpathβ parameter expects the name of the element which is βstudentβ in this case. The outputted data frame shows the attribute labels as the column names and the respective attributeβs values as the values of the data frame.
Rendering a Pandas data frame to an XML object
Now that we have seen how to read in an XML object and create a Pandas data frame from it, we will now learn how to perform the other way around: Converting a Pandas data frame into an XML object using the Pandas function to_xml(). You can find the parameters for the to_xml() function in the official documentation.
To achieve that, we will use the data frame that we have created in the sections before:
print(df)
| name | major | age | |
| 0 | Alice | Computer Science | 20 |
| 1 | Bob | Philosophy | 22 |
| 2 | Mary | Biology | 21 |
The approach to transform this data frame into an XML object is straightforward:
>>> df.to_xml() "<?xml version='1.0' encoding='utf-8'?>\n<data>\n <row>\n <index>0</index>\n <name>Alice</name>\n <major>Computer Science</major>\n <age>20</age>\n </row>\n <row>\n <index>1</index>\n <name>Bob</name>\n <major>Philosophy</major>\n <age>22</age>\n </row>\n <row>\n <index>2</index>\n <name>Mary</name>\n <major>Biology</major>\n <age>21</age>\n </row>\n</data>"
All we do is apply the to_xml() function to our data frame βdfβ. However, the output is a bit messy. We can fix this by adding a print() statement:
print(df.to_xml())
Output:
<?xml version='1.0' encoding='utf-8'?>
<data>
<row>
<index>0</index>
<name>Alice</name>
<major>Computer Science</major>
<age>20</age>
</row>
<row>
<index>1</index>
<name>Bob</name>
<major>Philosophy</major>
<age>22</age>
</row>
<row>
<index>2</index>
<name>Mary</name>
<major>Biology</major>
<age>21</age>
</row>
</data>
This way, we get a clear output. The XML data looks almost like the initial XML document. There are a few differences though:
Firstly, we do not have βstudentβ tags as we had before. Thatβs because the data frame does not contain the word βstudentβ. Instead, Pandas gives each row a βrowβ tag. Secondly, compared to the initial XML document, each student gets an βindexβ tag because the data frame contains indexes.
We can change these differences by applying two parameters that the to_xml() function provides us with. The βrow_nameβ parameter determines how to call each row. As we have seen, the default value here is βrowβ. Furthermore, we apply the βindexβ parameter and set it to βFalseβ, so we do not get the indexes inside our XML object:
print(df.to_xml(row_name = "student", index=False))
Output:
<?xml version='1.0' encoding='utf-8'?>
<data>
<student>
<name>Alice</name>
<major>Computer Science</major>
<age>20</age>
</student>
<student>
<name>Bob</name>
<major>Philosophy</major>
<age>22</age>
</student>
<student>
<name>Mary</name>
<major>Biology</major>
<age>21</age>
</student>
</data>
This way, the XML object looks like the initial one.
Using the to_xml() function, we can also create the compressed XML structure that we have seen in the previous section:
<?xml version='1.0' encoding='utf-8'?> <data> <student name="Alice" major="Computer Science" age="20"/> <student name="Bob" major="Philosophy" age="22"/> <student name="Mary" major="Biology" age="21"/> </data>
Therefore, we apply the βattr_colsβ parameter that expects a list of columns to write as attributes in the row element.
print(df.to_xml(attr_cols=["name", "major", "age"],
index=False, row_name = "student"))
Output:
<?xml version='1.0' encoding='utf-8'?> <data> <student name="Alice" major="Computer Science" age="20"/> <student name="Bob" major="Philosophy" age="22"/> <student name="Mary" major="Biology" age="21"/> </data>
We apply βnameβ, βmajorβ, and βageβ as the attributes to the βattr_colsβ parameter. And as before, we set βindexβ to βFalseβ and apply βstudentβ to the βrow_nameβ parameter.
As we can see in the outputted XML data, βnameβ, βmajorβ, and βageβ are the attributes for the respective βstudentβ tags.
Writing an XML object to an XML file
In the last section, we have learned how to convert a Pandas data frame into an XML object. In the next step, we will see how to write this XML object to its own, separate file:
data = df.to_xml(row_name = "student", index=False)
with open("new_xml.xml", "w") as file:
file.write(data)
First, we render the data frame to an XML object, just like we did before. But this time, we do not print it out, but assign it to the variable βdataβ.
Then, we use the βwithβ statement to create the XML file. The new file gets called βnew_xml.xmlβ. The file extension β.xmlβ is essential here to state that we want to create an XML file. We write the XML object into this newly created file using the βdataβ variable containing the XML data. This code does not produce an output. Instead, a new file gets created in the current working directory.
The new file looks like this:
<?xml version='1.0' encoding='utf-8'?>
<data>
<student>
<name>Alice</name>
<major>Computer Science</major>
<age>20</age>
</student>
<student>
<name>Bob</name>
<major>Philosophy</major>
<age>22</age>
</student>
<student>
<name>Mary</name>
<major>Biology</major>
<age>21</age>
</student>
</data>
Summary
In this tutorial, we have learned how to work with XML documents in Pandas. We have learned how to read in different structured XML documents and how to transform them into Pandas data frames. Moreover, we have seen how to convert data frames into XML documents and how to write them into separate files.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!
Programmer Humor
There are only 10 kinds of people in this world: those who know binary and those who donβt.
π©π§ββοΈ
~~~
There are 10 types of people in the world. Those who understand trinary, those who donβt, and those who mistake it for binary.
π©π§ββοΈπ±ββοΈ