In this tutorial, we will learn how to read XML documents into a Pandas data frame using the read_xml() function and how to render a data frame into an XML object with the to_xml() function. Being able to work with XML documents in Pandas is very useful since we often find data stored in the XML format, especially when working with web data.
What is XML?
Before we start working with XML documents, let's first clarify what XML is. The term "XML" stands for "extensible markup language", so it is a markup language, just like HTML. The key difference is that XML was designed to store and transport data, whereas HTML was designed to display it. Furthermore, unlike HTML tags, XML tags are not predefined.
Let's have a look at an XML document:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>
This document contains hierarchical information about students. The first line is the XML prolog, which defines the XML version and the character encoding. After that comes the "data" tag, which is the root element of the document and wraps the information about the students. The "student" tags are the children of the "data" tag. Each student has a "name", a "major", and an "age" tag. Note that the tag names here are chosen by the author of the document; they are not predefined by the XML standard.
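The hierarchy described above can be explored with a few lines of Python. This is a minimal sketch that uses the standard library module xml.etree.ElementTree rather than Pandas; the variable names are chosen for illustration, and the prolog is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The example document from above, inlined so the sketch is self-contained.
xml_text = """<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>"""

root = ET.fromstring(xml_text)
print(root.tag)  # tag name of the root element

students = []
for student in root:  # iterate over the direct children of the root
    # Map each child tag name to its text content.
    record = {child.tag: child.text for child in student}
    students.append(record)
    print(student.tag, record)
```

Note that every value is still a string at this point. Converting "age" to a number is one of the things that read_xml() takes care of for us.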
Converting an XML document into a Pandas data frame
In this section, we will learn how to read in XML documents using the read_xml() function and how to convert these XML documents into Pandas data frames. You can find the parameters for the read_xml() function in the official documentation.
We will start with the example XML document from the last section which is contained in a separate file:
import pandas as pd

# Raw string, so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_xml(r"C:\Projects\Finxter articles example code\example.xml")
print(df)
    name             major  age
0  Alice  Computer Science   20
1    Bob        Philosophy   22
2   Mary           Biology   21
First, we import the Pandas library. Then, we create a Pandas data frame and assign it to the variable "df". We do this by calling the read_xml() function and passing the path of the XML file as a string. Finally, we print "df" and get a typical Pandas data frame.
By default, the read_xml() function works out which tags to include in the data frame. Although the content of the XML file is wrapped in a "data" tag and each student's information is wrapped in a "student" tag, the resulting data frame contains neither the "data" tag nor any "student" tag. That's because read_xml() treats each child of the root element as one row and only turns the tags that hold actual values, namely "name", "major", and "age", into columns.
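To make the default behavior concrete, here is a minimal sketch that passes the XPath "./*" explicitly. The Pandas documentation lists this as the default value of the "xpath" parameter: it selects the direct children of the root element as rows. Two things here are assumptions and not part of the example above: the XML is inlined through io.StringIO so the sketch is self-contained, and parser="etree" is set so that it runs without the lxml package.

```python
from io import StringIO

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>"""

# Rely on the default XPath.
df_default = pd.read_xml(StringIO(xml_text), parser="etree")

# Pass the documented default explicitly: each direct child of the root is one row.
df_explicit = pd.read_xml(StringIO(xml_text), xpath="./*", parser="etree")

print(list(df_default.columns))
print(df_default.equals(df_explicit))
```

Leaving out parser="etree" falls back to the default lxml parser, which is what the example above uses.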
The XML document we imported here came from a file on our computer. We could also put in a URL here to import an XML file from the web.
Apart from a separate file, we might also have our XML data assigned to a string directly in our code:
xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>"""
Here, we have the same XML data as before, but this time it is contained in a string and assigned to the variable "xml". To read in this XML data, we simply do the following:
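from io import StringIO

# pandas 2.1 and later deprecate passing a literal XML string, so we wrap it in StringIO
df = pd.read_xml(StringIO(xml))
print(df)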
    name             major  age
0  Alice  Computer Science   20
1    Bob        Philosophy   22
2   Mary           Biology   21
Instead of a path, we pass the XML data stored in the variable "xml" to the read_xml() function.
Alternative Structure of an XML Object
Not every XML document is suitable to be transformed into a Pandas data frame, and the ones that are suitable are not all structured in the same way. In this section, we will look at an alternative structure of an XML object that we want to convert into a Pandas data frame using the "xpath" parameter of the read_xml() function.
Let's have a look at the following XML data assigned as a string to the variable "xml":
xml = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>"""
This XML data contains the same information as the one we have seen above, but in a more compressed form. Like before, the "data" tag wraps our actual information. But unlike before, each student's information is combined in a single tag. "student" is the name of the element here, whereas "name", "major", and "age" are the element's attributes.
To read this XML data in properly, we do the following:
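from io import StringIO

# pandas 2.1 and later deprecate passing a literal XML string, so we wrap it in StringIO
df = pd.read_xml(StringIO(xml), xpath=".//student")
print(df)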
    name             major  age
0  Alice  Computer Science   20
1    Bob        Philosophy   22
2   Mary           Biology   21
This time, we use the "xpath" parameter and assign it the string ".//student". This XPath expression selects every "student" element in the document, and each selected element becomes one row. The resulting data frame uses the attribute names as column names and the attribute values as the cell values.
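The "xpath" parameter can also narrow down which elements become rows. The following sketch, under the assumption that we only want the Biology students, adds an attribute predicate to the expression. The XML is inlined through io.StringIO so the sketch is self-contained, and parser="etree" is an assumption made so that it runs without lxml; the etree parser only supports a limited subset of XPath, which includes this kind of predicate.

```python
from io import StringIO

import pandas as pd

xml_text = """<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>"""

# Keep only the "student" elements whose "major" attribute equals "Biology".
biology = pd.read_xml(
    StringIO(xml_text),
    xpath=".//student[@major='Biology']",
    parser="etree",
)
print(biology)
```

Only the matching element is read, so the resulting data frame has a single row for Mary.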
Rendering a Pandas data frame to an XML object
Now that we have seen how to read in an XML object and create a Pandas data frame from it, we will learn how to go the other way: converting a Pandas data frame into an XML object using the Pandas function to_xml(). You can find the parameters for the to_xml() function in the official documentation.
To achieve that, we will use the data frame that we have created in the sections before:
print(df)
    name             major  age
0  Alice  Computer Science   20
1    Bob        Philosophy   22
2   Mary           Biology   21
The approach to transform this data frame into an XML object is straightforward:
>>> df.to_xml()
"<?xml version='1.0' encoding='utf-8'?>\n<data>\n  <row>\n    <index>0</index>\n    <name>Alice</name>\n    <major>Computer Science</major>\n    <age>20</age>\n  </row>\n  <row>\n    <index>1</index>\n    <name>Bob</name>\n    <major>Philosophy</major>\n    <age>22</age>\n  </row>\n  <row>\n    <index>2</index>\n    <name>Mary</name>\n    <major>Biology</major>\n    <age>21</age>\n  </row>\n</data>"
All we do is call the to_xml() function on our data frame "df". However, the output is a bit messy. We can fix this by adding a print() statement:
print(df.to_xml())
Output:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </row>
  <row>
    <index>1</index>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </row>
  <row>
    <index>2</index>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </row>
</data>
This way, we get a clear output. The XML data looks almost like the initial XML document. There are a few differences though:
Firstly, we do not have "student" tags as we had before. That's because the data frame does not contain the word "student". Instead, Pandas gives each row a "row" tag. Secondly, compared to the initial XML document, each student gets an "index" tag because the data frame has an index.
We can remove these differences with two parameters of the to_xml() function. The "row_name" parameter determines the tag name used for each row. As we have seen, the default value is "row". Furthermore, we set the "index" parameter to "False", so the index is not written to our XML object:
print(df.to_xml(row_name = "student", index=False))
Output:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>
This way, the XML object looks like the initial one.
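One way to check that nothing got lost is a round trip: render the data frame to XML and read the result back in. The following is a minimal sketch of that idea. The data frame is rebuilt by hand so the block is self-contained, and parser="etree" is an assumption made so that it runs without lxml.

```python
from io import StringIO

import pandas as pd

# The student data from this tutorial, rebuilt by hand.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

# Data frame -> XML string, with "student" rows and no index.
xml_out = df.to_xml(row_name="student", index=False, parser="etree")

# XML string -> data frame again.
restored = pd.read_xml(StringIO(xml_out), parser="etree")

# Compare the two data frames row by row.
print(restored.to_dict("records") == df.to_dict("records"))
```

If the "index" parameter were left at its default, the restored data frame would carry an extra "index" column and the comparison would fail.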
Using the to_xml() function, we can also create the compressed XML structure that we have seen in the previous section:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>
To do so, we use the "attr_cols" parameter, which expects a list of columns to write as attributes in the row element.
print(df.to_xml(attr_cols=["name", "major", "age"], index=False, row_name = "student"))
Output:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <student name="Alice" major="Computer Science" age="20"/>
  <student name="Bob" major="Philosophy" age="22"/>
  <student name="Mary" major="Biology" age="21"/>
</data>
We pass "name", "major", and "age" to the "attr_cols" parameter. And as before, we set "index" to "False" and "row_name" to "student".
As we can see in the resulting XML data, "name", "major", and "age" are the attributes of the respective "student" tags.
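Attributes and child elements can also be mixed. The to_xml() function has an "elem_cols" parameter, documented as the list of columns to write as child elements of the row element. The following sketch combines it with "attr_cols"; the choice of which column goes where is arbitrary and only for illustration, and parser="etree" is an assumption made so that the block runs without lxml. The result is parsed with the standard library to inspect the first row.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# The student data from this tutorial, rebuilt by hand.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

# "name" becomes an attribute, "major" and "age" become child elements.
mixed = df.to_xml(
    attr_cols=["name"],
    elem_cols=["major", "age"],
    row_name="student",
    index=False,
    parser="etree",
)
print(mixed)

# Parse the result to inspect the first "student" element.
root = ET.fromstring(mixed.encode("utf-8"))
first = root[0]
print(first.attrib, [child.tag for child in first])
```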
Writing an XML object to an XML file
In the last section, we have learned how to convert a Pandas data frame into an XML object. In the next step, we will see how to write this XML object to its own, separate file:
data = df.to_xml(row_name="student", index=False)

# Write with the same encoding that the XML declaration announces
with open("new_xml.xml", "w", encoding="utf-8") as file:
    file.write(data)
First, we render the data frame to an XML object, just like we did before. But this time, we do not print it; instead, we assign it to the variable "data".
Then, we use the "with" statement to create the XML file. The new file is called "new_xml.xml". The file extension ".xml" marks the file as an XML file by convention; it does not affect what gets written. We write the XML object into this newly created file using the "data" variable containing the XML data. This code does not produce any output. Instead, a new file gets created in the current working directory.
The new file looks like this:
<?xml version='1.0' encoding='utf-8'?>
<data>
  <student>
    <name>Alice</name>
    <major>Computer Science</major>
    <age>20</age>
  </student>
  <student>
    <name>Bob</name>
    <major>Philosophy</major>
    <age>22</age>
  </student>
  <student>
    <name>Mary</name>
    <major>Biology</major>
    <age>21</age>
  </student>
</data>
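As an alternative to opening the file ourselves, to_xml() accepts a path as its first argument ("path_or_buffer") and then writes the file directly. Here is a minimal sketch of that variant. The temporary directory is only there to keep the example from leaving files behind, and parser="etree" is an assumption made so that the block runs without lxml.

```python
import tempfile
from pathlib import Path

import pandas as pd

# The student data from this tutorial, rebuilt by hand.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Mary"],
        "major": ["Computer Science", "Philosophy", "Biology"],
        "age": [20, 22, 21],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "new_xml.xml"

    # With a path given, to_xml() writes the file itself and returns None.
    result = df.to_xml(out_path, row_name="student", index=False, parser="etree")

    # Read the file back, once as plain text and once as a data frame.
    written = out_path.read_text(encoding="utf-8")
    restored = pd.read_xml(out_path, parser="etree")

print(result)
print(restored)
```

This saves the intermediate "data" variable and leaves the file encoding to Pandas.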
Summary
In this tutorial, we have learned how to work with XML documents in Pandas. We have learned how to read in differently structured XML documents and how to transform them into Pandas data frames. Moreover, we have seen how to convert data frames into XML documents and how to write them into separate files.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!