Read and Write Flat Files with Pandas

5/5 - (1 vote)

In this tutorial, we will learn how to read and write flat files with Pandas. We will learn what a flat file is, how to read a flat file using the Pandas functions read_csv() and read_table(), how these two functions differ from each other, and how to write a data frame to a CSV file.

What is a Flat File?

Before we start with the actual coding, let’s first clarify what a flat file is.

A flat file is a text file that stores data and has no hierarchical structure. It is composed of one data set and has no text processing or structure markup. Typical formats for flat files are, for example, ".csv", ".txt", and ".xml".

For this tutorial, we will use two flat files. A CSV file (".csv") and a text file (".txt") which both contain the same information. The only difference is the file extension.

This is how they both look like:

Name;Age;Children
Tom;21;1
Mary;30;2
Jane;29;0

The files contain information about three people, their age, and the number of children they have.

Read a Flat file with read_csv() and read_table()

Now, we head over to the coding section. We will start by simply reading the two files by using the read_csv() function and the read_table() function.

Reference: You can read more about these functions in the official Pandas documentation. Here you find the read_table() documentation, and here the read_csv() documentation.

Let’s start with read_table():

import pandas as pd
pd.read_table('example.csv')
Name;Age;Children
0Tom;21;1
1Mary;30;2
2Jane;29;0

First, we import the Pandas library. Then we read the CSV file by applying the read_table() function. Inside the read_table() function we pass in the name of the file which is "example.csv". We do not have to set the path of the file here since it is stored in the same working directory as our code. If, however, we choose a file that is stored somewhere else on our computer, we have to add the path of the file.

The output is a bit messy. It is already a Pandas data frame, but we still have the semicolons separating our data. We can change that by telling the program which separator to look for. This is achieved by setting the sep parameter of the read_table() function to ;:

pd.read_table('example.csv', sep=';')
NameAgeChildren
0Tom211
1Mary302
2Jane290

Now, the data frame has a better structure.

Let’s read the same file using the read_csv() function:

pd.read_csv('example.csv')
Name;Age;Children
0Tom;21;1
1Mary;30;2
2Jane;29;0

We get the same messy output as we got initially with the read_table() function. So again, we change the sep parameter:

pd.read_csv('example.csv', sep=';')
NameAgeChildren
0Tom211
1Mary302
2Jane290

By changing the separator, we get the same data frame as with the read_table() function.

Now we will read the text file which contains the same information as the CSV file.

We start with the read_table() function:

pd.read_table('example.txt', sep=";")
NameAgeChildren
0Tom211
1Mary302
2Jane290

We directly change the separator to “;” to avoid the messy output. The only thing we change here is the file that we read which is now the text file "example.txt".

The outputted data frame is the same as when we read the CSV file because the containing information is the same.

When we read the text file using the read_csv() function, we get the same output:

pd.read_csv('example.txt', sep=";")
NameAgeChildren
0Tom211
1Mary302
2Jane290

The Difference Between read_csv() and read_table()

Looking at the code examples from the previous section creates the impression that the read_csv() function and the read_table() function are the same.

When we look at the documentation of the two functions, we see that these two functions are indeed almost identical. The only difference between them is the default value for the separator parameter sep. With the read_table() function, the sep parameter is set to "\t" (tab-stop) by default, whereas with the read_csv() function, the sep parameter is by default set to ",".

Apart from that, these two functions do not differ from each other.

One might think the read_csv() function is meant to be used for reading CSV files only. But we already saw in the previous section that we can also use it to read other flat file types.

Thus, we can use these two functions in the same way.

The parameters of read_csv()

For the rest of the tutorial, we will use the read_csv() function instead of the read_table() function because it is common sense to use this function primarily. However, everything we will do with the read_csv() function can be conveyed to the read_table() function.

The read_csv() function provides us with various parameters to specify our reading process. By now we learned about the sep parameter which tells the program which separator to look for. We will learn more about the read_csv() function by exploring some other parameters.

We start with the usecols parameter. Here, we specify which columns we want to read from the flat file. We might have a large flat file and we might not want to import all columns. Using this parameter, we can specify which columns to include in our data frame:

pd.read_csv('example.csv', sep=";", usecols=['Name', 'Children'])
NameChildren
0Tom1
1Mary2
2Jane0

The “usecols” parameter expects a list of columns that we want to include in our data frame. Compared to the initial data frame which looks like this:

NameAgeChildren
0Tom211
1Mary302
2Jane290

we observe that the “Age” column is now missing. That’s because we did not apply it to the “usecols” parameter.

Another parameter to apply is the “index_col” parameter. We use this parameter if we want to set a column of our data frame as the new index column. For example, we might want to set the “Name” column as the new index of the data frame:

pd.read_csv('example.csv', sep=";", index_col='Name')
AgeChildren
Name
Tom211
Mary302
Jane290

The “index_col” parameter expects a string, integer, or a list of strings or integers as input. We assign the string “Name” here to state that we want to use this column as the index.

As we can see, the “Name” column is now the index of the data frame, and the integer index, we had before, is gone.

Let’s say, we want to change the column names. We transform the “Name” column to “Person“, the “Age” column to “Years“, and the “Children” column to “Pets“. How do we do that?

First of all, since we already have a header column, we have to get rid of that. We achieve that by setting the “header” parameter to “0”. Additionally, we have to create new column labels. Therefore, we apply the “names” parameter and assign it a list of the new column names ordered from left to right:

pd.read_csv('example.csv', sep=";", 
            header=0, names=['Person', 'Years', 'Pets'])
PersonYearsPets
0Tom211
1Mary302
2Jane290

This way, we get rid of the old headers and create new ones.

Write a data frame to a CSV file

Now that we have seen how to read a CSV file and create a data frame from it, in this section we will learn how to transform a data frame into a CSV file using the to_csv() function.

Reference: You can read more about the to_csv() function in the official Pandas documentation.

Therefore, we will use this data frame:

df = pd.read_csv('example.csv', sep=";")
NameAgeChildren
0Tom211
1Mary302
2Jane290

The approach to writing this data frame to a CSV file is very straightforward:

df.to_csv('new_file.csv')

Inside the function, we input the file name as a string. The file extension ".csv" is very important here to tell the program that the file we are creating is a CSV file.

We do not produce an output here. However, in the working directory, there is now a new file called "new_file.csv" and the file’s content looks like this:

,Name,Age,Children
0,Tom,21,1
1,Mary,30,2
2,Jane,29,0

By default, the values are separated by a comma. We can apply the “sep” parameter to select an alternative separator, for example, the semicolon:

df.to_csv('new_file.csv', sep=';')

Now, the file’s content looks like this:

;Name;Age;Children
0;Tom;21;1
1;Mary;30;2
2;Jane;29;0

So, we successfully transformed the separator to “;“.

Additionally, we might want to get rid of the indexes. Therefore, we set the “index” parameter to “False“:

df.to_csv('new_file.csv', sep=';', index=False)

Thus, the new file now looks like this:

Name;Age;Children
Tom;21;1
Mary;30;2
Jane;29;0

Again, we set the separator to “;“, but we also do not have the indexes anymore.

Another parameter to apply here is the “header” parameter which we can use to change the headers of the CSV file:

df.to_csv('new_file.csv', sep=';', index=False, 
          header=['Person', 'Years', 'Pets'])

This way, we get a new CSV file that looks like this:

Person;Years;Pets
Tom;21;1
Mary;30;2
Jane;29;0

Like before, we set the “index” to “False” and choose the semicolon separator. But this time, we assign the “header” parameter a list of strings with the new header labels.

Alternatively, we can also set the “header” parameter to “0” to get rid of the headers:

df.to_csv('new_file.csv', sep=';', index=False, header=0)

So, the new file has this structure:

Tom;21;1
Mary;30;2
Jane;29;0

Summary

All in all, we learned how to read flat files using two different Pandas functions and apply their various parameters. Additionally, we saw how to write a data frame to its own CSV file in several ways.

For more tutorials about Pandas, other Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!