The Pandas DataFrame filter() Method

In this tutorial, we look at the Pandas filter() method: what it does and how we can apply it to our dataframes. As the name suggests, the filter() method filters our dataframe. To be more specific, the method subsets the rows or columns of our dataframe according to their index or column labels. It does not filter on the values stored in the dataframe.

Filtering by Specific Items

To see how the method works, let’s have a look at an introductory example:

import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}

df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])
df

	height	children	pets
Josh	1.68	1	2
Angela	1.86	3	3
Tom	2.01	0	1
Mary	1.74	2	0

First, we import the libraries we need. In this case, it’s just Pandas. Then we create the sample dataset as a dictionary of lists. The data contains a person’s height, number of children, and number of pets. Next, we create a Pandas dataframe using the dataset and we apply each person’s name as the dataframe index. Finally, we output the dataframe.
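Because filter() works on labels rather than on cell values, it can help to look at the labels of the sample dataframe first. The following is a minimal sketch that rebuilds the dataframe from above; the variable names row_labels and column_labels are only chosen for illustration:

```python
import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}
df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])

# Row labels live in df.index, column labels in df.columns.
# These labels are what filter() matches against.
row_labels = list(df.index)
column_labels = list(df.columns)

print(row_labels)
print(column_labels)
```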

Now, what would we do if we only wanted to see each person’s height and the number of children? We would have to filter out the “pets” column. This is where the Pandas filter() method comes into play:

df.filter(['height', 'children'])

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

So, inside the parentheses of the filter() method, we pass a list of the items we want to keep. In this case, we choose the “height” and “children” columns, so the output shows the dataframe with only these two columns. That way we filtered out the “pets” column.

Another way of filtering by the “height” and “children” column looks like this:

df.filter(items=['height', 'children'])

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

As you can see, the output is the same as before: the dataframe with the “pets” column filtered out. The only difference is that we pass the list of columns by keyword to the “items” parameter, which is also the first positional parameter of the filter() method.
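One detail of the “items” parameter is worth knowing: labels that do not exist in the dataframe are skipped without an error. The sketch below illustrates this with a shortened version of the sample data; the label 'weight' is made up for the example and is not part of the original dataset:

```python
import pandas as pd

df = pd.DataFrame(
    {'height': [1.68, 1.86], 'children': [1, 3], 'pets': [2, 3]},
    index=['Josh', 'Angela'],
)

# 'weight' is a hypothetical label that does not exist in df.
# filter() keeps the labels it finds and silently skips the rest.
result = df.filter(items=['height', 'weight'])

print(list(result.columns))
```

This is convenient for exploratory work, but it also means a typo in a column name does not raise an error, so it is a good idea to check the resulting columns.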

Filtering by Row or Column

By now we have seen how we can filter our dataframe by assigning columns to the “items” parameter. But what if we wanted to filter the dataframe by row? To achieve this, we use the “axis” parameter. Let’s have another look at the dataframe from before:

	height	children	pets
Josh	1.68	1	2
Angela	1.86	3	3
Tom	2.01	0	1
Mary	1.74	2	0

If we only want to see the height, children, and pets from Angela and Tom, the code looks like this:

df.filter(items=['Angela', 'Tom'], axis=0)

	height	children	pets
Angela	1.86	3	3
Tom	2.01	0	1

As previously, we assign the items by which to filter as a list to the “items” parameter. Additionally, we determine the axis to filter on. We assign the value “0” to the “axis” parameter. “0” means we want to filter the dataframe by row. Likewise, we could write “index” instead of “0” and get the same output.

df.filter(items=['Angela', 'Tom'], axis='index')

	height	children	pets
Angela	1.86	3	3
Tom	2.01	0	1

If we apply 1 to the “axis” parameter, we filter the dataframe by column:

df.filter(items=['height', 'children'], axis=1)

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

Instead of 1, we can also apply the string "columns" to the axis parameter:

df.filter(items=['height', 'children'], axis='columns')

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

We note that the output dataframe is the same as the one at the top where we do not assign an “axis” parameter at all. This is because, by default, the Pandas filter() method filters the dataframe by column if we do not assign anything else to the “axis” parameter.

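However, if we want to filter by row and do not set the “axis” parameter, the filter() method looks for “Angela” and “Tom” among the column labels, finds neither, and returns a dataframe that keeps the index but has no columns:

df.filter(items=['Angela', 'Tom'])

Josh
Angela
Tom
Mary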

Consequently, if we filter by row, we have to assign either the value “0” or "index" to the “axis” parameter, whereas if we filter by column, the “axis” parameter is optional.
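The effect of the “axis” parameter can be checked by comparing the shapes of the results. This is a small sketch using the sample dataframe from above; the variable names are only illustrative:

```python
import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}
df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])

# Filtering by row: two matching rows, all three columns kept
by_row = df.filter(items=['Angela', 'Tom'], axis=0)

# Same labels without axis: they are searched among the columns,
# so all four rows remain but no column matches
no_axis = df.filter(items=['Angela', 'Tom'])

print(by_row.shape)
print(no_axis.shape)
```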

Applying the “like” Parameter

So far, we have seen how we can filter our data by column or row names. But instead of filtering by whole item names, we can also filter for items whose names contain specific letters. For example, we might want to show only the rows whose labels contain the letter “a”. This is where we make use of the “like” parameter of the filter() method:

df.filter(like="a", axis=0)

	height	children	pets
Angela	1.86	3	3
Mary	1.74	2	0

We assign the string "a" to the “like” parameter and say we want to filter the data by row by assigning the value “0” to the “axis” parameter. The output shows a new dataframe with the rows whose labels contain at least one "a". We are not limited to a single character, either. The “like” parameter accepts longer strings as well:

df.filter(like="om", axis=0)

	height	children	pets
Tom	2.01	0	1

The output shows a dataframe again. This time, it only shows the index “Tom” because it is the only row that contains the string “om”.

Similar to this, we are able to use the “like” parameter to filter columns. We just have to assign the value “1” to the “axis” parameter to tell the program we want to filter by column instead of row:

df.filter(like="pe", axis=1)

	pets
Josh	2
Angela	3
Tom	1
Mary	0

The output displays the dataframe with the “pets” column exclusively since it is the only column containing the string "pe".
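The “like” parameter behaves like a plain substring check on each label, and that check is case-sensitive. The following sketch, based on the sample dataframe from above, compares a lowercase and an uppercase search string with an equivalent list comprehension; the variable names are only illustrative:

```python
import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}
df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])

# Lowercase "a" occurs in "Angela" (last letter) and "Mary"
rows_lower = df.filter(like='a', axis=0)

# Uppercase "A" only occurs in "Angela"
rows_upper = df.filter(like='A', axis=0)

# The same selection written as a substring test on the labels
manual = [name for name in df.index if 'a' in name]

print(list(rows_lower.index))
print(list(rows_upper.index))
print(manual)
```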

Using Regular Expressions for Filtering

Applying the “like” parameter to the filter() method allows us to filter the data by strings contained in our items. However, we might want to be more specific and, for example, keep only the rows whose labels end with the letter “a”. The “like” parameter does not work here because if we assign "a" to the “like” parameter, the program looks for items that contain the letter "a" anywhere within the item:

df.filter(like="a", axis=0)

	height	children	pets
Angela	1.86	3	3
Mary	1.74	2	0

As we can see, the output dataframe shows “Angela” as well as “Mary” because both have an “a” within them.

To get the items that end with the letter “a”, we use regular expressions. Regular expressions are used to determine if a string contains a specific search pattern. Luckily, the filter() method provides us with an optional parameter “regex”. This way, we can use regular expressions to filter our data:

df.filter(regex='a$', axis=0)

	height	children	pets
Angela	1.86	3	3

We apply "a$" to the “regex” parameter and assign “0” to the “axis” parameter. That means we filter the dataframe by row and look for any item that ends with the character “a”. As opposed to the example before with the character “a” being applied to the “like” parameter, we only get “Angela” as output and not “Angela” and “Mary” since “Angela” is the only item ending with “a”.
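The “regex” parameter searches for the pattern anywhere in the label unless the pattern is anchored with "^" (start) or "$" (end). The sketch below, using the sample dataframe from above, shows an anchored and an unanchored pattern. It also shows that “items”, “like”, and “regex” cannot be combined in one call; the variable names are only illustrative:

```python
import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}
df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])

# Anchored at the start: labels beginning with "A" or "T"
starts_with_a_or_t = df.filter(regex='^[AT]', axis=0)

# Not anchored: labels containing "o" anywhere
contains_o = df.filter(regex='o', axis=0)

# The three selection parameters are mutually exclusive
try:
    df.filter(like='a', regex='a$', axis=0)
    error_raised = False
except TypeError:
    error_raised = True

print(list(starts_with_a_or_t.index))
print(list(contains_o.index))
print(error_raised)
```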

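Likewise, we can use regular expressions to find items that contain at least one character from a specified set, written as a character class (for example: [abc]). The characters inside the brackets are not separated by commas, since a comma would itself become part of the set:

df.filter(regex='[abc]', axis=1)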

	children
Josh	1
Angela	3
Tom	0
Mary	2

Here, we are looking for all columns whose names contain at least one of the letters “a”, “b”, or “c”. Since “children” is the only column name with one of these letters (the letter “c”), it is the only column in the output.
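Inside square brackets, every character counts as a member of the set, including commas. A pattern such as [a,b,c] therefore also matches labels that contain a comma. The sketch below shows the difference on a made-up dataframe; the column label 'x,y' is hypothetical and only there to make the effect visible:

```python
import pandas as pd

# Hypothetical frame with a comma in one column label
demo = pd.DataFrame({'height': [1.68], 'children': [1], 'x,y': [0]})

# The comma is part of the character class and matches 'x,y'
with_commas = demo.filter(regex='[a,b,c]', axis=1)

# Without commas, only labels containing a, b, or c match
without_commas = demo.filter(regex='[abc]', axis=1)

print(list(with_commas.columns))
print(list(without_commas.columns))
```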

There are many more ways to combine regular expressions with the filter() method than the few we have seen here. If you have not worked with regular expressions yet, they are well worth learning, since they are useful not only with the filter() method but in many other situations as well.

Alternatives to the filter() Function

To filter our data, we do not necessarily need to apply the filter() function. There are several ways to perform filter operations on our dataframes. One alternative is to not use any specific operation at all and to just use a list of column names within square brackets:

df[["height", "children"]]

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

The output is exactly the same as with this approach from before:

df.filter(items=['height', 'children'], axis=1)

	height	children
Josh	1.68	1
Angela	1.86	3
Tom	2.01	0
Mary	1.74	2

An alternative way of filtering rows, however, is to use the loc indexer, which is written with square brackets rather than parentheses:

df.loc[["Josh", "Angela"]]

	height	children	pets
Josh	1.68	1	2
Angela	1.86	3	3

Here, we only show the “Josh” and “Angela” rows by passing these labels as a list to loc in square brackets. The approach from before, using the filter() method, looks like this:

df.filter(items=["Josh", "Angela"], axis=0)

	height	children	pets
Josh	1.68	1	2
Angela	1.86	3	3

As we can see, there are several options for filtering our dataframes apart from the filter() method. However, the approaches we have seen here are just a few. There are many more, but it would be a bit too much to show them all here.
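The alternatives are not fully interchangeable with filter(). As a sketch of one difference, loc raises a KeyError when a requested label is missing, while filter() skips it. The label 'Peter' below is made up for the example. The last part shows a boolean mask on the index as a possible substitute for a pattern such as 'a$'; this is one option among several, not the only way to do it:

```python
import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}
df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])

# 'Peter' is a hypothetical label that is not in the index.
# filter() returns the rows it can find.
lenient = df.filter(items=['Josh', 'Peter'], axis=0)

# loc is stricter and raises KeyError for the missing label
try:
    df.loc[['Josh', 'Peter']]
    loc_raised = False
except KeyError:
    loc_raised = True

# Boolean mask on the row labels: names ending with "a"
mask = df.index.str.endswith('a')
ends_with_a = df.loc[mask]

print(list(lenient.index))
print(loc_raised)
print(list(ends_with_a.index))
```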

Summary

All in all, the filter() method is a very useful tool and it is easy to use. It allows us to subset our dataframe rows or columns by their labels in several ways. We can filter our dataframe by whole items with the “items” parameter, by a few characters using the “like” parameter, and by regular expressions with the “regex” parameter, where the filtering possibilities are nearly endless. If you want to learn more about the Pandas filter() method, I recommend the official documentation. For more tutorials about Pandas, other Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!