The Pandas filter() Method in Python


The Pandas DataFrame filter() Method

In this tutorial, we look at the Pandas filter() method: what it does and how we can apply it to our dataframes. As the name suggests, the filter() method filters our dataframe. More specifically, it subsets the rows or columns of the dataframe according to their index or column labels. It does not filter on the values stored in the dataframe.

Pandas DataFrame filter() Documentation

Filtering by Specific Items

To see how the method works, let’s have a look at an introductory example:

import pandas as pd

data = {
    'height': [1.68, 1.86, 2.01, 1.74],
    'children': [1, 3, 0, 2],
    'pets': [2, 3, 1, 0]
}

df = pd.DataFrame(data, index=['Josh', 'Angela', 'Tom', 'Mary'])
df
        height  children  pets
Josh      1.68         1     2
Angela    1.86         3     3
Tom       2.01         0     1
Mary      1.74         2     0

First, we import the libraries we need. In this case, it’s just Pandas. Then we create the sample dataset as a dictionary of lists. The data contains a person’s height, number of children, and number of pets. Next, we create a Pandas dataframe using the dataset and we apply each person’s name as the dataframe index. Finally, we output the dataframe.

Now, what would we do if we only wanted to see each person’s height and the number of children? We would have to filter out the “pets” column. This is where the Pandas filter() method comes into play:

df.filter(['height', 'children'])
        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

So, inside the parentheses of the filter() call, we pass a list of the items we want to keep. In this case, we choose the “height” and “children” columns, so the output shows only these two columns. That way, we filtered out the “pets” column.

Another way of filtering by the “height” and “children” columns looks like this:

df.filter(items=['height', 'children'])

        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

As you can see, the output is the same as before. We have the dataframe with the “pets” column filtered out. The only difference is that we pass the list explicitly via the “items” keyword; in the first call, the same list was passed positionally to the same parameter.
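To double-check that both spellings behave the same, here is a small self-contained sketch that rebuilds the sample dataframe and compares the two results. The variable names (positional, keyword, same_result) are only illustrative and not part of the tutorial's code.

```python
import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# The list can be passed positionally or via the "items" keyword.
positional = df.filter(["height", "children"])
keyword = df.filter(items=["height", "children"])

# Both calls should give identical dataframes.
same_result = positional.equals(keyword)

# filter() returns a new dataframe; the original keeps all of its columns.
original_columns = list(df.columns)

print(same_result)
print(original_columns)
```

If we want to keep the filtered result, we have to assign it to a variable, as done here with positional and keyword.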

Filtering by Row or Column

By now we have seen how we can filter our dataframe by assigning columns to the “items” parameter. But what if we wanted to filter the dataframe by row? To achieve this, we use the “axis” parameter. Let’s have another look at the dataframe from before:

        height  children  pets
Josh      1.68         1     2
Angela    1.86         3     3
Tom       2.01         0     1
Mary      1.74         2     0

If we only want to see the height, children, and pets from Angela and Tom, the code looks like this:

df.filter(items=['Angela', 'Tom'], axis=0)
        height  children  pets
Angela    1.86         3     3
Tom       2.01         0     1

As before, we pass the items to keep as a list to the “items” parameter. Additionally, we specify the axis to filter on by assigning 0 to the “axis” parameter, which means we filter the dataframe by row. Likewise, we could write "index" instead of 0 and get the same output.

df.filter(items=['Angela', 'Tom'], axis='index')
        height  children  pets
Angela    1.86         3     3
Tom       2.01         0     1

If we apply 1 to the “axis” parameter, we filter the dataframe by column:

df.filter(items=['height', 'children'], axis=1)
        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

Instead of 1, we can also apply the string "columns" to the axis parameter:

df.filter(items=['height', 'children'], axis='columns')
        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

We note that the output dataframe is the same as the one at the top where we do not assign an “axis” parameter at all. This is because, by default, the Pandas filter() method filters the dataframe by column if we do not assign anything else to the “axis” parameter.

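However, if we want to filter by row and do not set the “axis” parameter, we get a useless output. The method looks for “Angela” and “Tom” among the column labels, finds neither, and returns a dataframe that keeps the index but has no columns: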

df.filter(items=['Angela', 'Tom'])
Josh
Angela
Tom
Mary

Consequently, if we filter by row, we have to assign either 0 or "index" to the “axis” parameter, whereas if we filter by column, the “axis” parameter is optional.
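The axis behavior described above can be checked with a short sketch. It rebuilds the sample dataframe so that it runs on its own; the names by_zero, by_name, and no_axis are illustrative only.

```python
import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# axis=0 and axis="index" are two spellings of the same thing.
by_zero = df.filter(items=["Angela", "Tom"], axis=0)
by_name = df.filter(items=["Angela", "Tom"], axis="index")
print(by_zero.equals(by_name))

# Without "axis", the names are looked up among the column labels.
# None of them match, so all four rows remain but no column does.
no_axis = df.filter(items=["Angela", "Tom"])
print(no_axis.shape)
```

Checking the shape of the result is a quick way to notice that a row filter was accidentally applied to the columns.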

Applying the “like” Parameter

So far, we have seen how we can filter our data by column or row names. But instead of filtering by whole item names, we can also filter by a substring of the label. For example, we might want to show only the rows whose label contains the letter “a”. This is where the “like” parameter of the filter() method comes in:

df.filter(like="a", axis=0)
        height  children  pets
Angela    1.86         3     3
Mary      1.74         2     0

We assign the string "a" to the “like” parameter and filter the data by row by setting the “axis” parameter to 0. The output is a new dataframe with the rows whose label contains at least one "a". We are not limited to a single character here; the “like” parameter accepts longer strings as well:

df.filter(like="om", axis=0)
     height  children  pets
Tom    2.01         0     1

The output is a dataframe again. This time, it only contains the row “Tom” because it is the only row whose label contains the string “om”.

Similar to this, we are able to use the “like” parameter to filter columns. We just have to assign the value “1” to the “axis” parameter to tell the program we want to filter by column instead of row:

df.filter(like="pe", axis=1)
        pets
Josh       2
Angela     3
Tom        1
Mary       0

The output displays the dataframe with the “pets” column exclusively since it is the only column containing the string "pe".
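One way to picture the “like” parameter is as a plain substring test on each label. The following sketch compares filter(like=...) with a hand-written mask; the mask is only an illustration of the idea, not how the tutorial filters. It also shows that the match is case-sensitive.

```python
import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# Keep the rows whose label contains a lowercase "a".
like_rows = df.filter(like="a", axis=0)

# The same selection written by hand as a boolean mask over the index.
mask = ["a" in str(label) for label in df.index]
manual_rows = df.loc[mask]
print(like_rows.equals(manual_rows))

# The substring test is case-sensitive: only "Angela" has an uppercase "A".
upper_rows = df.filter(like="A", axis=0)
print(list(upper_rows.index))
```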

Using Regular Expressions for Filtering

Applying the “like” parameter to the filter() method allows us to filter the data by strings contained in our labels. However, we might want to be even more specific and, for example, keep only the rows whose label ends with the letter “a”. The “like” parameter does not work here, because if we assign "a" to it, the method looks for labels that contain the letter "a" anywhere:

df.filter(like="a", axis=0)
        height  children  pets
Angela    1.86         3     3
Mary      1.74         2     0

As we can see, the output dataframe shows “Angela” as well as “Mary” because both have an “a” within them.

To get the items that end with the letter “a”, we use regular expressions. Regular expressions are used to determine if a string contains a specific search pattern. Luckily, the filter() method provides us with an optional parameter “regex”. This way, we can use regular expressions to filter our data:

df.filter(regex='a$', axis=0)
        height  children  pets
Angela    1.86         3     3

We assign "a$" to the “regex” parameter and 0 to the “axis” parameter. The “$” anchors the pattern at the end of the string, so we filter the dataframe by row and keep every label that ends with the character “a”. Unlike the earlier example with "a" assigned to the “like” parameter, we only get “Angela” and not “Angela” and “Mary”, since “Angela” is the only label ending with “a”.
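According to the Pandas documentation, the “regex” parameter keeps the labels for which re.search finds a match. The sketch below reproduces the "a$" example with Python's re module and adds a second pattern, "^T", which is an extra illustration not taken from the tutorial.

```python
import re

import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# "$" anchors the pattern at the end of the label.
ends_with_a = df.filter(regex="a$", axis=0)

# The same test written by hand with re.search on every label.
manual_labels = [label for label in df.index if re.search("a$", label)]
print(list(ends_with_a.index) == manual_labels)

# "^" anchors the pattern at the start of the label.
starts_with_t = df.filter(regex="^T", axis=0)
print(list(starts_with_t.index))
```

Because the pattern is searched anywhere in the label, the anchors “^” and “$” are needed whenever the position of the match matters.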

Likewise, we can use regular expressions to see which items contain at least one character from a specified set of characters (for example: [abc]):

df.filter(regex='[abc]', axis=1)
        children
Josh           1
Angela         3
Tom            0
Mary           2

Here, we are looking for all columns whose name contains at least one of the letters “a”, “b”, or “c”. Inside the square brackets, the characters are written without separators, because a comma would be treated as one more character to match. Since “children” is the only column name with at least one character from that set (the letter “c”), it is the only column in the output.

There are countless ways to combine regular expressions with the filter() method, and we have only seen a few of them here. If you have not worked with regular expressions yet, they are well worth learning, since they are useful far beyond the filter() method.

Alternatives to the filter() Function

To filter our data, we do not necessarily need to apply the filter() function. There are several ways to perform filter operations on our dataframes. One alternative is to use plain bracket indexing with a list of column names:

df[["height", "children"]]
        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

The output is exactly the same as with this approach from before:

df.filter(items=['height', 'children'], axis=1)
        height  children
Josh      1.68         1
Angela    1.86         3
Tom       2.01         0
Mary      1.74         2

An alternative way of filtering rows is to use the loc indexer:

df.loc[["Josh", "Angela"]]
        height  children  pets
Josh      1.68         1     2
Angela    1.86         3     3

Here, we show only the “Josh” and “Angela” rows by passing these labels as a list to loc. Note that loc is an indexer, not a function, so it takes square brackets rather than parentheses. The approach from before, using the filter() method, looks like this:

df.filter(items=["Josh", "Angela"], axis=0)
        height  children  pets
Josh      1.68         1     2
Angela    1.86         3     3

As we can see, there are several options for filtering our dataframes apart from the filter() method. However, the approaches we have seen here are just a few. There are many more, but it would be a bit too much to show them all here.
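The alternatives give the same output when every requested label exists, but they differ when a label is missing. The sketch below assumes a reasonably recent Pandas version (1.0 or later); the labels "age" and "Bob" are hypothetical and deliberately absent from the sample data.

```python
import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# filter() silently skips labels that do not exist ("age" is hypothetical).
filtered = df.filter(items=["height", "age"])
print(list(filtered.columns))

# Bracket indexing raises a KeyError for the same list.
try:
    df[["height", "age"]]
    bracket_raised = False
except KeyError:
    bracket_raised = True

# loc also raises a KeyError for a missing row label ("Bob" is hypothetical).
try:
    df.loc[["Josh", "Bob"]]
    loc_raised = False
except KeyError:
    loc_raised = True

print(bracket_raised, loc_raised)
```

Which behavior is preferable depends on the situation: filter() is forgiving when a label may or may not be present, while bracket indexing and loc make a misspelled label fail loudly.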

Summary

All in all, the filter() method is a very useful tool and it’s easy to use. It allows us to subset our dataframe rows or columns in many diverse ways. We can filter our dataframe by whole items with the “items” parameter, by a few characters using the “like” parameter, and even apply regular expressions where the filtering opportunities are nearly endless. If you want to read more about the Pandas filter() function, I recommend you read more about it in the official documentation. For more tutorials about Pandas, other Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
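One detail about the three parameters is worth a quick sketch: as far as the Pandas documentation states, “items”, “like”, and “regex” are mutually exclusive, so only one of them can be used per call. To combine criteria, calls can be chained instead. The variable names below are illustrative.

```python
import pandas as pd

# Rebuild the sample dataframe from the tutorial.
df = pd.DataFrame(
    {
        "height": [1.68, 1.86, 2.01, 1.74],
        "children": [1, 3, 0, 2],
        "pets": [2, 3, 1, 0],
    },
    index=["Josh", "Angela", "Tom", "Mary"],
)

# Passing two of the three selection parameters at once is rejected.
try:
    df.filter(like="a", regex="a$", axis=0)
    combined_error = None
except TypeError as exc:
    combined_error = exc

print(type(combined_error).__name__)

# Chaining two calls combines a row filter with a column filter.
chained = df.filter(like="a", axis=0).filter(items=["height"], axis=1)
print(list(chained.index), list(chained.columns))
```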

Happy Coding!