5 Best Ways to Filter Rows in Python Pandas - Be on the Right Side of Change

💡 Problem Formulation: When working with datasets in Pandas, you may need to extract, remove, or modify rows based on specific criteria. Whether it is to analyze a subset or clean the data, the ability to filter rows is fundamental. For example, given a DataFrame containing sales data, you might want to filter rows where sales are above a certain threshold, resulting in a reduced DataFrame containing only the relevant records.

Method 1: Boolean Indexing

Boolean indexing is a powerful method in Pandas that allows for the filtering of rows based on the truth value of a boolean expression. You can filter rows in a DataFrame where the conditions are met, returning only the rows that satisfy those conditions. For example, to select rows with a specific column value, you can use a simple boolean expression.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Filter rows where 'Age' is greater than 29
filtered_df = df[df['Age'] > 29]

print(filtered_df)

Output:

     Name  Age
1     Bob   30
2  Charlie  35

This code snippet filters out the rows in which the ‘Age’ column has a value greater than 29. The expression df['Age'] > 29 generates a Series of boolean values, which is then used to select the rows from the original DataFrame.

Method 2: Query Method

The query method in Pandas is used for filtering rows using a query expression. It is a concise and readable way to apply complex filtering logic. It’s particularly useful when the column names are strings and allow you to avoid more verbose syntax.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['David', 'Eva', 'Frank'],
    'Score': [88, 95, 72]
})

# Query to filter rows where 'Score' is greater than 80
filtered_df = df.query('Score > 80')

print(filtered_df)

Output:

   Name  Score
0  David     88
1    Eva     95

This snippet uses the query() method to filter out rows where the ‘Score’ is greater than 80. The query string ‘Score > 80’ is passed directly, making the code more readable and concise than using boolean indexing.

Method 3: .loc[] and .iloc[] Methods

The .loc[] and .iloc[] methods in Pandas can be used for filtering rows by label or position. .loc[] is label-based and works with the index labels of the DataFrame, while .iloc[] is position-based and works with integer positions.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['George', 'Hannah', 'Ian'],
    'Age': [22, 29, 24]
}, index=['first', 'second', 'third'])

# Filter rows in 'first' and 'second' position using .loc
filtered_df_loc = df.loc[['first', 'second']]

# Filter rows in 1st and 2nd position using .iloc
filtered_df_iloc = df.iloc[:2]

print("Filtered with .loc:")
print(filtered_df_loc)
print("\\nFiltered with .iloc:")
print(filtered_df_iloc)

Output:

Filtered with .loc:
         Name  Age
first  George   22
second Hannah   29

Filtered with .iloc:
         Name  Age
first  George   22
second Hannah   29

This code snippet demonstrates the usage of .loc[] and .iloc[] to filter rows based on labels and integer positions, respectively. Both methods resulted in the DataFrame containing the ‘first’ and ‘second’ indexed rows.

Method 4: isin() Method

The isin() method in Pandas is useful for filtering rows by checking if column values belong to a particular set. It’s a straightforward way to select rows with categorical variables or matching a list of values.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Jack', 'Kara', 'Liam'],
    'Department': ['HR', 'Marketing', 'IT']
})

# Filter rows where 'Department' is either 'HR' or 'IT'
departments = ['HR', 'IT']
filtered_df = df[df['Department'].isin(departments)]

print(filtered_df)

Output:

   Name Department
0  Jack         HR
2  Liam         IT

In this example, the isin() method is used to filter the DataFrame, selecting only the rows where the ‘Department’ is either ‘HR’ or ‘IT’. The method checks the ‘Department’ column against the defined list and returns the corresponding rows.

Bonus One-Liner Method 5: lambda expressions with filter()

A lambda function can be coupled with the filter() function in Python to apply a custom filter operation on a DataFrame. This approach is more Pythonic and can sometimes be more efficient, particularly with a large dataset.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Mia', 'Noah', 'Olivia'],
    'Salary': [70000, 80000, 75000]
})

# Convert DataFrame to iterable of tuples (index, series)
rows = df.iterrows()

# Use filter with lambda to select rows with 'Salary' above 75000
filtered_rows = filter(lambda row: row[1]['Salary'] > 75000, rows)
filtered_df = pd.DataFrame([row[1] for row in filtered_rows])

print(filtered_df)

Output:

   Name  Salary
0  Noah   80000

By converting the DataFrame to an iterable of tuples and using filter() with a lambda function, the code snippet filters out rows where the ‘Salary’ exceeds 75000. The resulting iterable is then used to construct a new DataFrame with the filtered data.

Summary/Discussion

Method 1: Boolean Indexing. Simple and efficient for basic conditions. It may become less readable with complex logic.
Method 2: Query Method. Very readable and supports complex expressions. However, it might be slower than other methods for large datasets.
Method 3: .loc[] and .iloc[]. Flexible in selecting data by label or position. They could be less intuitive to those new to Pandas.
Method 4: isin() Method. Ideal for filtering by a set of categorical values. Not suitable for continuous data or complex conditions.
Method 5: lambda expressions with filter(). Pythonic and can be efficient. However, it requires more steps and is less Pandas-oriented.