5 Best Ways to Filter Rows Based on Column Values with Query Function in Pandas

πŸ’‘ Problem Formulation: When working with data in Python, analysts often need to filter DataFrame rows based on specific conditions applied to column values. For instance, given a DataFrame containing sales data, one might wish to extract records where the ‘sales’ column exceeds $500. This article provides multiple methods to accomplish such filtering using the query function in pandas with real-world code examples and discussions.

Method 1: Basic query filtering

The query function in Pandas DataFrame can be used to filter rows based on a condition string. It’s like a SQL WHERE clause but in Python. The syntax is intuitive and readable, allowing you to easily specify which rows to keep in your DataFrame.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date'],
    'sales': [300, 120, 150, 400]
})

# Filtering using query
filtered_df = df.query('sales > 250')
print(filtered_df)

Output:

  product  sales
0   apple    300
3    date    400

In the code above, we create a DataFrame with two columns: ‘product’ and ‘sales’. We then use the query() method to filter rows where ‘sales’ are greater than 250. The resulting filtered_df DataFrame contains only the rows that meet this condition.

Method 2: Query with variable

Pandas query function also lets you use external variables in your query string. This is particularly useful when you want to parameterize your queries. The syntax @ is used to denote that the following identifier is a variable name, not a column name.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date'],
    'sales': [300, 120, 150, 400]
})

# External variable
min_sales = 200

# Filtering using query with variable
filtered_df = df.query('sales > @min_sales')
print(filtered_df)

Output:

  product  sales
0   apple    300
3    date    400

Here, min_sales is defined as an external variable, and we pass it into the query to dynamically filter the DataFrame based on its value. This allows for greater flexibility and reusability of the query string.

Method 3: Query with index

The query function can also operate directly on the index of a DataFrame. This is particularly useful when dealing with time series data or any dataset where the index bears significant meaning.

Here’s an example:

import pandas as pd

# Sample DataFrame with index
df = pd.DataFrame({
    'sales': [300, 120, 150, 400]
}, index=['apple', 'banana', 'cherry', 'date'])

# Filtering using query on index
filtered_df = df.query('index == "apple" or index == "date"')
print(filtered_df)

Output:

       sales
apple    300
date     400

Here, we filter the rows based on the values of the index. The condition specifies that we want to keep rows where the index is either ‘apple’ or ‘date’, demonstrating how to use logical operators in your query string.

Method 4: Complex query with multiple conditions

Sometimes you may need to filter your data based on more than one column. The query function allows you to specify complex conditions that involve multiple columns using logical operators such as & (and), | (or), and ~ (not).

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date'],
    'sales': [300, 120, 150, 400],
    'in_stock': [True, False, True, False]
})

# Filtering using complex query
filtered_df = df.query('sales > 150 & in_stock == True')
print(filtered_df)

Output:

  product  sales  in_stock
0   apple    300      True

The code combines two conditions: ‘sales’ greater than 150 and ‘in_stock’ being True. We use the & logical operator to ensure both conditions must be met for a row to be included in the filtered DataFrame.

Bonus One-Liner Method 5: Chainable Query

For more advanced use, you can chain multiple query calls together. This is useful when each step of filtering is complex or if you need to apply conditions sequentially. Remember that each query call returns a new DataFrame, allowing for the chain of methods without impacting the original DataFrame.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'product': ['apple', 'banana', 'cherry', 'date'],
    'sales': [300, 120, 150, 400],
    'in_stock': [True, True, True, False]
})

# Chaining queries
filtered_df = df.query('in_stock == True').query('sales > 150')
print(filtered_df)

Output:

  product  sales  in_stock
0   apple    300      True

This example showcases how chaining works in practice: first filtering ‘in_stock’ products and then among those, selecting ones with ‘sales’ greater than 150. Each step narrows down the dataset until the desired subset is reached.

Summary/Discussion

Method 1: Basic Query Filtering. Easy to use for simple conditions. Limited to single-string expressions.
Method 2: Query with Variables. Enables dynamic filtering with variables. Requires understanding of the special @ syntax.
Method 3: Query with Index. Ideal for index-based filtering. Not suitable when index filtering isn’t relevant.
Method 4: Complex Query with Multiple Conditions. Can handle numerous and complex conditions. The syntax can be challenging for beginners.
Method 5: Chainable Query. Offers step-wise filtering. Can make the code harder to read if overused.