5 Best Ways to Create a Subset of Columns Using Filter in Python

💡 Problem Formulation: In data analysis with Python, a common task is to create a subset of columns from a larger DataFrame based on specific criteria. This is a routine step in data preprocessing and cleaning, and working with fewer columns keeps the analysis focused and efficient. For example, you might start with a DataFrame containing many columns and want a new DataFrame that includes only the columns whose names start with ‘sales_’ or contain the keyword ‘revenue’.

Method 1: Using the DataFrame filter() function with like parameter

The filter() function in pandas is versatile and selects a subset of columns based on their labels (for a DataFrame it operates on the column axis by default). The like parameter specifies a substring that the selected column names must contain.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Create subset of columns
filtered_df = df.filter(like='sales_')
print(filtered_df)

Output:

   sales_jan  sales_feb
0        200        220
1        250        260

This snippet creates a new DataFrame, filtered_df, that includes only those columns from the original DataFrame df whose names contain ‘sales_’. Note that like matches the substring anywhere in the column name, not only at the start, so filter() is the simplest choice when a plain substring match is all you need.
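When you already know the exact column names you want, filter() also accepts an items parameter. A minimal sketch on the same sample data (the subset variable name is just for illustration):

```python
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Select columns by exact name; names not present in df are silently skipped
subset = df.filter(items=['sales_jan', 'revenue_mar'])
print(list(subset.columns))  # ['sales_jan', 'revenue_mar']
```

Because missing names are skipped rather than raising a KeyError, items is handy when the same selection is applied to DataFrames that may lack some columns.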

Method 2: Using the DataFrame filter() function with regex parameter

The filter() function also supports the regex parameter, enabling you to use regular expressions for complex patterns when selecting column names.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Create subset of columns using regex
filtered_df = df.filter(regex='^(sales_|revenue_)')
print(filtered_df)

Output:

   sales_jan  sales_feb  revenue_mar
0        200        220          140
1        250        260          150

This code snippet filters columns by using a regular expression. The ^(sales_|revenue_) pattern matches columns starting with either ‘sales_’ or ‘revenue_’. The resulting DataFrame filtered_df only includes the matched columns.

Method 3: Selecting columns with list comprehension

A list comprehension in Python can build a list of the column names that satisfy a specific condition, and that list can then be used to index the DataFrame.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Create subset of columns by list comprehension
columns = [col for col in df.columns if 'sales_' in col]
filtered_df = df[columns]
print(filtered_df)

Output:

   sales_jan  sales_feb
0        200        220
1        250        260

In this example, a list comprehension iterates over the column names of df and collects those that contain ‘sales_’. The new DataFrame filtered_df is then created by indexing df with this list.
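A strength of this approach is that the condition can be arbitrary Python, combining several tests that filter() cannot express in one call. A sketch combining a prefix test with a substring test on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Keep columns that start with 'sales_' OR mention 'revenue' anywhere
keep = [col for col in df.columns
        if col.startswith('sales_') or 'revenue' in col]
subset = df[keep]
print(list(subset.columns))  # ['sales_jan', 'sales_feb', 'revenue_mar']
```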

Method 4: Using the DataFrame.loc[] accessor

The .loc[] accessor in pandas is a powerful tool that selects data by label. You can use it to filter columns by passing a boolean mask over the column labels as the column indexer.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Create subset of columns using .loc[]
filtered_df = df.loc[:, df.columns.str.startswith('sales_')]
print(filtered_df)

Output:

   sales_jan  sales_feb
0        200        220
1        250        260

This snippet uses the .loc[] accessor to filter columns. The expression df.columns.str.startswith('sales_') evaluates to a boolean array that the accessor uses to select columns from df that start with ‘sales_’.
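A practical advantage of .loc[] over filter() is that the same call can restrict rows and columns at once. A sketch, assuming we also want only the rows where January sales exceed 200:

```python
import pandas as pd

df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# Row condition and column mask in a single .loc[] call
subset = df.loc[df['sales_jan'] > 200, df.columns.str.startswith('sales_')]
print(subset.shape)  # (1, 2)
```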

Bonus One-Liner Method 5: Using a DataFrame constructor with a dictionary comprehension

You can combine a dictionary comprehension with a DataFrame constructor to filter columns based on a condition, resulting in a concise and elegant one-liner.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

# One-liner to create a subset of columns
filtered_df = pd.DataFrame({k: v for k, v in df.items() if 'sales_' in k})
print(filtered_df)

Output:

   sales_jan  sales_feb
0        200        220
1        250        260

This one-liner iterates over df.items(), which yields each column name together with its data as a (name, Series) pair, keeps only the pairs whose names match the condition, and passes the resulting dictionary to the DataFrame constructor.

Summary/Discussion

  • Method 1: DataFrame filter() with like parameter. Useful for simple substring matches. Does not handle complex patterns.
  • Method 2: DataFrame filter() with regex parameter. Supports complex patterns and is very powerful, but can be slower and requires regex knowledge.
  • Method 3: List comprehension. Versatile and readable. However, may not be as quick as built-in pandas methods for large DataFrames.
  • Method 4: DataFrame.loc[]. Precise label-based selection. Can be very intuitive but may not be as straightforward as the filter() function for beginners.
  • Method 5: One-liner with DataFrame constructor. Elegant and compact. Works well for simple conditions but can be less readable for complex filtering operations.
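As a quick sanity check, all five approaches select the same ‘sales_’ subset on the sample data; a sketch comparing them side by side:

```python
import pandas as pd

df = pd.DataFrame({
    'sales_jan': [200, 250],
    'sales_feb': [220, 260],
    'revenue_mar': [140, 150],
    'profit_apr': [100, 120]
})

a = df.filter(like='sales_')                                     # Method 1
b = df.filter(regex='^sales_')                                   # Method 2
c = df[[col for col in df.columns if 'sales_' in col]]           # Method 3
d = df.loc[:, df.columns.str.startswith('sales_')]               # Method 4
e = pd.DataFrame({k: v for k, v in df.items() if 'sales_' in k}) # Method 5

# DataFrame.equals checks values, columns, index, and dtypes
for other in (b, c, d, e):
    assert a.equals(other)
print('all five methods agree')
```

On data where ‘sales_’ can appear mid-name, the substring-based methods (1, 3, 5) would diverge from the prefix-based ones (2, 4), so choose according to the match you actually intend.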