5 Best Ways to Fetch Columns Between Two Pandas DataFrames by Intersection

πŸ’‘ Problem Formulation: When working with data in Python, analysts often need to combine information from multiple Pandas DataFrames. A common task in this scenario is to identify and extract the columns common to two DataFrames, also known as the intersection. For instance, given two DataFrames with differing column sets, the output should be a new DataFrame or a set of columns that are present in both original DataFrames.

Method 1: Using DataFrame.columns Intersection

This method uses the intersection function to identify common columns between two DataFrames. It is straightforward and leverages the built-in Pandas functionality for index operations. This method is specifically designed to handle the index objects which DataFrame.columns are instances of.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame(columns=['A', 'B', 'C'])
df2 = pd.DataFrame(columns=['B', 'C', 'D'])

# Find common columns
common_columns = df1.columns.intersection(df2.columns)

# Output
print(common_columns)

Output:

Index(['B', 'C'], dtype='object')

In the example, we create two very simple DataFrames with different column sets and then call the intersection() method on the columns of the first DataFrame, passing the columns of the second DataFrame as an argument. The result is an Index object containing the common columns ‘B’ and ‘C’.

Method 2: Using Index & Operator

The & operator can be used on DataFrame column Index objects to perform set intersection. It’s a concise and effective way to find common columns, although it requires the columns to be set as index objects which is a default behavior for DataFrame columns.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame(columns=['A', 'B', 'C'])
df2 = pd.DataFrame(columns=['B', 'C', 'D'])

# Find common columns
common_columns = df1.columns & df2.columns

# Output
print(common_columns)

Output:

Index(['B', 'C'], dtype='object')

This snippet directly compares the column indexes using the & operator which performs an intersection. The result, as before, lists the common columns ‘B’ and ‘C’.

Method 3: Using Merge Function

The merge function in pandas is fundamentally designed for database-style joining of DataFrames, but it can also be used to find the intersection of columns by merging two DataFrames on all possible columns then extracting the resulting columns. It’s an indirect but efficient way to find common columns.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df2 = pd.DataFrame({'B': [5], 'C': [6], 'D': [7]})

# Find common columns by merging
common_df = pd.merge(df1, df2, how='inner')

# Get intersection columns
common_columns = common_df.columns

# Output
print(common_columns)

Output:

Index(['B', 'C'], dtype='object')

In this example, we perform an inner merge on the two DataFrames which results in a DataFrame containing only the common columns. We then extract these columns as an Index object.

Method 4: Using DataFrame.align Method

The align method can align two DataFrames with each other, and by setting join='inner', we can extract the intersection of their columns. Though the primary purpose of align is to align indices, this method effectively facilitates our requirement.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame(columns=['A', 'B', 'C'])
df2 = pd.DataFrame(columns=['B', 'C', 'D'])

# Align DataFrames to find common columns
_, _, common_columns = df1.align(df2, join='inner', axis=1)

# Output
print(common_columns)

Output:

Index(['B', 'C'], dtype='object')

By aligning df1 and df2 with join='inner' along axis=1 (columns), we discard non-matching columns, and the resulting common_columns carry ‘B’ and ‘C’ which are the intersected columns.

Bonus One-Liner Method 5: Using List Comprehension

Python’s list comprehension can also be employed to iterate over one DataFrame’s columns and select only those that are present in another DataFrame’s columns. This one-liner is intuitive and pythonic, though not as performant for larger datasets.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame(columns=['A', 'B', 'C'])
df2 = pd.DataFrame(columns=['B', 'C', 'D'])

# One-liner to find common columns
common_columns = [column for column in df1.columns if column in df2.columns]

# Output
print(common_columns)

Output:

['B', 'C']

This line filters df1.columns based on whether each column is included in df2.columns, thus producing the intersected list of columns ‘B’ and ‘C’.

Summary/Discussion

  • Method 1: Using DataFrame.columns Intersection. Straightforward and utilizes pandas’ built-in functionality. Less pythonic than other methods.
  • Method 2: Using Index & Operator. Concise and effective but slightly less intuitive for those unfamiliar with set operations on Index objects.
  • Method 3: Using Merge Function. Uses a more powerful function for a simple task, might be indirect but leverages pandas’ merging capabilities.
  • Method 4: Using DataFrame.align Method. Not commonly used for this purpose, but a clean and efficient way to perform this operation aligning to the inner join.
  • Bonus Method 5: Using List Comprehension. Pythonic and easy to write, but may be slower on larger datasets and is less idiomatic within pandas.