5 Best Ways to Find Common Elements in Two Pandas DataFrames

πŸ’‘ Problem Formulation: When working with data in Python, you often need to identify the elements that two dataframes have in common. For example, you might have two lists of user IDs from different sources and want to find which IDs appear in both. This article explores five methods for finding these intersections, aiming for output that highlights the common elements directly and efficiently.
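To make the goal concrete, here is a tiny sketch of the kind of input and output we are after (the column name and the sample values are purely illustrative):

import pandas as pd

# Two hypothetical sources of user IDs
df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

# Desired result: the IDs present in both sources, i.e. 2 and 3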

Method 1: Using merge()

Merging dataframes on common columns is a straightforward approach to finding common elements between two Pandas DataFrames. The merge() function performs an inner join by default, which retains only the rows whose values in the join column(s) appear in both dataframes.


Here’s an example:

import pandas as pd

# Create two sample dataframes
df1 = pd.DataFrame({'id': [1, 2, 3], 'feature': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'value': [10, 20, 30]})

# Find common elements
common = df1.merge(df2, on='id')

print(common)

Output:

   id feature  value
0   2       b     10
1   3       c     20

This code snippet creates two DataFrames and merges them on the ‘id’ column. The result is a new DataFrame containing only the rows with common ‘id’ values from both original DataFrames, essentially the intersection.
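The same idea extends to more than one key column, and if you only care about the keys themselves you can restrict the merge to those columns first. A minimal sketch, where the extra 'source' column is hypothetical:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'source': ['x', 'y', 'y']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'source': ['y', 'z', 'z']})

# Inner join (the default) on both key columns: rows must match on 'id' AND 'source'
common_keys = df1.merge(df2, on=['id', 'source'])

# Keep only the key column if the other columns are not needed
common_ids = df1[['id']].merge(df2[['id']], on='id')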

Method 2: Using isin() with Boolean Indexing

The isin() method, combined with boolean indexing, filters the rows of one dataframe whose values appear in a particular column of another dataframe. It is well suited to identifying common elements.

Here’s an example:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({'id': [3, 4, 5, 6]})

# Use isin() for common elements
common_elements = df1[df1['id'].isin(df2['id'])]

print(common_elements)

Output:

   id
2   3
3   4

The code filters df1 for rows where the ‘id’ is present in df2['id'], using the isin() function and boolean indexing, thus yielding the common elements.
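The same pattern works in the other direction, and isin() also accepts a plain Python set of lookup values, which reads clearly and, on larger frames, may be a touch faster than passing a Series. A small sketch under those assumptions:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({'id': [3, 4, 5, 6]})

# isin() accepts any iterable of lookup values, including a set
lookup = set(df2['id'])
common_in_df1 = df1[df1['id'].isin(lookup)]

# The mirror-image filter: rows of df2 whose 'id' also appears in df1
common_in_df2 = df2[df2['id'].isin(df1['id'])]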

Method 3: Using concat() with duplicated()

Concatenating the two dataframes and then selecting the duplicated rows is an indirect way to find the intersection. First concatenate the dataframes vertically, then use duplicated(keep=False) as a boolean mask to keep only the rows that appear in both; keep=False marks every occurrence of a duplicated row rather than just the repeats.

Here’s an example:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

# Concat and drop duplicates to find the common elements
combined = pd.concat([df1, df2])
common_elements = combined[combined.duplicated(keep=False)]

print(common_elements)

Output:

   id
1   2
2   3
0   2
1   3

This code concatenates df1 and df2 and then selects the duplicated rows, which are the common elements of the two dataframes. Because keep=False retains every occurrence, each common item appears twice. Note that this assumes neither dataframe contains duplicates of its own; a value repeated within a single dataframe would also be flagged.
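If you would rather see each common value only once (and with a clean index), you can chain drop_duplicates() and reset_index() onto the same result. A minimal sketch, assuming no internal duplicates:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

combined = pd.concat([df1, df2])

# Keep only rows that occur in both frames, then collapse the repeats
common_unique = (
    combined[combined.duplicated(keep=False)]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(common_unique)
#    id
# 0   2
# 1   3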

Method 4: Using numpy.intersect1d()

To find the common elements between two dataframe columns, the NumPy library provides the intersect1d() function. It computes the intersection of two arrays and returns the sorted, unique values that appear in both of them.

Here’s an example:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

# Find common elements using numpy
common_elements = np.intersect1d(df1['id'], df2['id'])

print(common_elements)

Output:

[2 3]

The snippet uses NumPy’s intersect1d() function to find the β€˜id’ values present in both dataframes and returns them as an array.
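intersect1d() can also report where the common values sit in each input via return_indices=True (available in modern NumPy versions), which is handy when you want to pull the full matching rows back out of the dataframes. A brief sketch:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'id': [1, 2, 3], 'feature': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'value': [10, 20, 30]})

# Return the common values plus their positional indices in each input
common_ids, idx1, idx2 = np.intersect1d(df1['id'], df2['id'], return_indices=True)

# Use the positions to recover the full matching rows from either frame
matching_rows_df1 = df1.iloc[idx1]
matching_rows_df2 = df2.iloc[idx2]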

Bonus One-Liner Method 5: Using set Intersection

If you’re looking for a quick and straightforward one-liner, converting dataframe columns into sets and using set intersection is the most Pythonic way. It is simple, but only works for comparing single columns.

Here’s an example:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

# Find common elements using set intersection
common_elements = set(df1['id']) & set(df2['id'])

print(common_elements)

Output:

{2, 3}

This code converts the ‘id’ columns from both dataframes into sets and then finds the intersection of these two sets, resulting in a set of the common elements.
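The resulting set is unordered; if you need a sorted list, or a dataframe again, it is easy to convert back. A short sketch:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [2, 3, 4]})

common = set(df1['id']) & set(df2['id'])

# A sorted list of the common values
common_sorted = sorted(common)            # [2, 3]

# Or back to a dataframe by filtering the original rows
common_df = df1[df1['id'].isin(common)]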

Summary/Discussion

  • Method 1: merge(). Good for finding rows with multiple common columns. It can be overkill for simple intersections and is less efficient than other methods for large datasets.
  • Method 2: isin() with Boolean Indexing. Suitable for column-wise comparison and highly readable. It might be less efficient for large datasets compared to set operations.
  • Method 3: concat() with drop_duplicates(). It provides a quick way to find duplicates after combining dataframes but can be inefficient due to the initial concatenation of datasets.
  • Method 4: numpy.intersect1d(). Ideal for finding common elements within one-dimension arrays. It’s a cleaner and potentially faster approach for numerical comparisons.
  • Bonus Method 5: set Intersection. The most Pythonic and efficient for single column comparisons in small to medium-sized datasets. Not applicable for multiple column comparisons.