5 Best Ways to Merge Pandas DataFrames with Inner Join in Python

πŸ’‘ Problem Formulation: When working with tabular data in Python, we often need to combine two datasets on a common set of keys to perform analyses on intersection points. For instance, we might have two separate DataFrames containing employee details and department information, and we want to merge them based on their department IDs to only include employees for whom department data exists. An inner join is perfect for these scenarios; this article explores five methods to achieve this using Pandas.

Method 1: Using pd.merge() function

The pd.merge() function in Pandas is the most straightforward method to perform an inner join. It takes two DataFrames and merges them based on common columns or indices, and by default, performs an inner join.

Here’s an example:

import pandas as pd

df1 = pd.DataFrame({'employee_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'employee_id': [2, 3, 4], 'department': ['HR', 'Engineering', 'Marketing']})

merged_df = pd.merge(df1, df2, on='employee_id')
print(merged_df)

The output of this code snippet will be:

   employee_id     name  department
0            2      Bob          HR
1            3  Charlie  Engineering

This code snippet merges df1 and df2 on the ’employee_id’ column. Only the rows with common ’employee_id’ in both the DataFrames are included in the resulting DataFrame, respecting the nature of an inner join.

Method 2: Using DataFrame’s merge() method

Pandas DataFrames have a merge() method that behaves similarly to the pd.merge() function. It offers a more object-oriented approach to merging DataFrames.

Here’s an example:

merged_df = df1.merge(df2, on='employee_id')
print(merged_df)

The output will be identical to Method 1:

   employee_id     name  department
0            2      Bob          HR
1            3  Charlie  Engineering

By calling df1.merge(df2), we merge df2 into df1 using ’employee_id’ as a key. This reflects an inner join as it only includes records with matching keys in both DataFrames.

Method 3: Specifying how='inner' explicitly

Though inner join is the default merge type in Pandas, it can be explicitly specified using the how parameter in the merge() function for clarity and readability, especially for those new to Pandas or SQL.

Here’s an example:

merged_df = pd.merge(df1, df2, on='employee_id', how='inner')
print(merged_df)

The result is:

   employee_id     name  department
0            2      Bob          HR
1            3  Charlie  Engineering

This snippet does the same inner join but explicitly sets how='inner' to indicate the type of join, making the code self-documenting and easier to understand at a glance.

Method 4: Using Concatenation Followed by Dropna

While not a conventional method, you can simulate an inner join by concatenating DataFrames vertically and then dropping rows with NaN values, which are the result of non-matching entries. However, this requires the DataFrames to have identical column names and is less efficient.

Here’s an example:

concatenated_df = pd.concat([df1.set_index('employee_id'), df2.set_index('employee_id')], axis=1)
merged_df = concatenated_df.dropna().reset_index()
print(merged_df)

This results in:

   employee_id     name  department
0            2      Bob          HR
1            3  Charlie  Engineering

This concatenates df1 and df2 along the columns, aligning them on the ’employee_id’ index. By dropping rows with NaN values, it mimics an inner join, with the reset_index restoring the employee_id as a column.

Bonus One-Liner Method 5: Using join() with set_index()

The join() method performs a join on the indices of two DataFrames. By setting a common column as the index for both DataFrames, you can effectively perform an inner join.

Here’s an example:

merged_df = df1.set_index('employee_id').join(df2.set_index('employee_id'), how='inner').reset_index()
print(merged_df)

The output:

   employee_id     name  department
0            2      Bob          HR
1            3  Charlie  Engineering

This snippet sets ’employee_id’ as the index for both DataFrames and then joins them. The how='inner' parameter ensures that only the intersection is included. The reset_index() call is used to move ’employee_id’ back to the columns.

Summary/Discussion

  • Method 1: pd.merge() function. Widely used. Easy to understand. Default inner join behavior may be unclear to newcomers.
  • Method 2: DataFrame’s merge() method. Object-oriented approach. Same behavior as pd.merge(). Preference can depend on coding style.
  • Method 3: Specifying how=’inner’. Explicit inner join. Enhances code readability. Redundant as inner join is the default.
  • Method 4: Concatenation + Dropna. Simulates inner join. Less efficient and requires identical column names. Not recommended for usual cases.
  • Method 5: join() + set_index(). Concise one-liner. Requires setting and resetting index. Useful for quick joins on indices.