5 Best Ways to Merge Two Pandas DataFrames in Python

💡 Problem Formulation: When working with data in Python, you often need to combine two datasets. Suppose we have two DataFrames, df1 and df2, that describe related records but hold different columns. We want to merge them so that the final table contains the information from both, usually by matching rows on common columns or indices. Below, we’ll explore five ways to do this with pandas, each taking two input DataFrames and producing a single combined DataFrame.

Method 1: Using `pd.merge()` for SQL-Style Joins

This method allows users to combine DataFrames using SQL-like joins. The pd.merge() function is highly versatile and supports inner, outer, left, and right joins through its how parameter. It matches rows based on one or more keys, which can be specified using the on, left_on, and right_on parameters.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'key': [1, 2, 3], 'A': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'B': ['B0', 'B1', 'B3']})

# Merging DataFrames
result = pd.merge(df1, df2, on='key')

print(result)

The output:

   key   A   B
0    1  A0  B0
1    2  A1  B1

This code snippet merges two DataFrames on a common column, key. The how parameter defaults to 'inner', so the function returns only the rows whose key appears in both DataFrames. Keys 3 (only in df1) and 4 (only in df2) are therefore dropped.
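To see how the other join types behave on the same data, here is a minimal sketch. The variable names left, outer, df2_renamed, and by_id are illustrative and not part of the original example; the how, indicator, left_on, and right_on parameters belong to pd.merge().

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'A': ['A0', 'A1', 'A2']})
df2 = pd.DataFrame({'key': [1, 2, 4], 'B': ['B0', 'B1', 'B3']})

# Left join: keep every row of df1; B is NaN where df2 has no matching key
left = pd.merge(df1, df2, on='key', how='left')

# Outer join: keep keys from both sides; indicator=True adds a '_merge'
# column that records whether a row came from the left, right, or both
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# When the key columns have different names, use left_on / right_on
df2_renamed = df2.rename(columns={'key': 'id'})  # hypothetical renamed key
by_id = pd.merge(df1, df2_renamed, left_on='key', right_on='id')

print(left)
print(outer)
print(by_id)
```

The left join keeps key 3 with a missing B value, and the outer join additionally keeps key 4 with a missing A value.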

Method 2: Concatenating Along an Axis with `pd.concat()`

Concatenation is another way to merge data. The pd.concat() function stacks multiple DataFrames together along a specified axis (0 for index/rows, 1 for columns). It is useful when you want to combine DataFrames with similar structures or to append rows/columns from one DataFrame to another.

Here’s an example:

import pandas as pd

# Sample DataFrames with the same columns
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Concatenating DataFrames
result = pd.concat([df1, df2])

print(result)

The output:

    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3

The code here effectively stacks df1 and df2 on top of each other, yielding a new DataFrame with combined rows. Remember that the indices are preserved from the original DataFrames, which can sometimes result in duplicate index labels.
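If the duplicate index labels are unwanted, one option is to pass ignore_index=True. The sketch below also shows concatenation along axis=1; the names stacked, df3, and side_by_side are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# ignore_index=True discards the original labels and renumbers the rows 0..3
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places DataFrames side by side, aligning rows on the index
df3 = pd.DataFrame({'C': ['C0', 'C1']})  # hypothetical extra column
side_by_side = pd.concat([df1, df3], axis=1)

print(stacked)
print(side_by_side)
```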

Method 3: Combining Data With Overlapping Columns Using `pd.DataFrame.combine_first()`

The combine_first() method of a DataFrame allows you to combine two DataFrames where one fills in values that are missing in the other. This is particularly useful when one DataFrame has a subset of the other’s column and index labels.

Here’s an example:

import pandas as pd

# Sample DataFrames with overlapping columns
df1 = pd.DataFrame({'A': ['A0', None, 'A2'], 'B': ['B0', 'B1', None]})
df2 = pd.DataFrame({'A': [None, 'A1', 'A3'], 'B': ['B0', None, 'B2']})

# Combine DataFrames
result = df1.combine_first(df2)

print(result)

The output:

    A   B
0  A0  B0
1  A1  B1
2  A2  B2

This example combines df1 and df2, with df1 taking precedence. Wherever df1 has a missing value, the corresponding value from df2 is used instead. Where both DataFrames hold a value, df1 wins: at index 2, column A keeps 'A2' from df1 rather than 'A3' from df2.
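combine_first() also handles DataFrames whose labels only partly overlap: the result covers the union of both indexes and both sets of columns. A small sketch with made-up data (the 'X…' values are placeholders chosen to make the source of each cell visible):

```python
import pandas as pd

# df1 has two rows and one column, with one missing value
df1 = pd.DataFrame({'A': ['A0', None]}, index=[0, 1])

# df2 has an extra row (index 2) and an extra column (B)
df2 = pd.DataFrame({'A': ['X0', 'X1', 'X2'],
                    'B': ['B0', 'B1', 'B2']}, index=[0, 1, 2])

result = df1.combine_first(df2)

# 'A0' survives from df1; the gap at index 1, the whole of row 2,
# and the whole of column B are taken from df2
print(result)
```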

Method 4: Appending Rows with `DataFrame.append()`

The append() method provides a quick way to add rows from one DataFrame to another without a more complex merge. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions; on current versions, pd.concat() is the replacement.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Appending df2 to df1
# (requires pandas < 2.0; DataFrame.append() was removed in pandas 2.0)
result = df1.append(df2)

print(result)

The output:

    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3

Here, df2 is appended to df1, creating a DataFrame where the rows of df2 follow the rows of df1. Similar to pd.concat(), the indices of the original DataFrames are preserved.
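On pandas 2.0 and later, where DataFrame.append() no longer exists, the same result can be obtained with pd.concat(). The following is a sketch of that replacement; the names renumbered, new_row, and with_row are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Replacement for df1.append(df2): original index labels are kept
result = pd.concat([df1, df2])

# Replacement for df1.append(df2, ignore_index=True): rows renumbered 0..3
renumbered = pd.concat([df1, df2], ignore_index=True)

# To add a single row given as a dict, wrap it in a one-row DataFrame first
new_row = pd.DataFrame([{'A': 'A4', 'B': 'B4'}])
with_row = pd.concat([renumbered, new_row], ignore_index=True)

print(result)
print(with_row)
```

When many rows arrive one at a time, it is usually cheaper to collect them in a list and call pd.concat() once than to concatenate inside a loop, since each call copies the data.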

Bonus One-Liner Method 5: DataFrame Merge Using a Lambda Function

For a compact inline merge, you can wrap pd.merge() in a lambda function and call it immediately. The lambda adds no merging capability of its own, but the pattern can be handy when the merge has to appear inside a larger expression.

Here’s an example:

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'key': [1, 2]})
df2 = pd.DataFrame({'B': ['B0', 'B1'], 'key': [1, 2]})

# One-liner merge using a lambda function
result = (lambda x, y: pd.merge(x, y, on='key'))(df1, df2)

print(result)

The output:

    A  key   B
0  A0    1  B0
1  A1    2  B1

This single line of code performs an inner merge of df1 and df2 on the ‘key’ column. The lambda simply forwards its two arguments to pd.merge(), so the result is identical to calling pd.merge(df1, df2, on='key') directly; the wrapper is mainly of use when the merge needs to be passed around as a callable.
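As a quick check, the sketch below compares the lambda version with a direct pd.merge() call and with the equivalent DataFrame.merge() method. The names via_lambda, direct, and method are chosen for this illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'key': [1, 2]})
df2 = pd.DataFrame({'B': ['B0', 'B1'], 'key': [1, 2]})

# The one-liner from the example above
via_lambda = (lambda x, y: pd.merge(x, y, on='key'))(df1, df2)

# The same merge as a plain function call and as a DataFrame method
direct = pd.merge(df1, df2, on='key')
method = df1.merge(df2, on='key')

# DataFrame.equals() compares values, labels, and dtypes
print(via_lambda.equals(direct))
print(direct.equals(method))
```

All three spellings produce the same DataFrame, so the choice between them is a matter of readability.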

Summary/Discussion

Method 1: Using pd.merge() for SQL-Style Joins. Offers the capability to perform complex joins like those in SQL. However, it requires some understanding of join types.
Method 2: Concatenating with pd.concat(). Ideal for stacking DataFrames with the same schema vertically or placing them side by side. Watch out for non-matching columns, which are filled with NaN, and for duplicate index labels.
Method 3: pd.DataFrame.combine_first(). Useful for overlapping DataFrames where one supplies the missing data of the other. It offers little when the DataFrames share no index or column labels, since there is then nothing to fill in.
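Method 4: Appending Rows with DataFrame.append(). A simple way to add rows, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so it only runs on older versions; use pd.concat() in current code. It was also less efficient than pd.concat() for large DataFrames, and it retains the original index labels.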
Bonus One-Liner Method 5: Lambda Function Merge. Quick and inline, but less readable than calling pd.merge() directly and no more flexible for complex merging operations.
