Pandas merge() – A Simple Illustrated Guide with Video

Rate this post

In this tutorial, we will learn about the Pandas merge() function. Described in one sentence, the merge() function is used to combine datasets in various ways.

As you go through the tutorial, you can watch the following video guide for ease of understanding:

Syntax and Parameters

pandas.merge(left, right, how='inner', on=None, left_on=None, 
             right_on=None, left_index=False, right_index=False, 
             sort=False, suffixes=('_x', '_y'), copy=True, 
             indicator=False, validate=None)

Here are the parameters from the official documentation:

ParameterTypeDescription
leftDataFrame
rightDataFrame or SeriesData frame to merge with
how{'left', 'right', 'outer',
'inner', 'cross'}
, default 'inner'
Merge type to perform:
left: only use the keys from the left DataFrame
right: only use the keys from the right DataFrame
outer: use common keys from both DataFrames
inner: use overlap of keys from both DataFrames
cross: cartesian product from both DataFrames
onlabel or listColumn or index level names to join on.
Must be contained in both DataFrames.
left_onlabel, or list, or
array-like
Column or index level names to join on in the left DataFrame.
right_onlabel, or list, or
array-like
Column or index level names to join on in the right DataFrame.
left_indexbool, default False Use index from the left DataFrame as join key(s).
right_indexbool, default False Use index from the right DataFrame as join key(s).
sortbool, default FalseSort the join keys lexicographically in the resulting DataFrame. If set to False, the order of the join keys depends on the join type.
suffixeslist-like, default is
("_x", "_y")
A length-2 sequence where each element is
optionally a string indicating the suffix to add to overlapping column names in left and right respectively. At least one of the values must not be None.
copybool, default TrueIf False, avoid a copy if possible.
indicatorbool or str, default
False
If set to True, adds a column to the output DataFrame called "_merge" containing information of the source of each row. The column can be given a different name by providing a string as argument.
validatestr, optionalIf used, checks if merge is of a specified type.
“one_to_one” or “1:1”: checks if merged keys are unique in both left and right datasets.
“one_to_many” or “1:m”: checks if merged keys are unique in left dataset.
“many_to_one” or “m:1”: checks if merged keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but it does not result in checks.

The return value of the merge() function is a DataFrame consisting of the two merged objects.

Basic Example

To get started, we will first create two data frames that we will be merging in several ways throughout this tutorial:

import pandas as pd

df1 = pd.DataFrame({'Player': ['Jeremy', 'Alice', 'Bob', 'John', 'Mary'],
                    'Age': [31, 25, 27, 28, 21]})
df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df1)
print()
print(df2)

Output:

PlayerAge
0Jeremy31
1Alice25
2Bob27
3John28
4Mary21

… and …

PlayerPositionThrowing Speed
0AlicePitcher71
1JohnCatcher80
2MaryCenter Field81

First, we import the Pandas library. Then we create the two data frames “df1” and “df2“. The first data frame contains the player’s names of a Baseball team, as well as the player’s age.

The second data frame also contains a part of the player’s names of the first data frame and these player’s position and their throwing speed.

We finally output the data frames and see the mentioned information in a compact way.

Now, we apply the merge() function:

pd.merge(df1, df2, on="Player")

Output:

PlayerAgePositionThrowing Speed
0Alice25Pitcher71
1John28Catcher80
2Mary21Center Field81

The first two arguments are the names of the data frames that we want to merge. The third argument is the “on” parameter. The “on” parameter expects the column names to join on and we set it equal to “Player”.

Thus, Pandas merges these data frames on the “Player” column. The merged data frame only contains the players “Alice”, “John”, and “Mary” because these are the only players contained in both data frames. So, “Jeremy” and “Bob” from the first data frame are dropped.

The “left_on” and “right_on” Parameters

For this section, we will modify the data frame “df2” a little bit:

df2 = pd.DataFrame({'Name': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df2)

Output:

NamePositionThrowing Speed
0AlicePitcher71
1JohnCatcher80
2MaryCenter Field81

The only difference is that we changed the label of the “Player” column to “Name”.

Now, we want to merge the data frames “df1” and “df2” again. However, we cannot do so by applying the “on” parameter and assigning it to “Player” since “df2” does not have a “Player” column anymore.

Therefore, we use the two parameters “left_on” and “right_on“. We set the “left_on” parameter equal to the column label that we want to use for merging from the first data frame and we do the same with the “right_on” parameter for the second data frame:

pd.merge(df1, df2, left_on="Player", right_on="Name")

Output:

PlayerAgeNamePositionThrowing Speed
0Alice25AlicePitcher71
1John28JohnCatcher80
2Mary21MaryCenter Field81

This way, we can merge data frames by columns with different column labels.

Since the “Player” column and the “Name” column contain the same information, we might want to get rid of one of them:

df = pd.merge(df1, df2, left_on="Player", right_on="Name")
df = df.drop("Name", axis=1)
print(df)

Output:

PlayerAgePositionThrowing Speed
0Alice25Pitcher71
1John28Catcher80
2Mary21Center Field81

We assign the merged data frame to a new variable called “df” and then we apply the drop() function and assign it the “Name” column. The “axis” parameter is set to “1” to state that we want to drop a column and not a row.

The outputted data frame now misses the “Name” column.

Merge Using Different Joins

In this next step, we will learn about the different types of merges and how to apply them using the “how” parameter.

Therefore, we change “df2” again. We rename the “Name” column back to “Player”. Also, we add two new players, “Jane” and “Mick”:

df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary', 
                               'Jane', 'Mick'],
                    'Position': ['Pitcher', 'Catcher', 
                                 'Center Field', 'Pitcher', 'Catcher'],
                    'Throwing Speed': [71, 80, 81, 79, 75]})
PlayerPositionThrowing Speed
0AlicePitcher71
1JohnCatcher80
2MaryCenter Field81
3JanePitcher79
4MickCatcher75

df1” still looks like this:

print(df1)
PlayerAge
0Jeremy31
1Alice25
2Bob27
3John28
4Mary21

We start with the so-called “inner” join.

Here, we use the intersection of keys from both our data frames:

pd.merge(df1, df2, how="inner", on="Player")
PlayerAgePositionThrowing Speed
0Alice25Pitcher71
1John28Catcher80
2Mary21Center Field81

As before, we assign the “on” parameter the value “Player” to specify what column we want to join on. We set the “how” parameter equal to "inner" to state that we want to perform an inner join.

The outputted data frame contains only the players that occur in both data frames. When we compare that merge to the merge we did in the first section, we can see that they are the same. That’s because "inner" is the default value for the “how” parameter.

The next type of merge we are looking at is the “outer” join.

The outer join is the union of keys from both our data frames:

pd.merge(df1, df2, how="outer", on="Player")
PlayerAgePositionThrowing Speed
0Jeremy31.0NaNNaN
1Alice25.0Pitcher71.0
2Bob27.0NaNNaN
3John28.0Catcher80.0
4Mary21.0Center Field81.0
5JaneNaNPitcher79.0
6MickNaNCatcher75.0

The data frame contains all players from both data frames. Bob, for example, has no value for position and throwing speed because he is only contained in “df1” where we don’t get position and throwing speed values. Similarly, Jane does not have an age value here since she is only found in “df2” which does not provide age information.

The next merge type is the "left" join.

Here, we use the keys from the left data frame exclusively:

pd.merge(df1, df2, how="left", on="Player")
PlayerAgePositionThrowing Speed
0Jeremy31NaNNaN
1Alice25Pitcher71.0
2Bob27NaNNaN
3John28Catcher80.0
4Mary21Center Field81.0

This data frame contains all the players from the left data frame which is “df1” in our case. Thus, Jeremy and Bob have no position and throwing speed values.

The "right" join is similar to the left join:

We are using the keys from the right data frame only:

pd.merge(df1, df2, how="right", on="Player")
PlayerAgePositionThrowing Speed
0Alice25.0Pitcher71
1John28.0Catcher80
2Mary21.0Center Field81
3JaneNaNPitcher79
4MickNaNCatcher75

After merging, the data frame contains all the players from the right data frame which is “df2“. That’s why Jane and Mick have no age values here.

The last join, we are learning about, is a bit special. It is called “cross” join and it creates the cartesian product from both data frames while keeping the order of the keys from the left data frame:

pd.merge(df1, df2, how="cross")
Player_xAgePlayer_yPositionThrowing Speed
0Jeremy31AlicePitcher71
1Jeremy31JohnCatcher80
2Jeremy31MaryCenter Field81
3Jeremy31JanePitcher79
4Jeremy31MickCatcher75
5Alice25AlicePitcher71
6Alice25JohnCatcher80
7Alice25MaryCenter Field81
8Alice25JanePitcher79
9Alice25MickCatcher75
10Bob27AlicePitcher71
11Bob27JohnCatcher80
12Bob27MaryCenter Field81
13Bob27JanePitcher79
14Bob27MickCatcher75
15John28AlicePitcher71
16John28JohnCatcher80
17John28MaryCenter Field81
18John28JanePitcher79
19John28MickCatcher75
20Mary21AlicePitcher71
21Mary21JohnCatcher80
22Mary21MaryCenter Field81
23Mary21JanePitcher79
24Mary21MickCatcher75

We can observe that we have two “Player” columns now, “Player_x” and “Player_y”. Each player of “df2” is assigned to each player of “df1“. Since both data frames contain five rows, the resulting data frame now has 25 rows (5×5).

The “indicator” Parameter

When we merge two data frames, it might be useful to gain information about the source of the merge keys, whether they were observed only in the left data frame, only in the right data frame, or in both. Therefore, we use the “indicator” parameter:

pd.merge(df1, df2, how="outer", on="Player", indicator=True)
PlayerAgePositionThrowing Speed_merge
0Jeremy31.0NaNNaNleft_only
1Alice25.0Pitcher71.0both
2Bob27.0NaNNaNleft_only
3John28.0Catcher80.0both
4Mary21.0Center Field81.0both
5JaneNaNPitcher79.0right_only
6MickNaNCatcher75.0right_only

We perform an outer join on the “Player” column and add the “indicator” and we set it to “True“. This way, we get an additional column called “_merge” with the entries “left_only“, “right_only“, and “both“.

For example, Jeremy gets assigned the value “left_only” because he only appears in the left data frame “df1“. And Mary’s “_merge” value is set to “both” because she is found in both data frames.

Summary

In this tutorial, we learned about the Pandas function merge(). We learned how to perform different kinds of merges using the function’s various parameters.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!