Pandas merge() - A Simple Illustrated Guide with Video

In this tutorial, we will learn about the Pandas merge() function. Described in one sentence, the merge() function is used to combine datasets in various ways.

As you go through the tutorial, you can watch the following video guide for ease of understanding:

Syntax and Parameters

pandas.merge(left, right, how='inner', on=None, left_on=None, 
             right_on=None, left_index=False, right_index=False, 
             sort=False, suffixes=('_x', '_y'), copy=True, 
             indicator=False, validate=None)

Here are the parameters from the official documentation:

Parameter	Type	Description
`left`	DataFrame
`right`	DataFrame or Series	Data frame to merge with
`how`	`{'left', 'right', 'outer', 'inner', 'cross'}`, default `'inner'`	Merge type to perform: left: only use the keys from the left DataFrame right: only use the keys from the right DataFrame outer: use common keys from both DataFrames inner: use overlap of keys from both DataFrames cross: cartesian product from both DataFrames
`on`	label or list	Column or index level names to join on. Must be contained in both DataFrames.
`left_on`	label, or list, or array-like	Column or index level names to join on in the left DataFrame.
`right_on`	label, or list, or array-like	Column or index level names to join on in the right DataFrame.
`left_index`	`bool`, default `False`	Use index from the left DataFrame as join key(s).
`right_index`	`bool`, default `False`	Use index from the right DataFrame as join key(s).
`sort`	`bool`, default `False`	Sort the join keys lexicographically in the resulting DataFrame. If set to `False`, the order of the join keys depends on the join type.
`suffixes`	list-like, default is (`"_x"`, `"_y"`)	A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. At least one of the values must not be `None`.
`copy`	`bool`, default `True`	If `False`, avoid a copy if possible.
`indicator`	`bool` or `str`, default `False`	If set to `True`, adds a column to the output DataFrame called `"_merge"` containing information of the source of each row. The column can be given a different name by providing a string as argument.
`validate`	`str`, optional	If used, checks if merge is of a specified type. “one_to_one” or “1:1”: checks if merged keys are unique in both left and right datasets. “one_to_many” or “1:m”: checks if merged keys are unique in left dataset. “many_to_one” or “m:1”: checks if merged keys are unique in right dataset. “many_to_many” or “m:m”: allowed, but it does not result in checks.

The return value of the merge() function is a DataFrame consisting of the two merged objects.

Basic Example

To get started, we will first create two data frames that we will be merging in several ways throughout this tutorial:

import pandas as pd

df1 = pd.DataFrame({'Player': ['Jeremy', 'Alice', 'Bob', 'John', 'Mary'],
                    'Age': [31, 25, 27, 28, 21]})
df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df1)
print()
print(df2)

Output:

	Player	Age
0	Jeremy	31
1	Alice	25
2	Bob	27
3	John	28
4	Mary	21

… and …

	Player	Position	Throwing Speed
0	Alice	Pitcher	71
1	John	Catcher	80
2	Mary	Center Field	81

First, we import the Pandas library. Then we create the two data frames “df1” and “df2“. The first data frame contains the player’s names of a Baseball team, as well as the player’s age.

The second data frame also contains a part of the player’s names of the first data frame and these player’s position and their throwing speed.

We finally output the data frames and see the mentioned information in a compact way.

Now, we apply the merge() function:

pd.merge(df1, df2, on="Player")

Output:

	Player	Age	Position	Throwing Speed
0	Alice	25	Pitcher	71
1	John	28	Catcher	80
2	Mary	21	Center Field	81

The first two arguments are the names of the data frames that we want to merge. The third argument is the “on” parameter. The “on” parameter expects the column names to join on and we set it equal to “Player”.

Thus, Pandas merges these data frames on the “Player” column. The merged data frame only contains the players “Alice”, “John”, and “Mary” because these are the only players contained in both data frames. So, “Jeremy” and “Bob” from the first data frame are dropped.

The “left_on” and “right_on” Parameters

For this section, we will modify the data frame “df2” a little bit:

df2 = pd.DataFrame({'Name': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df2)

Output:

	Name	Position	Throwing Speed
0	Alice	Pitcher	71
1	John	Catcher	80
2	Mary	Center Field	81

The only difference is that we changed the label of the “Player” column to “Name”.

Now, we want to merge the data frames “df1” and “df2” again. However, we cannot do so by applying the “on” parameter and assigning it to “Player” since “df2” does not have a “Player” column anymore.

Therefore, we use the two parameters “left_on” and “right_on“. We set the “left_on” parameter equal to the column label that we want to use for merging from the first data frame and we do the same with the “right_on” parameter for the second data frame:

pd.merge(df1, df2, left_on="Player", right_on="Name")

Output:

	Player	Age	Name	Position	Throwing Speed
0	Alice	25	Alice	Pitcher	71
1	John	28	John	Catcher	80
2	Mary	21	Mary	Center Field	81

This way, we can merge data frames by columns with different column labels.

Since the “Player” column and the “Name” column contain the same information, we might want to get rid of one of them:

df = pd.merge(df1, df2, left_on="Player", right_on="Name")
df = df.drop("Name", axis=1)
print(df)

Output:

	Player	Age	Position	Throwing Speed
0	Alice	25	Pitcher	71
1	John	28	Catcher	80
2	Mary	21	Center Field	81

We assign the merged data frame to a new variable called “df” and then we apply the drop() function and assign it the “Name” column. The “axis” parameter is set to “1” to state that we want to drop a column and not a row.

The outputted data frame now misses the “Name” column.

Merge Using Different Joins

In this next step, we will learn about the different types of merges and how to apply them using the “how” parameter.

Therefore, we change “df2” again. We rename the “Name” column back to “Player”. Also, we add two new players, “Jane” and “Mick”:

df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary', 
                               'Jane', 'Mick'],
                    'Position': ['Pitcher', 'Catcher', 
                                 'Center Field', 'Pitcher', 'Catcher'],
                    'Throwing Speed': [71, 80, 81, 79, 75]})

	Player	Position	Throwing Speed
0	Alice	Pitcher	71
1	John	Catcher	80
2	Mary	Center Field	81
3	Jane	Pitcher	79
4	Mick	Catcher	75

“df1” still looks like this:

print(df1)

	Player	Age
0	Jeremy	31
1	Alice	25
2	Bob	27
3	John	28
4	Mary	21

We start with the so-called “inner” join.

Here, we use the intersection of keys from both our data frames:

pd.merge(df1, df2, how="inner", on="Player")

	Player	Age	Position	Throwing Speed
0	Alice	25	Pitcher	71
1	John	28	Catcher	80
2	Mary	21	Center Field	81

As before, we assign the “on” parameter the value “Player” to specify what column we want to join on. We set the “how” parameter equal to "inner" to state that we want to perform an inner join.

The outputted data frame contains only the players that occur in both data frames. When we compare that merge to the merge we did in the first section, we can see that they are the same. That’s because "inner" is the default value for the “how” parameter.

The next type of merge we are looking at is the “outer” join.

The outer join is the union of keys from both our data frames:

pd.merge(df1, df2, how="outer", on="Player")

	Player	Age	Position	Throwing Speed
0	Jeremy	31.0	NaN	NaN
1	Alice	25.0	Pitcher	71.0
2	Bob	27.0	NaN	NaN
3	John	28.0	Catcher	80.0
4	Mary	21.0	Center Field	81.0
5	Jane	NaN	Pitcher	79.0
6	Mick	NaN	Catcher	75.0

The data frame contains all players from both data frames. Bob, for example, has no value for position and throwing speed because he is only contained in “df1” where we don’t get position and throwing speed values. Similarly, Jane does not have an age value here since she is only found in “df2” which does not provide age information.

The next merge type is the "left" join.

Here, we use the keys from the left data frame exclusively:

pd.merge(df1, df2, how="left", on="Player")

	Player	Age	Position	Throwing Speed
0	Jeremy	31	NaN	NaN
1	Alice	25	Pitcher	71.0
2	Bob	27	NaN	NaN
3	John	28	Catcher	80.0
4	Mary	21	Center Field	81.0

This data frame contains all the players from the left data frame which is “df1” in our case. Thus, Jeremy and Bob have no position and throwing speed values.

The "right" join is similar to the left join:

We are using the keys from the right data frame only:

pd.merge(df1, df2, how="right", on="Player")

	Player	Age	Position	Throwing Speed
0	Alice	25.0	Pitcher	71
1	John	28.0	Catcher	80
2	Mary	21.0	Center Field	81
3	Jane	NaN	Pitcher	79
4	Mick	NaN	Catcher	75

After merging, the data frame contains all the players from the right data frame which is “df2“. That’s why Jane and Mick have no age values here.

The last join, we are learning about, is a bit special. It is called “cross” join and it creates the cartesian product from both data frames while keeping the order of the keys from the left data frame:

pd.merge(df1, df2, how="cross")

	Player_x	Age	Player_y	Position	Throwing Speed
0	Jeremy	31	Alice	Pitcher	71
1	Jeremy	31	John	Catcher	80
2	Jeremy	31	Mary	Center Field	81
3	Jeremy	31	Jane	Pitcher	79
4	Jeremy	31	Mick	Catcher	75
5	Alice	25	Alice	Pitcher	71
6	Alice	25	John	Catcher	80
7	Alice	25	Mary	Center Field	81
8	Alice	25	Jane	Pitcher	79
9	Alice	25	Mick	Catcher	75
10	Bob	27	Alice	Pitcher	71
11	Bob	27	John	Catcher	80
12	Bob	27	Mary	Center Field	81
13	Bob	27	Jane	Pitcher	79
14	Bob	27	Mick	Catcher	75
15	John	28	Alice	Pitcher	71
16	John	28	John	Catcher	80
17	John	28	Mary	Center Field	81
18	John	28	Jane	Pitcher	79
19	John	28	Mick	Catcher	75
20	Mary	21	Alice	Pitcher	71
21	Mary	21	John	Catcher	80
22	Mary	21	Mary	Center Field	81
23	Mary	21	Jane	Pitcher	79
24	Mary	21	Mick	Catcher	75

We can observe that we have two “Player” columns now, “Player_x” and “Player_y”. Each player of “df2” is assigned to each player of “df1“. Since both data frames contain five rows, the resulting data frame now has 25 rows (5×5).

The “indicator” Parameter

When we merge two data frames, it might be useful to gain information about the source of the merge keys, whether they were observed only in the left data frame, only in the right data frame, or in both. Therefore, we use the “indicator” parameter:

pd.merge(df1, df2, how="outer", on="Player", indicator=True)

	Player	Age	Position	Throwing Speed	_merge
0	Jeremy	31.0	NaN	NaN	left_only
1	Alice	25.0	Pitcher	71.0	both
2	Bob	27.0	NaN	NaN	left_only
3	John	28.0	Catcher	80.0	both
4	Mary	21.0	Center Field	81.0	both
5	Jane	NaN	Pitcher	79.0	right_only
6	Mick	NaN	Catcher	75.0	right_only

We perform an outer join on the “Player” column and add the “indicator” and we set it to “True“. This way, we get an additional column called “_merge” with the entries “left_only“, “right_only“, and “both“.

For example, Jeremy gets assigned the value “left_only” because he only appears in the left data frame “df1“. And Mary’s “_merge” value is set to “both” because she is found in both data frames.

Summary

In this tutorial, we learned about the Pandas function merge(). We learned how to perform different kinds of merges using the function’s various parameters.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!