In this tutorial, we will learn about the Pandas merge()
function. Described in one sentence, the merge()
function is used to combine datasets in various ways.
As you go through the tutorial, you can watch the following video guide for ease of understanding:
Syntax and Parameters
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Here are the parameters from the official documentation:
Parameter | Type | Description |
left | DataFrame | |
right | DataFrame or Series | Data frame to merge with |
how | {'left', 'right', 'outer', , default 'inner' | Merge type to perform: left: only use the keys from the left DataFrame right: only use the keys from the right DataFrame outer: use common keys from both DataFrames inner: use overlap of keys from both DataFrames cross: cartesian product from both DataFrames |
on | label or list | Column or index level names to join on. Must be contained in both DataFrames. |
left_on | label, or list, or array-like | Column or index level names to join on in the left DataFrame. |
right_on | label, or list, or array-like | Column or index level names to join on in the right DataFrame. |
left_index | bool , default False | Use index from the left DataFrame as join key(s). |
right_index | bool , default False | Use index from the right DataFrame as join key(s). |
sort | bool , default False | Sort the join keys lexicographically in the resulting DataFrame. If set to False , the order of the join keys depends on the join type. |
suffixes | list-like, default is ( "_x" , "_y" ) | A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. At least one of the values must not be None . |
copy | bool , default True | If False , avoid a copy if possible. |
indicator | bool or str , defaultFalse | If set to True , adds a column to the output DataFrame called "_merge" containing information of the source of each row. The column can be given a different name by providing a string as argument. |
validate | str , optional | If used, checks if merge is of a specified type. โone_to_oneโ or โ1:1โ: checks if merged keys are unique in both left and right datasets. โone_to_manyโ or โ1:mโ: checks if merged keys are unique in left dataset. โmany_to_oneโ or โm:1โ: checks if merged keys are unique in right dataset. โmany_to_manyโ or โm:mโ: allowed, but it does not result in checks. |
The return value of the merge()
function is a DataFrame consisting of the two merged objects.
Basic Example
To get started, we will first create two data frames that we will be merging in several ways throughout this tutorial:
import pandas as pd df1 = pd.DataFrame({'Player': ['Jeremy', 'Alice', 'Bob', 'John', 'Mary'], 'Age': [31, 25, 27, 28, 21]}) df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary'], 'Position': ['Pitcher', 'Catcher', 'Center Field'], 'Throwing Speed': [71, 80, 81]}) print(df1) print() print(df2)
Output:
Player | Age | ||
0 | Jeremy | 31 | |
1 | Alice | 25 | |
2 | Bob | 27 | |
3 | John | 28 | |
4 | Mary | 21 |
… and …
Player | Position | Throwing Speed | |
0 | Alice | Pitcher | 71 |
1 | John | Catcher | 80 |
2 | Mary | Center Field | 81 |
First, we import the Pandas library. Then we create the two data frames “df1
” and “df2
“. The first data frame contains the player’s names of a Baseball team, as well as the player’s age.
The second data frame also contains a part of the player’s names of the first data frame and these player’s position and their throwing speed.
We finally output the data frames and see the mentioned information in a compact way.
Now, we apply the merge()
function:
pd.merge(df1, df2, on="Player")
Output:
Player | Age | Position | Throwing Speed | |
0 | Alice | 25 | Pitcher | 71 |
1 | John | 28 | Catcher | 80 |
2 | Mary | 21 | Center Field | 81 |
The first two arguments are the names of the data frames that we want to merge. The third argument is the “on
” parameter. The “on
” parameter expects the column names to join on and we set it equal to “Player”.
Thus, Pandas merges these data frames on the “Player” column. The merged data frame only contains the players “Alice”, “John”, and “Mary” because these are the only players contained in both data frames. So, “Jeremy” and “Bob” from the first data frame are dropped.
The “left_on” and “right_on” Parameters
For this section, we will modify the data frame “df2
” a little bit:
df2 = pd.DataFrame({'Name': ['Alice', 'John', 'Mary'], 'Position': ['Pitcher', 'Catcher', 'Center Field'], 'Throwing Speed': [71, 80, 81]}) print(df2)
Output:
Name | Position | Throwing Speed | |
0 | Alice | Pitcher | 71 |
1 | John | Catcher | 80 |
2 | Mary | Center Field | 81 |
The only difference is that we changed the label of the “Player” column to “Name”.
Now, we want to merge the data frames “df1
” and “df2
” again. However, we cannot do so by applying the “on
” parameter and assigning it to “Player” since “df2
” does not have a “Player” column anymore.
Therefore, we use the two parameters “left_on
” and “right_on
“. We set the “left_on
” parameter equal to the column label that we want to use for merging from the first data frame and we do the same with the “right_on
” parameter for the second data frame:
pd.merge(df1, df2, left_on="Player", right_on="Name")
Output:
Player | Age | Name | Position | Throwing Speed | |
0 | Alice | 25 | Alice | Pitcher | 71 |
1 | John | 28 | John | Catcher | 80 |
2 | Mary | 21 | Mary | Center Field | 81 |
This way, we can merge data frames by columns with different column labels.
Since the “Player” column and the “Name” column contain the same information, we might want to get rid of one of them:
df = pd.merge(df1, df2, left_on="Player", right_on="Name") df = df.drop("Name", axis=1) print(df)
Output:
Player | Age | Position | Throwing Speed | |
0 | Alice | 25 | Pitcher | 71 |
1 | John | 28 | Catcher | 80 |
2 | Mary | 21 | Center Field | 81 |
We assign the merged data frame to a new variable called “df
” and then we apply the drop()
function and assign it the “Name” column. The “axis
” parameter is set to “1” to state that we want to drop a column and not a row.
The outputted data frame now misses the “Name” column.
Merge Using Different Joins
In this next step, we will learn about the different types of merges and how to apply them using the “how
” parameter.
Therefore, we change “df2
” again. We rename the “Name” column back to “Player”. Also, we add two new players, “Jane” and “Mick”:
df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary', 'Jane', 'Mick'], 'Position': ['Pitcher', 'Catcher', 'Center Field', 'Pitcher', 'Catcher'], 'Throwing Speed': [71, 80, 81, 79, 75]})
Player | Position | Throwing Speed | |
0 | Alice | Pitcher | 71 |
1 | John | Catcher | 80 |
2 | Mary | Center Field | 81 |
3 | Jane | Pitcher | 79 |
4 | Mick | Catcher | 75 |
“df1
” still looks like this:
print(df1)
Player | Age | |
0 | Jeremy | 31 |
1 | Alice | 25 |
2 | Bob | 27 |
3 | John | 28 |
4 | Mary | 21 |
We start with the so-called “inner” join.
Here, we use the intersection of keys from both our data frames:
pd.merge(df1, df2, how="inner", on="Player")
Player | Age | Position | Throwing Speed | |
0 | Alice | 25 | Pitcher | 71 |
1 | John | 28 | Catcher | 80 |
2 | Mary | 21 | Center Field | 81 |
As before, we assign the “on
” parameter the value “Player” to specify what column we want to join on. We set the “how
” parameter equal to "inner"
to state that we want to perform an inner join.
The outputted data frame contains only the players that occur in both data frames. When we compare that merge to the merge we did in the first section, we can see that they are the same. That’s because "inner"
is the default value for the “how
” parameter.
The next type of merge we are looking at is the “outer” join.
The outer join is the union of keys from both our data frames:
pd.merge(df1, df2, how="outer", on="Player")
Player | Age | Position | Throwing Speed | |
0 | Jeremy | 31.0 | NaN | NaN |
1 | Alice | 25.0 | Pitcher | 71.0 |
2 | Bob | 27.0 | NaN | NaN |
3 | John | 28.0 | Catcher | 80.0 |
4 | Mary | 21.0 | Center Field | 81.0 |
5 | Jane | NaN | Pitcher | 79.0 |
6 | Mick | NaN | Catcher | 75.0 |
The data frame contains all players from both data frames. Bob, for example, has no value for position and throwing speed because he is only contained in “df1
” where we don’t get position and throwing speed values. Similarly, Jane does not have an age value here since she is only found in “df2
” which does not provide age information.
The next merge type is the "left"
join.
Here, we use the keys from the left data frame exclusively:
pd.merge(df1, df2, how="left", on="Player")
Player | Age | Position | Throwing Speed | |
0 | Jeremy | 31 | NaN | NaN |
1 | Alice | 25 | Pitcher | 71.0 |
2 | Bob | 27 | NaN | NaN |
3 | John | 28 | Catcher | 80.0 |
4 | Mary | 21 | Center Field | 81.0 |
This data frame contains all the players from the left data frame which is “df1
” in our case. Thus, Jeremy and Bob have no position and throwing speed values.
The "right"
join is similar to the left join:
We are using the keys from the right data frame only:
pd.merge(df1, df2, how="right", on="Player")
Player | Age | Position | Throwing Speed | |
0 | Alice | 25.0 | Pitcher | 71 |
1 | John | 28.0 | Catcher | 80 |
2 | Mary | 21.0 | Center Field | 81 |
3 | Jane | NaN | Pitcher | 79 |
4 | Mick | NaN | Catcher | 75 |
After merging, the data frame contains all the players from the right data frame which is “df2
“. That’s why Jane and Mick have no age values here.
The last join, we are learning about, is a bit special. It is called “cross” join and it creates the cartesian product from both data frames while keeping the order of the keys from the left data frame:
pd.merge(df1, df2, how="cross")
Player_x | Age | Player_y | Position | Throwing Speed | |
0 | Jeremy | 31 | Alice | Pitcher | 71 |
1 | Jeremy | 31 | John | Catcher | 80 |
2 | Jeremy | 31 | Mary | Center Field | 81 |
3 | Jeremy | 31 | Jane | Pitcher | 79 |
4 | Jeremy | 31 | Mick | Catcher | 75 |
5 | Alice | 25 | Alice | Pitcher | 71 |
6 | Alice | 25 | John | Catcher | 80 |
7 | Alice | 25 | Mary | Center Field | 81 |
8 | Alice | 25 | Jane | Pitcher | 79 |
9 | Alice | 25 | Mick | Catcher | 75 |
10 | Bob | 27 | Alice | Pitcher | 71 |
11 | Bob | 27 | John | Catcher | 80 |
12 | Bob | 27 | Mary | Center Field | 81 |
13 | Bob | 27 | Jane | Pitcher | 79 |
14 | Bob | 27 | Mick | Catcher | 75 |
15 | John | 28 | Alice | Pitcher | 71 |
16 | John | 28 | John | Catcher | 80 |
17 | John | 28 | Mary | Center Field | 81 |
18 | John | 28 | Jane | Pitcher | 79 |
19 | John | 28 | Mick | Catcher | 75 |
20 | Mary | 21 | Alice | Pitcher | 71 |
21 | Mary | 21 | John | Catcher | 80 |
22 | Mary | 21 | Mary | Center Field | 81 |
23 | Mary | 21 | Jane | Pitcher | 79 |
24 | Mary | 21 | Mick | Catcher | 75 |
We can observe that we have two “Player” columns now, “Player_x” and “Player_y”. Each player of “df2
” is assigned to each player of “df1
“. Since both data frames contain five rows, the resulting data frame now has 25 rows (5×5).
The “indicator” Parameter
When we merge two data frames, it might be useful to gain information about the source of the merge keys, whether they were observed only in the left data frame, only in the right data frame, or in both. Therefore, we use the “indicator
” parameter:
pd.merge(df1, df2, how="outer", on="Player", indicator=True)
Player | Age | Position | Throwing Speed | _merge | |
0 | Jeremy | 31.0 | NaN | NaN | left_only |
1 | Alice | 25.0 | Pitcher | 71.0 | both |
2 | Bob | 27.0 | NaN | NaN | left_only |
3 | John | 28.0 | Catcher | 80.0 | both |
4 | Mary | 21.0 | Center Field | 81.0 | both |
5 | Jane | NaN | Pitcher | 79.0 | right_only |
6 | Mick | NaN | Catcher | 75.0 | right_only |
We perform an outer join on the “Player” column and add the “indicator” and we set it to “True
“. This way, we get an additional column called “_merge
” with the entries “left_only
“, “right_only
“, and “both
“.
For example, Jeremy gets assigned the value “left_only
” because he only appears in the left data frame “df1
“. And Mary’s “_merge
” value is set to “both
” because she is found in both data frames.
Summary
In this tutorial, we learned about the Pandas function merge()
. We learned how to perform different kinds of merges using the function’s various parameters.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!