Pandas merge_ordered() - A Simple Guide with Video

In this tutorial, we will learn about the Pandas function merge_ordered(). This method performs a merge with optional interpolation. It is especially useful for ordered data like time series data.

Syntax and Parameters

Here are the parameters from the official documentation:

Parameter	Type	Description
`left`	DataFrame
`right`	DataFrame
`on`	label or list	Field names to join on. Must be contained in both DataFrames.
`left_on`	label or list, or array-like	Field names to join on in the left DataFrame.
`right_on`	label or list, or array-like	Field names to join on in the right DataFrame.
`left_by`	column name or list of column names	Group the left DataFrame by group columns and merge piece by piece with the right DataFrame.
`right_by`	column name or list of column names	Group the right DataFrame by group columns and merge piece by piece with the left DataFrame.
`fill_method`	`{'ffill', None}`, default: `None`	Interpolation method for data.
`suffixes`	list-like, default is (`"_x"`, `"_y"`)	A length-2 sequence where each element is optionally a string indicating the suffix to add to the overlapping column names in left and right respectively. A value of `None` instead of a string indicates that the column name from left or right should be left as it is. At least one of the values must not be `None`.
`how`	`{'left', 'right', 'outer', 'inner'}`, default `'outer'`	`left`: use keys from left data frame only `right`: use keys from right data frame only `outer`: use union of keys from both data frames `inner`: use intersection of keys from both data frames

Returns	Type	Description
	DataFrame	The merged DataFrame output type will the be same as ‘`left`’, if it is a subclass of DataFrame.

Basic Example

To get started, we will create two data frames:

import pandas as pd
df1 = pd.DataFrame({
    'Date': ['15/01/2019', '16/01/2019', '17/01/2019', '18/01/2019', 
    '19/01/2019', '20/01/2019'],
    'Price': [16.7, 18.4, 20.0, 19.3, 17.1, 21.2]
    })
print(df1)

	Date	Price
0	15.01.2019	16.7
1	16.01.2019	18.4
2	17.01.2019	20.0
3	18.01.2019	19.3
4	19.01.2019	17.1
5	20.01.2019	21.2

df2 = pd.DataFrame({
    'Date': ['15/01/2019', '17/01/2019', '18/01/2019', '20/01/2019', 
    '21/01/2019', '22/01/2019'],
    'Price': [14.6, 19.8, 21.9, 20.2, 17.4, 18.0]
})

print(df2)

	Date	Price
0	15.01.2019	14.6
1	17.01.2019	19.8
2	18.01.2019	21.9
3	20.01.2019	20.2
4	21.01.2019	17.4
5	22.01.2019	18.0

Here, we import the Pandas library as the first step. Then, we create the two data frames “df1” and “df2” which contain a “Date” column and a “Price” column respectively.

Now that we created these data frames, in the next step we can perform our first merge_ordered() operation:

pd.merge_ordered(df1, df2, on='Date')

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	16.01.2019	18.4	NaN
2	17.01.2019	20.0	19.8
3	18.01.2019	19.3	21.9
4	19.01.2019	17.1	NaN
5	20.01.2019	21.2	20.2
6	21.01.2019	NaN	17.4
7	22.01.2019	NaN	18.0

We apply the merge_ordered() function and put in the two data frames as the first two arguments of the function. That’s because these are the data frames that we want to merge. The third parameter is the “on” parameter. This parameter expects the column or a list of columns that we want to perform the merge on. We choose the “Date” column here.

The outputted data frame is longer than each of the two initial data frames. That’s because, by default, the merge_ordered() function performs a so-called “outer” join. That means, we use the union of keys from both our data frames. Since there are eight unique dates, the resulting data frame has eight rows in total.

We also get two price columns: “Price_x” and “Price_y“. For each date, we get a price from the left data frame (“Price_x“) and the right data frame (“Price_y“). If there is a “NaN” value, that means, for this specific date, we have only one price value. For example, for the "16.01.2019", we do not get a “Price_y” value because this date is only found in the first data frame.

The “fill_method” parameter

As we saw in the example above, there were some missing values labeled with “NaN“:

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	16.01.2019	18.4	NaN
2	17.01.2019	20.0	19.8
3	18.01.2019	19.3	21.9
4	19.01.2019	17.1	NaN
5	20.01.2019	21.2	20.2
6	21.01.2019	NaN	17.4
7	22.01.2019	NaN	18.0

We can get rid of these “NaN” values by replacing these values with the previous value. We achieve that by applying the “fill_method” parameter and assigning it to “ffill“:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill')

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	16.01.2019	18.4	14.6
2	17.01.2019	20.0	19.8
3	18.01.2019	19.3	21.9
4	19.01.2019	17.1	21.9
5	20.01.2019	21.2	20.2
6	21.01.2019	21.2	17.4
7	22.01.2019	21.2	18.0

Now, we do not have any more missing values here. For example, the “NaN” value in the “Price_y” column for the date "16.01.2019" was replaced with the previous value from that column (“14.6“).

If multiple values are missing directly one after the other, all missing values get replaced by the last available value. For example, the last two values from the “Price_x” column were missing. They were both replaced by the value of the third last row which was "21.2".

The “suffixes” parameter

In the previous example, the two price columns were named “Price_x” and “Price_y” by default. However, we can change these labels by applying the “suffixes” parameter:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', suffixes=['_leftDF', '_rightDF'])

	Date	Price_leftDF	Price_rightDF
0	15.01.2019	16.7	14.6
1	16.01.2019	18.4	14.6
2	17.01.2019	20.0	19.8
3	18.01.2019	19.3	21.9
4	19.01.2019	17.1	21.9
5	20.01.2019	21.2	20.2
6	21.01.2019	21.2	17.4
7	22.01.2019	21.2	18.0

We performed the same merge_ordered() operation as before. But this time, we added the “suffixes” parameter and assigned it a list with the strings “_leftDF” and “_rightDF“. The two price columns in the resulting data frame are now called “Price_leftDF” and “Price_rightDF“.

As the name of the parameter suggests, we only change the suffixes here, not the whole label. That’s why the column labels still say “Price” before the suffixes because the initial column label said “Price” and we only added the suffixes after that label.

The different kinds of joins

As mentioned in the introduction, by default the merge_ordered() function performs an outer join. That means we take the union of keys from both data frames.

But we can change that by using the “how” parameter.

Another type of join is the “inner” join which uses the intersection of keys from both data frames:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='inner')

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	17.01.2019	20.0	19.8
2	18.01.2019	19.3	21.9
3	20.01.2019	21.2	20.2

That means, we only get the dates that are found in both data frames.

The remaining two options the “how” parameter provides us with are the “left” join and the “right” join. The left join uses only the keys from the left data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='left')

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	16.01.2019	18.4	14.6
2	17.01.2019	20.0	19.8
3	18.01.2019	19.3	21.9
4	19.01.2019	17.1	21.9
5	20.01.2019	21.2	20.2

Whereas the right join only uses the keys from the right data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='right')

	Date	Price_x	Price_y
0	15.01.2019	16.7	14.6
1	17.01.2019	20.0	19.8
2	18.01.2019	19.3	21.9
3	20.01.2019	21.2	20.2
4	21.01.2019	21.2	17.4
5	22.01.2019	21.2	18.0

Grouping by group columns

For this section, we will modify “df1” a little bit:

df1['Category'] = ['A', 'A', 'A', 'B', 'B', 'B']
print(df1)

	Date	Price	Category
0	15.01.2019	16.7	A
1	16.01.2019	18.4	A
2	17.01.2019	20.0	A
3	18.01.2019	19.3	B
4	19.01.2019	17.1	B
5	20.01.2019	21.2	B

We added a column called “Category” and assigned the categories “A” or “B” to each row.

“df2” remains unchanged:

	Date	Price
0	15.01.2019	14.6
1	17.01.2019	19.8
2	18.01.2019	21.9
3	20.01.2019	20.2
4	21.01.2019	17.4
5	22.01.2019	18.0

Now, we apply the “left_by” parameter and assign it the value column label “Category“:

pd.merge_ordered(df1, df2, on='Date', left_by="Category")

	Date	Price_x	Category	Price_y
0	15.01.2019	16.7	A	14.6
1	16.01.2019	18.4	A	NaN
2	17.01.2019	20.0	A	19.8
3	18.01.2019	NaN	A	21.9
4	20.01.2019	NaN	A	20.2
5	21.01.2019	NaN	A	17.4
6	22.01.2019	NaN	A	18.0
7	15.01.2019	NaN	B	14.6
8	17.01.2019	NaN	B	19.8
9	18.01.2019	19.3	B	21.9
10	19.01.2019	17.1	B	NaN
11	20.01.2019	21.2	B	20.2
12	21.01.2019	NaN	B	17.4
13	22.01.2019	NaN	B	18.0

This way, we group the left data frame by the “Category” column and merge that piece by piece with the right data frame.

When we look at the resulting data frame, we can observe that, for example, the “Price_x” entry for the date "18.01.2019" in row 3 is “NaN” although there is an entry for that date in “df1“. However, in “df1“, the date is assigned to the category “B“. So, in the merged data frame, the “Price_x” value for the date "18.01.2019" is found in row 9 with category “B“.

If we had a group column in the right data frame, we could do the same with the “right_by” parameter.

Summary

All in all, we learned how to use the Pandas function merge_ordered(). We saw how to apply the various parameters, how to use the different types of joins, and how to group by group columns.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!