Pandas merge_ordered() – A Simple Guide with Video

Rate this post

In this tutorial, we will learn about the Pandas function merge_ordered(). This method performs a merge with optional interpolation. It is especially useful for ordered data like time series data.

Syntax and Parameters

Here are the parameters from the official documentation:

ParameterTypeDescription
leftDataFrame
rightDataFrame
onlabel or listField names to join on. Must be contained in both
DataFrames.
left_onlabel or list, or array-likeField names to join on in the left DataFrame.
right_onlabel or list, or array-likeField names to join on in the right DataFrame.
left_bycolumn name or list of
column names
Group the left DataFrame by group columns and
merge piece by piece with the right DataFrame.
right_bycolumn name or list of
column names
Group the right DataFrame by group columns and
merge piece by piece with the left DataFrame.
fill_method{'ffill', None},
default: None
Interpolation method for data.
suffixeslist-like, default is
("_x", "_y")
A length-2 sequence where each element is
optionally a string indicating the suffix to add to the overlapping column names in left and right respectively. A value of None instead of a string indicates that the column name from left or right should be left as it is. At least one of the values must not be None.
how{'left', 'right', 'outer', 'inner'},
default 'outer'
left: use keys from left data frame only
right: use keys from right data frame only
outer: use union of keys from both data frames
inner: use intersection of keys from both data frames
ReturnsTypeDescription
DataFrameThe merged DataFrame output type will the be same
as ‘left’, if it is a subclass of DataFrame.

Basic Example

To get started, we will create two data frames:

import pandas as pd
df1 = pd.DataFrame({
    'Date': ['15/01/2019', '16/01/2019', '17/01/2019', '18/01/2019', 
    '19/01/2019', '20/01/2019'],
    'Price': [16.7, 18.4, 20.0, 19.3, 17.1, 21.2]
    })
print(df1)
DatePrice
015.01.201916.7
116.01.201918.4
217.01.201920.0
318.01.201919.3
419.01.201917.1
520.01.201921.2
df2 = pd.DataFrame({
    'Date': ['15/01/2019', '17/01/2019', '18/01/2019', '20/01/2019', 
    '21/01/2019', '22/01/2019'],
    'Price': [14.6, 19.8, 21.9, 20.2, 17.4, 18.0]
})

print(df2)
DatePrice
015.01.201914.6
117.01.201919.8
218.01.201921.9
320.01.201920.2
421.01.201917.4
522.01.201918.0

Here, we import the Pandas library as the first step. Then, we create the two data frames “df1” and “df2” which contain a “Date” column and a “Price” column respectively.

Now that we created these data frames, in the next step we can perform our first merge_ordered() operation:

pd.merge_ordered(df1, df2, on='Date')
DatePrice_xPrice_y
015.01.201916.714.6
116.01.201918.4NaN
217.01.201920.019.8
318.01.201919.321.9
419.01.201917.1NaN
520.01.201921.220.2
621.01.2019NaN17.4
722.01.2019NaN18.0

We apply the merge_ordered() function and put in the two data frames as the first two arguments of the function. That’s because these are the data frames that we want to merge. The third parameter is the “on” parameter. This parameter expects the column or a list of columns that we want to perform the merge on. We choose the “Date” column here.

The outputted data frame is longer than each of the two initial data frames. That’s because, by default, the merge_ordered() function performs a so-called “outer” join. That means, we use the union of keys from both our data frames. Since there are eight unique dates, the resulting data frame has eight rows in total.

We also get two price columns: “Price_x” and “Price_y“. For each date, we get a price from the left data frame (“Price_x“) and the right data frame (“Price_y“). If there is a “NaN” value, that means, for this specific date, we have only one price value. For example, for the "16.01.2019", we do not get a “Price_y” value because this date is only found in the first data frame.

The “fill_method” parameter

As we saw in the example above, there were some missing values labeled with “NaN“:

DatePrice_xPrice_y
015.01.201916.714.6
116.01.201918.4NaN
217.01.201920.019.8
318.01.201919.321.9
419.01.201917.1NaN
520.01.201921.220.2
621.01.2019NaN17.4
722.01.2019NaN18.0

We can get rid of these “NaN” values by replacing these values with the previous value. We achieve that by applying the “fill_method” parameter and assigning it to “ffill“:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill')
DatePrice_xPrice_y
015.01.201916.714.6
116.01.201918.414.6
217.01.201920.019.8
318.01.201919.321.9
419.01.201917.121.9
520.01.201921.220.2
621.01.201921.217.4
722.01.201921.218.0

Now, we do not have any more missing values here. For example, the “NaN” value in the “Price_y” column for the date "16.01.2019" was replaced with the previous value from that column (“14.6“).

If multiple values are missing directly one after the other, all missing values get replaced by the last available value. For example, the last two values from the “Price_x” column were missing. They were both replaced by the value of the third last row which was "21.2".

The “suffixes” parameter

In the previous example, the two price columns were named “Price_x” and “Price_y” by default. However, we can change these labels by applying the “suffixes” parameter:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', suffixes=['_leftDF', '_rightDF'])
DatePrice_leftDFPrice_rightDF
015.01.201916.714.6
116.01.201918.414.6
217.01.201920.019.8
318.01.201919.321.9
419.01.201917.121.9
520.01.201921.220.2
621.01.201921.217.4
722.01.201921.218.0

We performed the same merge_ordered() operation as before. But this time, we added the “suffixes” parameter and assigned it a list with the strings “_leftDF” and “_rightDF“. The two price columns in the resulting data frame are now called “Price_leftDF” and “Price_rightDF“.

As the name of the parameter suggests, we only change the suffixes here, not the whole label. That’s why the column labels still say “Price” before the suffixes because the initial column label said “Price” and we only added the suffixes after that label.

The different kinds of joins

As mentioned in the introduction, by default the merge_ordered() function performs an outer join. That means we take the union of keys from both data frames.

But we can change that by using the “how” parameter.

Another type of join is the “inner” join which uses the intersection of keys from both data frames:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='inner')
DatePrice_xPrice_y
015.01.201916.714.6
117.01.201920.019.8
218.01.201919.321.9
320.01.201921.220.2

That means, we only get the dates that are found in both data frames.

The remaining two options the “how” parameter provides us with are the “left” join and the “right” join. The left join uses only the keys from the left data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='left')
DatePrice_xPrice_y
015.01.201916.714.6
116.01.201918.414.6
217.01.201920.019.8
318.01.201919.321.9
419.01.201917.121.9
520.01.201921.220.2

Whereas the right join only uses the keys from the right data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='right')
DatePrice_xPrice_y
015.01.201916.714.6
117.01.201920.019.8
218.01.201919.321.9
320.01.201921.220.2
421.01.201921.217.4
522.01.201921.218.0

Grouping by group columns

For this section, we will modify “df1” a little bit:

df1['Category'] = ['A', 'A', 'A', 'B', 'B', 'B']
print(df1)
DatePriceCategory
015.01.201916.7A
116.01.201918.4A
217.01.201920.0A
318.01.201919.3B
419.01.201917.1B
520.01.201921.2B

We added a column called “Category” and assigned the categories “A” or “B” to each row.

df2” remains unchanged:

DatePrice
015.01.201914.6
117.01.201919.8
218.01.201921.9
320.01.201920.2
421.01.201917.4
522.01.201918.0

Now, we apply the “left_by” parameter and assign it the value column label “Category“:

pd.merge_ordered(df1, df2, on='Date', left_by="Category")
DatePrice_xCategoryPrice_y
015.01.201916.7A14.6
116.01.201918.4ANaN
217.01.201920.0A19.8
318.01.2019NaNA21.9
420.01.2019NaNA20.2
521.01.2019NaNA17.4
622.01.2019NaNA18.0
715.01.2019NaNB14.6
817.01.2019NaNB19.8
918.01.201919.3B21.9
1019.01.201917.1BNaN
1120.01.201921.2B20.2
1221.01.2019NaNB17.4
1322.01.2019NaNB18.0

This way, we group the left data frame by the “Category” column and merge that piece by piece with the right data frame.

When we look at the resulting data frame, we can observe that, for example, the “Price_x” entry for the date "18.01.2019" in row 3 is “NaN” although there is an entry for that date in “df1“. However, in “df1“, the date is assigned to the category “B“. So, in the merged data frame, the “Price_x” value for the date "18.01.2019" is found in row 9 with category “B“.

If we had a group column in the right data frame, we could do the same with the “right_by” parameter.

Summary

All in all, we learned how to use the Pandas function merge_ordered(). We saw how to apply the various parameters, how to use the different types of joins, and how to group by group columns.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!