Pandas merge_asof() – A Simple Guide with Video

5/5 - (1 vote)

In this tutorial, we will learn how to apply the merge_asof() function. Described in one sentence, this method performs a merge similar to a left join where we match on near keys instead of equal keys. Thus, the function is especially useful when working with time-series data.

Here are the parameters from the official documentation:

ParameterTypeDescription
leftDataFrame or named Series
rightDataFrame or named Series
onlabelField name to join on. Must be contained in both
DataFrames. Data must be ordered. Must be a numeric column. On or left_on/right_on must be used.
left_onlabelField name to join on in the left DataFrame.
right_onlabelField name to join on in the right DataFrame.
left_indexboolUse the index of the left DataFrame as the join key.
right_indexboolUse the index of the right DataFrame as the join key.
bycolumn name or list of column
names
Match on these columns before performing the
merge operation.
left_bycolumn nameField names to match on in the left DataFrame.
right_bycolumn nameField names to match on in the right DataFrame.
suffixes2-length sequence: tuple, list, etc.Suffix to apply to the overlapping column names in the left and right side, respectively.
toleranceint or Timedelta,
optional, default: None
Select a tolerance within this range. Must be compatible with the merge index.
allow_exact_matchesbool, default TrueIf set to True: allow matching with the same β€˜on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)
If set to False: don’t match the same β€˜on’ value (i.e., strictly less-than / strictly greater-than).
direction'backward', 'forward', or
'nearest', default: 'backward'
Whether to search for the prior, subsequent, or
closest matches.
ReturnsType
mergedDataFrame

Basic Functionality of merge_asof()

To get started, we will create two data frames “df1” and “df2“:

import pandas as pd
df1 = pd.DataFrame({
        'time':[pd.Timestamp('2021-12-15 12:00:00'), 
                pd.Timestamp('2021-12-15 12:00:01'), 
                pd.Timestamp('2021-12-15 12:00:05'), 
                pd.Timestamp('2021-12-15 12:00:07'), 
                pd.Timestamp('2021-12-15 12:00:09'), 
                pd.Timestamp('2021-12-15 12:00:12')],
        'price': [12, 14, 8, 7, 11, 15]   
    })

print(df1)

Output:

timeprice
02021-12-15 12:00:0012
12021-12-15 12:00:0114
22021-12-15 12:00:058
32021-12-15 12:00:077
42021-12-15 12:00:0911
52021-12-15 12:00:1215

Let’s have a look at the second DataFrame:

df2 = pd.DataFrame({
        'time':[pd.Timestamp('2021-12-15 12:00:00'), 
                pd.Timestamp('2021-12-15 12:00:02'), 
                pd.Timestamp('2021-12-15 12:00:04'), 
                pd.Timestamp('2021-12-15 12:00:08'), 
                pd.Timestamp('2021-12-15 12:00:10'), 
                pd.Timestamp('2021-12-15 12:00:11')],
        
        'price': [5, 7, 9, 12, 8, 12]   
    })

print(df2)

Output:

timeprice
02021-12-15 12:00:005
12021-12-15 12:00:027
22021-12-15 12:00:049
32021-12-15 12:00:0812
42021-12-15 12:00:108
52021-12-15 12:00:1112

Both data frames contain a “time” column and a “price” column respectively. However, the prices and timestamps from both data frames differ from each other.

Now that we created the data frames, we are ready to do our first merge_asof() operation:

pd.merge_asof(df1, df2, on='time')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:0779
42021-12-15 12:00:091112
52021-12-15 12:00:121512
    

We put three arguments inside the merge_asof() function. The first two arguments are the two data frames that we want to merge, “df1” and “df2“. The third argument is the “on” parameter which expects the label of the column that we want to merge on. We set this parameter equal to “time“, thus we want to merge on the “time” column.

The resulting data frame has two price columns “price_x” and “price_y“. The “time” column here contains the same timestamps as “df1“. That’s because we set this data frame as the first argument and thus the left data frame. And since the asof merge is similar to a left join, we get the values from the left data frame.

When we take a look at the new price columns, we observe that the “price_x” values equal the price values from “df1“. That’s also the case because “df1” is the left data frame.

The interesting column here is the “price_y” column. The “price_y” value in the first row equals the price value from “df2” in that same row. That’s because the first timestamps from “df1” and “df2” match (they are both "2021-12-15 12:00:00").

However, the timestamps in the second row from both data frames differ from each other ("df1": "2021-12-15 12:00:01", "df2": "2021-12-15 12:00:02").

The “price_y” value in the second row in the resulting data frame is 5 and thus unequal to the price value in “df2” in the same row.

By default, the merge_asof() function performs a backward search. Thus, it takes the price assigned to the backward nearest timestamp from “df2” ("2021-12-15 12:00:00") which is 5 in this case:

timeprice
02021-12-15 12:00:005

Similarly, the “price_y” value in the third row is 9 because the timestamps from “df1” and “df2” in that row don’t match and the function looks backward for the next value.

Since the timestamp for the third row in the resulting data frame is "2021-12-15 12:00:05", it looks for the next value which is backward the nearest to this timestamp in “df2“. The backward nearest timestamp from “df2” is "2021-12-15 12:00:04".

Thus, the function takes this row’s price value:

timeprice
22021-12-15 12:00:049

Parameter “direction”

As mentioned in the previous section, the merge_asof() operation performs a backward search by default because the value for the direction parameter is automatically set to "backward" if not specified otherwise.

The other two options for the direction parameter are "forward" and "nearest". We will start with "forward". As the name suggests, this is the opposite of the backward search, so, we are looking for subsequent matches instead of prior ones.

We will perform the same merge_asof() operation as before, but this time, with a forward search:

pd.merge_asof(df1, df2, on='time', direction='forward')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125.0
12021-12-15 12:00:01147.0
22021-12-15 12:00:05812.0
32021-12-15 12:00:07712.0
42021-12-15 12:00:09118.0
52021-12-15 12:00:1215NaN

And we compare it to the initial merge_asof() operation with a default backward search:

pd.merge_asof(df1, df2, on='time')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:0779
42021-12-15 12:00:091112
52021-12-15 12:00:121512

The “time” and “price_x” columns remain unchanged. However, some values in the “price_y” column are different now.

For example, the “price_y” value in the second row is now 7 instead of 5. The timestamp "2021-12-15 12:00:01" does not exist in “df2“, so the function now looks for the next timestamp in “df2” instead of the previous one.

The next timestamp is "2021-12-15 12:00:02", so the function takes this row’s price value which is 7.

The last option for the “direction” parameter is "nearest":

pd.merge_asof(df1, df2, on='time', direction='nearest')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:07712
42021-12-15 12:00:091112
52021-12-15 12:00:121512

The only value that changed here compared to the backward search is the “price_y” value in the fourth row which is now 12 instead of 9.

That’s because the "nearest" search looks for the closest match. The timestamp "2021-12-15 12:00:07" does not exist in “df2“, so the function looks for the timestamp that is closest to "2021-12-15 12:00:07". And that is timestamp "2021-12-15 12:00:08".

So, the function takes this row’s price value which is 12.

Allowing Exact Matches

It might be that we do not want to include exact matches in our merges, for example, if we only want to get values from unique timestamps. Therefore, we apply the “allow_exact_matches” parameter. This parameter expects a boolean value and is set to “True” by default.

Again, we perform the initial merge_asof() operation but this time with the “allow_exact_matches” parameter set to “False“:

pd.merge_asof(df1, df2, on='time', allow_exact_matches=False)

Output:

timeprice_xprice_y
02021-12-15 12:00:0012NaN
12021-12-15 12:00:01145.0
22021-12-15 12:00:0589.0
32021-12-15 12:00:0779.0
42021-12-15 12:00:091112.0
52021-12-15 12:00:121512.0

And we compare it to the initial merge_of() operation:

pd.merge_asof(df1, df2, on='time')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:0779
42021-12-15 12:00:091112
52021-12-15 12:00:121512

The only value that changed is the first “price_y” value which is “NaN” instead of 5. That’s because the timestamps in the first row of “df1” and in the one in the first row of “df2” match. And since we do not allow exact matches here, the resulting value is “NaN“.

Selecting the Tolerance

The merge_asof() function provides the “tolerance” parameter. Using this parameter, we can determine how much tolerance we want to allow between our timestamps:

pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1s'))

Output:

timeprice_xprice_y
02021-12-15 12:00:00125.0
12021-12-15 12:00:01145.0
22021-12-15 12:00:0589.0
32021-12-15 12:00:077NaN
42021-12-15 12:00:091112.0
52021-12-15 12:00:121512.0

The initial merge_asof():

pd.merge_asof(df1, df2, on='time')

Output:

timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:0779
42021-12-15 12:00:091112
52021-12-15 12:00:121512

The only difference is that we apply the “tolerance” parameter and set it equal to a Timedelta of one second.

In the fourth row at the timestamp "2021-12-15 12:00:07" we find a “NaN” value which was 9 in the initial merge_asof() operation. The reason behind that is that the timestamp "2021-12-15 12:00:07" does not exist in “df2“.

So, the merge_asof() looks for the previous timestamp in “df2“. However, the previous timestamp is "2021-12-15 12:00:04" which lies not within the tolerance of one second. Thus, the price value from that row is not used. So, we get a “NaN” value in the resulting data frame.

Matching by a Specific Column

For this section, we will modify “df1” and “df2” a little bit:

df1['category'] = ['A', 'A', 'A', 'A', 'B', 'B']
print(df1)

Output:

timepricecategory
02021-12-15 12:00:0012A
12021-12-15 12:00:0114A
22021-12-15 12:00:058A
32021-12-15 12:00:077A
42021-12-15 12:00:0911B
52021-12-15 12:00:1215B

Also:

df2['category'] = ['A', 'A', 'B', 'B', 'B', 'B']
print(df2)

Output:

timepricecategory
02021-12-15 12:00:005A
12021-12-15 12:00:027A
22021-12-15 12:00:049B
32021-12-15 12:00:0812B
42021-12-15 12:00:108B
52021-12-15 12:00:1112B

We assigned both data frames a “category” column and added the categories “A” and “B”. Notice that the categories are different in “df1” and “df2“.

Now, we perform a merge_asof() operation again and we add the “by” parameter.

pd.merge_asof(df1, df2, on='time', by='category')

Result:

timeprice_xcategoryprice_y
02021-12-15 12:00:0012A5
12021-12-15 12:00:0114A5
22021-12-15 12:00:058A7
32021-12-15 12:00:077A7
42021-12-15 12:00:0911B12
52021-12-15 12:00:1215B12

We merge the data frames on the “time” column and by the “category” column.

This was the initial merge_asof() operation:

pd.merge_asof(df1, df2, on='time')
timeprice_xprice_y
02021-12-15 12:00:00125
12021-12-15 12:00:01145
22021-12-15 12:00:0589
32021-12-15 12:00:0779
42021-12-15 12:00:091112
52021-12-15 12:00:121512

As we can see, some “price_y” values have changed again.

For example, the value in the third row is now 7 instead of 9. The category in the third row in the resulting data frame is “A“. Β And since the timestamp "2021-12-15 12:00:05" from the third row in the resulting data frame does not exist in “df2“, the function looks backward for the next timestamp. The next timestamp backward would be "2021-12-15 12:00:04" and the assigned price value for this timestamp in “df2” is 9. But the category for this timestamp is “B“.

However, the function looks for the next timestamp from the same category which is “A“. And this timestamp is "2021-12-15 12:00:02" with a price value of 7. Thus, the “price_y” value in the merged data frame in the third row is 7 and not 9.

Summary

All in all, we studied the Pandas function merge_asof(). We learned the basic functionality of this function, how to search in different directions, whether to allow exact matches or not, how to specify a tolerance, and how to perform merge_asof() by specific columns.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!