In this tutorial, we will learn how to apply the merge_asof()
function. Described in one sentence, this method performs a merge similar to a left join where we match on near keys instead of equal keys. Thus, the function is especially useful when working with time-series data.
Here are the parameters from the official documentation:
Parameter | Type | Description |
left | DataFrame or named Series | |
right | DataFrame or named Series | |
on | label | Field name to join on. Must be contained in both DataFrames. Data must be ordered. Must be a numeric column. On or left_on/right_on must be used. |
left_on | label | Field name to join on in the left DataFrame. |
right_on | label | Field name to join on in the right DataFrame. |
left_index | bool | Use the index of the left DataFrame as the join key. |
right_index | bool | Use the index of the right DataFrame as the join key. |
by | column name or list of column names | Match on these columns before performing the merge operation. |
left_by | column name | Field names to match on in the left DataFrame. |
right_by | column name | Field names to match on in the right DataFrame. |
suffixes | 2-length sequence: tuple, list, etc. | Suffix to apply to the overlapping column names in the left and right side, respectively. |
tolerance | int or Timedelta ,optional, default: None | Select a tolerance within this range. Must be compatible with the merge index. |
allow_exact_matches | bool, default True | If set to True : allow matching with the same βon β value (i.e. less-than-or-equal-to / greater-than-or-equal-to)If set to False : donβt match the same βon β value (i.e., strictly less-than / strictly greater-than). |
direction | 'backward' , 'forward' , or'nearest' , default: 'backward' | Whether to search for the prior, subsequent, or closest matches. |
Returns | Type | |
merged | DataFrame |
Basic Functionality of merge_asof()
To get started, we will create two data frames “df1
” and “df2
“:
import pandas as pd df1 = pd.DataFrame({ 'time':[pd.Timestamp('2021-12-15 12:00:00'), pd.Timestamp('2021-12-15 12:00:01'), pd.Timestamp('2021-12-15 12:00:05'), pd.Timestamp('2021-12-15 12:00:07'), pd.Timestamp('2021-12-15 12:00:09'), pd.Timestamp('2021-12-15 12:00:12')], 'price': [12, 14, 8, 7, 11, 15] }) print(df1)
Output:
time | price | |
0 | 2021-12-15 12:00:00 | 12 |
1 | 2021-12-15 12:00:01 | 14 |
2 | 2021-12-15 12:00:05 | 8 |
3 | 2021-12-15 12:00:07 | 7 |
4 | 2021-12-15 12:00:09 | 11 |
5 | 2021-12-15 12:00:12 | 15 |
Let’s have a look at the second DataFrame:
df2 = pd.DataFrame({ 'time':[pd.Timestamp('2021-12-15 12:00:00'), pd.Timestamp('2021-12-15 12:00:02'), pd.Timestamp('2021-12-15 12:00:04'), pd.Timestamp('2021-12-15 12:00:08'), pd.Timestamp('2021-12-15 12:00:10'), pd.Timestamp('2021-12-15 12:00:11')], 'price': [5, 7, 9, 12, 8, 12] }) print(df2)
Output:
time | price | |
0 | 2021-12-15 12:00:00 | 5 |
1 | 2021-12-15 12:00:02 | 7 |
2 | 2021-12-15 12:00:04 | 9 |
3 | 2021-12-15 12:00:08 | 12 |
4 | 2021-12-15 12:00:10 | 8 |
5 | 2021-12-15 12:00:11 | 12 |
Both data frames contain a “time
” column and a “price
” column respectively. However, the prices and timestamps from both data frames differ from each other.
Now that we created the data frames, we are ready to do our first merge_asof()
operation:
pd.merge_asof(df1, df2, on='time')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 9 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
We put three arguments inside the merge_asof()
function. The first two arguments are the two data frames that we want to merge, “df1
” and “df2
“. The third argument is the “on
” parameter which expects the label of the column that we want to merge on. We set this parameter equal to “time
“, thus we want to merge on the “time
” column.
The resulting data frame has two price columns “price_x
” and “price_y
“. The “time
” column here contains the same timestamps as “df1
“. That’s because we set this data frame as the first argument and thus the left data frame. And since the asof merge is similar to a left join, we get the values from the left data frame.
When we take a look at the new price columns, we observe that the “price_x
” values equal the price values from “df1
“. That’s also the case because “df1
” is the left data frame.
The interesting column here is the “price_y
” column. The “price_y
” value in the first row equals the price value from “df2
” in that same row. That’s because the first timestamps from “df1
” and “df2
” match (they are both "2021-12-15 12:00:00"
).
However, the timestamps in the second row from both data frames differ from each other ("df1": "2021-12-15 12:00:01", "df2": "2021-12-15 12:00:02"
).
The “price_y
” value in the second row in the resulting data frame is 5
and thus unequal to the price value in “df2
” in the same row.
By default, the merge_asof()
function performs a backward search. Thus, it takes the price assigned to the backward nearest timestamp from “df2
” ("2021-12-15 12:00:00"
) which is 5
in this case:
time | price | |
0 | 2021-12-15 12:00:00 | 5 |
Similarly, the “price_y
” value in the third row is 9
because the timestamps from “df1
” and “df2
” in that row don’t match and the function looks backward for the next value.
Since the timestamp for the third row in the resulting data frame is "2021-12-15 12:00:05"
, it looks for the next value which is backward the nearest to this timestamp in “df2
“. The backward nearest timestamp from “df2
” is "2021-12-15 12:00:04"
.
Thus, the function takes this row’s price value:
time | price | |
2 | 2021-12-15 12:00:04 | 9 |
Parameter “direction”
As mentioned in the previous section, the merge_asof()
operation performs a backward search by default because the value for the direction
parameter is automatically set to "backward"
if not specified otherwise.
The other two options for the direction
parameter are "forward"
and "nearest"
. We will start with "forward"
. As the name suggests, this is the opposite of the backward search, so, we are looking for subsequent matches instead of prior ones.
We will perform the same merge_asof()
operation as before, but this time, with a forward search:
pd.merge_asof(df1, df2, on='time', direction='forward')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5.0 |
1 | 2021-12-15 12:00:01 | 14 | 7.0 |
2 | 2021-12-15 12:00:05 | 8 | 12.0 |
3 | 2021-12-15 12:00:07 | 7 | 12.0 |
4 | 2021-12-15 12:00:09 | 11 | 8.0 |
5 | 2021-12-15 12:00:12 | 15 | NaN |
And we compare it to the initial merge_asof()
operation with a default backward search:
pd.merge_asof(df1, df2, on='time')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 9 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
The “time
” and “price_x
” columns remain unchanged. However, some values in the “price_y
” column are different now.
For example, the “price_y
” value in the second row is now 7 instead of 5. The timestamp "2021-12-15 12:00:01"
does not exist in “df2
“, so the function now looks for the next timestamp in “df2
” instead of the previous one.
The next timestamp is "2021-12-15 12:00:02"
, so the function takes this row’s price value which is 7.
The last option for the “direction
” parameter is "nearest"
:
pd.merge_asof(df1, df2, on='time', direction='nearest')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 12 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
The only value that changed here compared to the backward search is the “price_y
” value in the fourth row which is now 12 instead of 9.
That’s because the "nearest"
search looks for the closest match. The timestamp "2021-12-15 12:00:07"
does not exist in “df2
“, so the function looks for the timestamp that is closest to "2021-12-15 12:00:07"
. And that is timestamp "2021-12-15 12:00:08"
.
So, the function takes this row’s price value which is 12.
Allowing Exact Matches
It might be that we do not want to include exact matches in our merges, for example, if we only want to get values from unique timestamps. Therefore, we apply the “allow_exact_matches
” parameter. This parameter expects a boolean value and is set to “True
” by default.
Again, we perform the initial merge_asof()
operation but this time with the “allow_exact_matches
” parameter set to “False
“:
pd.merge_asof(df1, df2, on='time', allow_exact_matches=False)
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | NaN |
1 | 2021-12-15 12:00:01 | 14 | 5.0 |
2 | 2021-12-15 12:00:05 | 8 | 9.0 |
3 | 2021-12-15 12:00:07 | 7 | 9.0 |
4 | 2021-12-15 12:00:09 | 11 | 12.0 |
5 | 2021-12-15 12:00:12 | 15 | 12.0 |
And we compare it to the initial merge_of()
operation:
pd.merge_asof(df1, df2, on='time')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 9 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
The only value that changed is the first “price_y
” value which is “NaN
” instead of 5. That’s because the timestamps in the first row of “df1
” and in the one in the first row of “df2
” match. And since we do not allow exact matches here, the resulting value is “NaN
“.
Selecting the Tolerance
The merge_asof()
function provides the “tolerance
” parameter. Using this parameter, we can determine how much tolerance we want to allow between our timestamps:
pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1s'))
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5.0 |
1 | 2021-12-15 12:00:01 | 14 | 5.0 |
2 | 2021-12-15 12:00:05 | 8 | 9.0 |
3 | 2021-12-15 12:00:07 | 7 | NaN |
4 | 2021-12-15 12:00:09 | 11 | 12.0 |
5 | 2021-12-15 12:00:12 | 15 | 12.0 |
The initial merge_asof()
:
pd.merge_asof(df1, df2, on='time')
Output:
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 9 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
The only difference is that we apply the “tolerance
” parameter and set it equal to a Timedelta
of one second.
In the fourth row at the timestamp "2021-12-15 12:00:07"
we find a “NaN
” value which was 9 in the initial merge_asof()
operation. The reason behind that is that the timestamp "2021-12-15 12:00:07"
does not exist in “df2
“.
So, the merge_asof()
looks for the previous timestamp in “df2
“. However, the previous timestamp is "2021-12-15 12:00:04"
which lies not within the tolerance of one second. Thus, the price value from that row is not used. So, we get a “NaN
” value in the resulting data frame.
Matching by a Specific Column
For this section, we will modify “df1
” and “df2
” a little bit:
df1['category'] = ['A', 'A', 'A', 'A', 'B', 'B'] print(df1)
Output:
time | price | category | |
0 | 2021-12-15 12:00:00 | 12 | A |
1 | 2021-12-15 12:00:01 | 14 | A |
2 | 2021-12-15 12:00:05 | 8 | A |
3 | 2021-12-15 12:00:07 | 7 | A |
4 | 2021-12-15 12:00:09 | 11 | B |
5 | 2021-12-15 12:00:12 | 15 | B |
Also:
df2['category'] = ['A', 'A', 'B', 'B', 'B', 'B'] print(df2)
Output:
time | price | category | |
0 | 2021-12-15 12:00:00 | 5 | A |
1 | 2021-12-15 12:00:02 | 7 | A |
2 | 2021-12-15 12:00:04 | 9 | B |
3 | 2021-12-15 12:00:08 | 12 | B |
4 | 2021-12-15 12:00:10 | 8 | B |
5 | 2021-12-15 12:00:11 | 12 | B |
We assigned both data frames a “category
” column and added the categories “A” and “B”. Notice that the categories are different in “df1
” and “df2
“.
Now, we perform a merge_asof()
operation again and we add the “by
” parameter.
pd.merge_asof(df1, df2, on='time', by='category')
Result:
time | price_x | category | price_y | |
0 | 2021-12-15 12:00:00 | 12 | A | 5 |
1 | 2021-12-15 12:00:01 | 14 | A | 5 |
2 | 2021-12-15 12:00:05 | 8 | A | 7 |
3 | 2021-12-15 12:00:07 | 7 | A | 7 |
4 | 2021-12-15 12:00:09 | 11 | B | 12 |
5 | 2021-12-15 12:00:12 | 15 | B | 12 |
We merge the data frames on the “time
” column and by the “category
” column.
This was the initial merge_as
of() operation:
pd.merge_asof(df1, df2, on='time')
time | price_x | price_y | |
0 | 2021-12-15 12:00:00 | 12 | 5 |
1 | 2021-12-15 12:00:01 | 14 | 5 |
2 | 2021-12-15 12:00:05 | 8 | 9 |
3 | 2021-12-15 12:00:07 | 7 | 9 |
4 | 2021-12-15 12:00:09 | 11 | 12 |
5 | 2021-12-15 12:00:12 | 15 | 12 |
As we can see, some “price_y
” values have changed again.
For example, the value in the third row is now 7 instead of 9. The category in the third row in the resulting data frame is “A
“. Β And since the timestamp "2021-12-15 12:00:05"
from the third row in the resulting data frame does not exist in “df2
“, the function looks backward for the next timestamp. The next timestamp backward would be "2021-12-15 12:00:04"
and the assigned price value for this timestamp in “df2
” is 9. But the category for this timestamp is “B
“.
However, the function looks for the next timestamp from the same category which is “A
“. And this timestamp is "2021-12-15 12:00:02"
with a price value of 7. Thus, the “price_y
” value in the merged data frame in the third row is 7 and not 9.
Summary
All in all, we studied the Pandas function merge_asof()
. We learned the basic functionality of this function, how to search in different directions, whether to allow exact matches or not, how to specify a tolerance, and how to perform merge_asof()
by specific columns.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!