Pandas merge_asof() - A Simple Guide with Video - Be on the Right Side of Change

In this tutorial, we will learn how to apply the merge_asof() function. Described in one sentence, this method performs a merge similar to a left join where we match on near keys instead of equal keys. Thus, the function is especially useful when working with time-series data.

Here are the parameters from the official documentation:

Parameter	Type	Description
`left`	DataFrame or named Series
`right`	DataFrame or named Series
`on`	label	Field name to join on. Must be contained in both DataFrames. Data must be ordered. Must be a numeric column. On or left_on/right_on must be used.
`left_on`	label	Field name to join on in the left DataFrame.
`right_on`	label	Field name to join on in the right DataFrame.
`left_index`	`bool`	Use the index of the left DataFrame as the join key.
`right_index`	`bool`	Use the index of the right DataFrame as the join key.
`by`	column name or list of column names	Match on these columns before performing the merge operation.
`left_by`	column name	Field names to match on in the left DataFrame.
`right_by`	column name	Field names to match on in the right DataFrame.
`suffixes`	2-length sequence: tuple, list, etc.	Suffix to apply to the overlapping column names in the left and right side, respectively.
`tolerance`	`int` or `Timedelta`, optional, default: `None`	Select a tolerance within this range. Must be compatible with the merge index.
`allow_exact_matches`	bool, default `True`	If set to `True`: allow matching with the same ‘`on`’ value (i.e. less-th a n-or-equal-to / greater-than-or-equal-to) If set to `False`: don’t match the same ‘`on`’ value (i.e., strictly less-than / strictly greater-than).
`direction`	`'backward'`, `'forward'`, or `'nearest'`, default: `'backward'`	Whether to search for the prior, subsequent, or closest matches.

Returns	Type
merged	DataFrame

Basic Functionality of merge_asof()

To get started, we will create two data frames “df1” and “df2“:

import pandas as pd
df1 = pd.DataFrame({
        'time':[pd.Timestamp('2021-12-15 12:00:00'), 
                pd.Timestamp('2021-12-15 12:00:01'), 
                pd.Timestamp('2021-12-15 12:00:05'), 
                pd.Timestamp('2021-12-15 12:00:07'), 
                pd.Timestamp('2021-12-15 12:00:09'), 
                pd.Timestamp('2021-12-15 12:00:12')],
        'price': [12, 14, 8, 7, 11, 15]   
    })

print(df1)

Output:

	time	price
0	2021-12-15 12:00:00	12
1	2021-12-15 12:00:01	14
2	2021-12-15 12:00:05	8
3	2021-12-15 12:00:07	7
4	2021-12-15 12:00:09	11
5	2021-12-15 12:00:12	15

Let’s have a look at the second DataFrame:

df2 = pd.DataFrame({
        'time':[pd.Timestamp('2021-12-15 12:00:00'), 
                pd.Timestamp('2021-12-15 12:00:02'), 
                pd.Timestamp('2021-12-15 12:00:04'), 
                pd.Timestamp('2021-12-15 12:00:08'), 
                pd.Timestamp('2021-12-15 12:00:10'), 
                pd.Timestamp('2021-12-15 12:00:11')],
        
        'price': [5, 7, 9, 12, 8, 12]   
    })

print(df2)

Output:

	time	price
0	2021-12-15 12:00:00	5
1	2021-12-15 12:00:02	7
2	2021-12-15 12:00:04	9
3	2021-12-15 12:00:08	12
4	2021-12-15 12:00:10	8
5	2021-12-15 12:00:11	12

Both data frames contain a “time” column and a “price” column respectively. However, the prices and timestamps from both data frames differ from each other.

Now that we created the data frames, we are ready to do our first merge_asof() operation:

pd.merge_asof(df1, df2, on='time')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	9
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

We put three arguments inside the merge_asof() function. The first two arguments are the two data frames that we want to merge, “df1” and “df2“. The third argument is the “on” parameter which expects the label of the column that we want to merge on. We set this parameter equal to “time“, thus we want to merge on the “time” column.

The resulting data frame has two price columns “price_x” and “price_y“. The “time” column here contains the same timestamps as “df1“. That’s because we set this data frame as the first argument and thus the left data frame. And since the asof merge is similar to a left join, we get the values from the left data frame.

When we take a look at the new price columns, we observe that the “price_x” values equal the price values from “df1“. That’s also the case because “df1” is the left data frame.

The interesting column here is the “price_y” column. The “price_y” value in the first row equals the price value from “df2” in that same row. That’s because the first timestamps from “df1” and “df2” match (they are both "2021-12-15 12:00:00").

However, the timestamps in the second row from both data frames differ from each other ("df1": "2021-12-15 12:00:01", "df2": "2021-12-15 12:00:02").

The “price_y” value in the second row in the resulting data frame is 5 and thus unequal to the price value in “df2” in the same row.

By default, the merge_asof() function performs a backward search. Thus, it takes the price assigned to the backward nearest timestamp from “df2” ("2021-12-15 12:00:00") which is 5 in this case:

	time	price
0	2021-12-15 12:00:00	5

Similarly, the “price_y” value in the third row is 9 because the timestamps from “df1” and “df2” in that row don’t match and the function looks backward for the next value.

Since the timestamp for the third row in the resulting data frame is "2021-12-15 12:00:05", it looks for the next value which is backward the nearest to this timestamp in “df2“. The backward nearest timestamp from “df2” is "2021-12-15 12:00:04".

Thus, the function takes this row’s price value:

	time	price
2	2021-12-15 12:00:04	9

Parameter “direction”

As mentioned in the previous section, the merge_asof() operation performs a backward search by default because the value for the direction parameter is automatically set to "backward" if not specified otherwise.

The other two options for the direction parameter are "forward" and "nearest". We will start with "forward". As the name suggests, this is the opposite of the backward search, so, we are looking for subsequent matches instead of prior ones.

We will perform the same merge_asof() operation as before, but this time, with a forward search:

pd.merge_asof(df1, df2, on='time', direction='forward')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5.0
1	2021-12-15 12:00:01	14	7.0
2	2021-12-15 12:00:05	8	12.0
3	2021-12-15 12:00:07	7	12.0
4	2021-12-15 12:00:09	11	8.0
5	2021-12-15 12:00:12	15	NaN

And we compare it to the initial merge_asof() operation with a default backward search:

pd.merge_asof(df1, df2, on='time')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	9
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

The “time” and “price_x” columns remain unchanged. However, some values in the “price_y” column are different now.

For example, the “price_y” value in the second row is now 7 instead of 5. The timestamp "2021-12-15 12:00:01" does not exist in “df2“, so the function now looks for the next timestamp in “df2” instead of the previous one.

The next timestamp is "2021-12-15 12:00:02", so the function takes this row’s price value which is 7.

The last option for the “direction” parameter is "nearest":

pd.merge_asof(df1, df2, on='time', direction='nearest')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	12
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

The only value that changed here compared to the backward search is the “price_y” value in the fourth row which is now 12 instead of 9.

That’s because the "nearest" search looks for the closest match. The timestamp "2021-12-15 12:00:07" does not exist in “df2“, so the function looks for the timestamp that is closest to "2021-12-15 12:00:07". And that is timestamp "2021-12-15 12:00:08".

So, the function takes this row’s price value which is 12.

Allowing Exact Matches

It might be that we do not want to include exact matches in our merges, for example, if we only want to get values from unique timestamps. Therefore, we apply the “allow_exact_matches” parameter. This parameter expects a boolean value and is set to “True” by default.

Again, we perform the initial merge_asof() operation but this time with the “allow_exact_matches” parameter set to “False“:

pd.merge_asof(df1, df2, on='time', allow_exact_matches=False)

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	NaN
1	2021-12-15 12:00:01	14	5.0
2	2021-12-15 12:00:05	8	9.0
3	2021-12-15 12:00:07	7	9.0
4	2021-12-15 12:00:09	11	12.0
5	2021-12-15 12:00:12	15	12.0

And we compare it to the initial merge_of() operation:

pd.merge_asof(df1, df2, on='time')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	9
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

The only value that changed is the first “price_y” value which is “NaN” instead of 5. That’s because the timestamps in the first row of “df1” and in the one in the first row of “df2” match. And since we do not allow exact matches here, the resulting value is “NaN“.

Selecting the Tolerance

The merge_asof() function provides the “tolerance” parameter. Using this parameter, we can determine how much tolerance we want to allow between our timestamps:

pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1s'))

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5.0
1	2021-12-15 12:00:01	14	5.0
2	2021-12-15 12:00:05	8	9.0
3	2021-12-15 12:00:07	7	NaN
4	2021-12-15 12:00:09	11	12.0
5	2021-12-15 12:00:12	15	12.0

The initial merge_asof():

pd.merge_asof(df1, df2, on='time')

Output:

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	9
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

The only difference is that we apply the “tolerance” parameter and set it equal to a Timedelta of one second.

In the fourth row at the timestamp "2021-12-15 12:00:07" we find a “NaN” value which was 9 in the initial merge_asof() operation. The reason behind that is that the timestamp "2021-12-15 12:00:07" does not exist in “df2“.

So, the merge_asof() looks for the previous timestamp in “df2“. However, the previous timestamp is "2021-12-15 12:00:04" which lies not within the tolerance of one second. Thus, the price value from that row is not used. So, we get a “NaN” value in the resulting data frame.

Matching by a Specific Column

For this section, we will modify “df1” and “df2” a little bit:

df1['category'] = ['A', 'A', 'A', 'A', 'B', 'B']
print(df1)

Output:

	time	price	category
0	2021-12-15 12:00:00	12	A
1	2021-12-15 12:00:01	14	A
2	2021-12-15 12:00:05	8	A
3	2021-12-15 12:00:07	7	A
4	2021-12-15 12:00:09	11	B
5	2021-12-15 12:00:12	15	B

Also:

df2['category'] = ['A', 'A', 'B', 'B', 'B', 'B']
print(df2)

Output:

	time	price	category
0	2021-12-15 12:00:00	5	A
1	2021-12-15 12:00:02	7	A
2	2021-12-15 12:00:04	9	B
3	2021-12-15 12:00:08	12	B
4	2021-12-15 12:00:10	8	B
5	2021-12-15 12:00:11	12	B

We assigned both data frames a “category” column and added the categories “A” and “B”. Notice that the categories are different in “df1” and “df2“.

Now, we perform a merge_asof() operation again and we add the “by” parameter.

pd.merge_asof(df1, df2, on='time', by='category')

Result:

	time	price_x	category	price_y
0	2021-12-15 12:00:00	12	A	5
1	2021-12-15 12:00:01	14	A	5
2	2021-12-15 12:00:05	8	A	7
3	2021-12-15 12:00:07	7	A	7
4	2021-12-15 12:00:09	11	B	12
5	2021-12-15 12:00:12	15	B	12

We merge the data frames on the “time” column and by the “category” column.

This was the initial merge_asof() operation:

pd.merge_asof(df1, df2, on='time')

	time	price_x	price_y
0	2021-12-15 12:00:00	12	5
1	2021-12-15 12:00:01	14	5
2	2021-12-15 12:00:05	8	9
3	2021-12-15 12:00:07	7	9
4	2021-12-15 12:00:09	11	12
5	2021-12-15 12:00:12	15	12

As we can see, some “price_y” values have changed again.

For example, the value in the third row is now 7 instead of 9. The category in the third row in the resulting data frame is “A“. And since the timestamp "2021-12-15 12:00:05" from the third row in the resulting data frame does not exist in “df2“, the function looks backward for the next timestamp. The next timestamp backward would be "2021-12-15 12:00:04" and the assigned price value for this timestamp in “df2” is 9. But the category for this timestamp is “B“.

However, the function looks for the next timestamp from the same category which is “A“. And this timestamp is "2021-12-15 12:00:02" with a price value of 7. Thus, the “price_y” value in the merged data frame in the third row is 7 and not 9.

Summary

All in all, we studied the Pandas function merge_asof(). We learned the basic functionality of this function, how to search in different directions, whether to allow exact matches or not, how to specify a tolerance, and how to perform merge_asof() by specific columns.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!