Working with date-time in Pandas

In this article, we will see how to work with date-time in Pandas. We will learn how to convert strings into date-time objects, how to create date ranges in various ways, how to work with absolute time units, and how to restructure our date values using several Pandas functions.

Why work with date-time?

Before we get to the actual coding, let’s first clarify why it is important to be able to work with date-time when using Pandas.

Whether it’s about stocks, weather, or the price development of Bitcoin. You name it. In all sorts of datasets, we find date-time values. Which day did the stock reach its highest value? At what time of the day blew the wind in Canada the strongest? The list goes on and on.

That’s why it is critical to be capable to handle date-time values properly. Luckily for us, Pandas provides us with a wide range of functions to deal with date-time values which we will get to know in the following sections.

Convert an argument into a Pandas date-time object

Now that we have seen why it is important to work with date-time values, we will get to the actual coding in this section. To start things off, let’s have a look at a simple data frame example:

import pandas as pd
df = pd.DataFrame({'Date': ["2021-04-01", "2021-04-15", "2021-05-01", "2021-05-15"], 'Num': range(4)})

This is how the DataFrame df looks:

DateNum
02021-04-010
12021-04-151
22021-05-012
32021-05-153

We import the Pandas library first and create a Pandas data frame afterward. The data frame contains a “Date” column with four individual dates. Also, we get a “Num” column that represents numbers from 0 to 3. This is a simplified example of real-world data where we have date values and some numbers assigned to each date.

Let’s check the data types of the columns:

>>> df.dtypes
Date    object
Num      int64
dtype: object

As we can see, the “Date” column is an “object” data type. The “object” data type refers to strings and mixed data types, so Pandas does not recognize the “Date” column as a date-time data type. Luckily, Pandas provides us with the to_datetime() function:

df['Date'] = pd.to_datetime(df['Date'])

This line of code sets the “Date” column’s data type to the date-time data type.

We can approve this by checking the data frame’s data types once again:

>>> df.dtypes
Date    datetime64[ns]
Num              int64
dtype: object

As we can see, we successfully set the “Date” column’s data type to “datetime64[ns]”.

To work with this data frame conveniently, we can set the “Date” column as the index of the data frame. We achieve this by applying the set_index() method:

df = df.set_index('Date')

The DataFrame:

Num
Date
2021-04-010
2021-04-151
2021-05-012
2021-05-153

Now, we can perform all sorts of calculations on this dataset. We will just do one example to show you what’s possible:

df['Num'].loc['2021-05'].sum()
# 5

Here, we calculate the sum of the “Num” column in the month of May. Since the “Date” column is the index of the data frame, we can apply the loc[] function here.

Working with date ranges

In the data frame above, we set the unique dates one by one which was not a big problem because it was a small example with four rows only. When we want to create greater ranges of dates, typing them all in manually can be annoying and is not efficient at all. Therefore, we apply the date_range() function which provides us with various ways to create ranges of dates:

>>> pd.date_range(start='01-01-2021', end='31-01-2021')
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
               '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
               '2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
               '2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
               '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
               '2021-01-29', '2021-01-30', '2021-01-31'],
              dtype='datetime64[ns]', freq='D')

In this example, we use the date_range() function with the two parameters “start” and “end” where we set a starting date and an ending date respectively. The output is a DatetimeIndex ranging from 01-01-2021 to 31-01-2021. We also get the data type which is “datetime64[ns]” again and the frequency which is “D” and stands for “days”. This is a lot easier than typing it all in manually.

Instead of an end date, we can also define how many dates we want by applying the “periods” parameter:

>>> pd.date_range(start='01-01-2021', periods=10)
DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10'],
              dtype='datetime64[ns]', freq='D')

We state the same starting date as before. The “periods” parameter is set to “10”, so we get 10 dates in the outputted DatetimeIndex.

Additionally, we can change the frequency. For example, we might not want to have a day frequency but a monthly one. This is where the “freq” parameter comes into play:

>>> pd.date_range(start='01-01-2021', periods=10, freq='M')
DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31'],
              dtype='datetime64[ns]', freq='M')

We assign the “freq” parameter the value “M” which stands for “month”.

We could also change that to hours, so we get an hourly frequency:

>>> pd.date_range(start='01-01-2021', periods=10, freq='H')
DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 01:00:00',
               '2021-01-01 02:00:00', '2021-01-01 03:00:00',
               '2021-01-01 04:00:00', '2021-01-01 05:00:00',
               '2021-01-01 06:00:00', '2021-01-01 07:00:00',
               '2021-01-01 08:00:00', '2021-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

The term “H” assigned to the “freq” parameter means “hour”.

Especially when working with stock data or financial data in general, it is often useful to have datasets where only business days (Monday to Friday) play a role. On that account, Pandas provides us with a variation of the date_range() function, namely the bdate_range() function which provides date ranges with business days only:

>>> pd.bdate_range(start='01-01-2021', end='15-01-2021')
DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-05', '2021-01-06',
               '2021-01-07', '2021-01-08', '2021-01-11', '2021-01-12',
               '2021-01-13', '2021-01-14', '2021-01-15'],
              dtype='datetime64[ns]', freq='B')

We create a date range again starting from 2021-01-01 to 2021-01-15. Note that some days are missing (for example 2021-01-02). The missing dates are the non-business days.

Resample the date-time values

Consider, we are given the following data series:

>>> series = pd.Series(range(9), index=pd.date_range('01-01-2021', periods=9))
>>> series
2021-01-01    0
2021-01-02    1
2021-01-03    2
2021-01-04    3
2021-01-05    4
2021-01-06    5
2021-01-07    6
2021-01-08    7
2021-01-09    8
Freq: D, dtype: int64

We have several dates and for each date, we get a value. Now, we might wonder: What is the mean value for every three days? Is there a simple way to get that information?

As it turns out: Yes, there is!

>>> series.resample('3D').mean()
>>> series
2021-01-01    1.0
2021-01-04    4.0
2021-01-07    7.0
Freq: 3D, dtype: float64

We resample the time series into 3-day episodes and calculate the mean for every three days. The resample() function expects a string that defines how we want to resample the time-series data. “3D” means “3 days”. Instead of the mean, we could also calculate the sum here:

>>> series.resample('3D').sum()
2021-01-01     3
2021-01-04    12
2021-01-07    21
Freq: 3D, dtype: int64

The resample() function is really powerful since it allows us to resample our time data and perform a calculation on that newly structured data with just one line of code.

Inferring the most likely time-frequency

Especially in larger datasets, it might be difficult to tell the frequency of a DatetimeIndex. The Pandas function infer_freq() does what its name suggests: it infers the most likely frequency.

Let’s have a look at the following date range:

range = pd.date_range (start='01-01-2021', end='31-01-2021', periods=31)

We set a starting point at 01-01-2021 and an ending point at 31-01-2021 and we set the “periods” parameter to 31. Since we start on the first day of the month, end on the last day of the month, and have 31 periods, we might guess that the most likely frequency is daily.

We apply the infer_freq() function to see if that’s right:

pd.infer_freq(range)
# 'D'

Since “D” stands for day, we are right!

Working with time deltas

By now, we have seen how to work with specific dates and timestamps. However, sometimes we need to work with absolute time units, for example, "2 days", "14 minutes" etc.

In Pandas, these are called “time deltas”. They represent absolute time differences.

And we use the to_timedelta() function to convert an argument into a “Timedelta” object.

pd.to_timedelta('5 days, 5 minutes')
# Timedelta('5 days 00:05:00')

We assign the to_timedelta() function the string "5 days, 5 minutes" and the function transforms this string into a “Timedelta” object. Note that the "5 minutes" part was transformed into the format “00:05:00”.

Alternatively, we can assign the function a list with multiple absolute time units which are then transformed into “Timedelta” objects:

>>> pd.to_timedelta(['5 days 3 s', '11.3 ms', '3 W'])
TimedeltaIndex(['5 days 00:00:03', '0 days 00:00:00.011300',
                '21 days 00:00:00'],
               dtype='timedelta64[ns]', freq=None)

So, what do we do with time deltas? As mentioned above, time deltas are absolute differences in time. We could, for example, create a data frame with one column representing dates in a daily frequency and another column with time deltas counting the absolute days:

df2 = pd.DataFrame({'Date': pd.date_range(start='01-01-2021', periods=31, freq='D'),
'Number days': pd.timedelta_range(start='1 day', periods=31)})

# df2:
DateNumber days
02021-01-011 days
12021-01-022 days
22021-01-033 days
32021-01-044 days
42021-01-055 days
52021-01-066 days
62021-01-077 days
72021-01-088 days
82021-01-099 days
92021-01-1010 days
     

We create the “Date” column using the date_range() function which we introduced in a previous section. For the “Number days” column, we apply the timedelta_range() function. This function works basically the same as the date_range() function but it does not produce dates or times, but absolute time units such as “2 days” or “3 days” in this case.

Summary

Working with date-time values is a very essential skill since we often find datasets with date and/or time values in them. In this tutorial, we have learned how to apply various date-time functions Pandas provides us with. We learned how to convert arguments into date-time values, how to create several date ranges, how to resample our date values, and how to work with time deltas.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!