Pandas Plotting Autocorrelation

A correlogram is a chart used in data analysis to check for randomness in a data set. The less random the data, the stronger the correlation between data points, and the correlogram highlights any correlations that are statistically significant.

An autocorrelogram checks for the same degree of correlation or randomness but in data with a discrete-time sequence between the data points. Such data is known as time-series data. Also known as serial correlation, autocorrelation compares data with a delayed copy of itself to identify possible repeating patterns.
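To make the idea of a "delayed copy" concrete, here is a minimal sketch using a made-up series that repeats every four steps. The series and the helper name lagged_correlation are illustrative assumptions, not part of the article's dataset; the helper simply correlates the series with a shifted version of itself.

```python
import pandas as pd

# Hypothetical series whose pattern repeats every 4 steps
series = pd.Series([1, 3, 5, 3] * 6, dtype="float64")


def lagged_correlation(s: pd.Series, lag: int) -> float:
    """Pearson correlation between a series and a copy of itself shifted by `lag` steps."""
    # shift() delays the series; corr() ignores the rows left empty by the shift
    return s.corr(s.shift(lag))


for lag in (1, 2, 4):
    print(f"lag {lag}: {lagged_correlation(series, lag):.3f}")
```

A lag that matches the length of the repeating pattern should give a correlation close to +1, which is how an autocorrelation plot reveals seasonality. Pandas also offers Series.autocorr(lag), which performs the same shift-and-correlate calculation.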

This article will investigate the Pandas Plotting module function, autocorrelation_plot(), and use it to create an autocorrelation plot for some time-series data.

Building Our First Autocorrelogram

There are three main steps to creating an autocorrelogram in Python. First, we need to create or access some time-series data. For this first example, we’ll manually create a small dataset showing 10 years of annual household gross income alongside annual household expenditure. Of course, you could use the Pandas read_csv() function or one of the other Pandas functions to import data from a separate file; however, for this example we’ll simply create the data in code. Here’s the data.

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', 
             '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753,  
                         100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 
                              71630, 77360, 83549, 90233, 97451],
}

So we have a time series in the form of a dictionary, showing the year, annual household income, and annual household expenditure for 10 years. If your household is anything like mine, the more you earn, the more you spend. Therefore, I’ve created these fictitious sums to show a similar trend.
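If you would rather load the data from a file, the sketch below shows one way it might look with read_csv(). To keep it self-contained, the CSV text is held in a string that stands in for a file on disk; the layout and column names are assumptions that mirror the dictionary above, and only the first three years are included.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the layout is an assumption for illustration
csv_text = """Date,Household Income,Household Expenditure
2012,65000,48750
2013,70850,52650
2014,77227,56862
"""

# Read the year as text so it can be parsed as a date afterwards
csv_frame = pd.read_csv(io.StringIO(csv_text), dtype={"Date": str})
csv_frame["Date"] = pd.to_datetime(csv_frame["Date"], format="%Y")
csv_frame = csv_frame.set_index("Date")

print(csv_frame)
```

With a real file, you would pass its path to read_csv() in place of the io.StringIO object.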

For the next step, we need to import Pandas and create a data frame from the time series data we have just entered. We’ll also need to turn the date information from a string into a format recognisable as a date. Finally, we will set the index for our plot to be the Date column, as our income and expenditure figures are annual.

import pandas as pd

income_vs_expenditure_timeseries = {

    'Date': ['2012', '2013', '2014', '2015', '2016', 
             '2017', '2018', '2019', '2020', '2021'],

    'Household Income': [65000, 70850, 77227, 84177, 91753, 
                         100011, 109012, 118823, 129517, 141173],

    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 
                              71630, 77360, 83549, 90233, 97451],
}

dataframe = pd.DataFrame(income_vs_expenditure_timeseries)

# Recent Pandas versions require an explicit unit such as [ns] on the datetime64 type
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

dataframe = dataframe.set_index("Date")

In the previous code, we imported Pandas with an alias of pd. The time-series data is as you saw it earlier. Then we used the pd.DataFrame() constructor to create a data frame from our time series.

Next, we used the astype() method to change the Date column from a string to a datetime type. Called on a single column, as here, astype() returns a new column converted to the specified data type; it does not modify the original, which is why we assign the result back to the Date column. You can also call astype() on the entire data frame, either with a single type for every column or with a dictionary that allocates a data type to each named column. Finally, the set_index() method makes the Date column the data frame index.
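As a rough illustration of the two conversion styles, the sketch below uses a shortened, three-row version of the data. It passes a dictionary to astype() to set a type per column, and then performs the same date conversion with pd.to_datetime(), which is an alternative not used in this article but which lets you state the expected date format explicitly.

```python
import pandas as pd

# Shortened copy of the article's data, for illustration only
frame = pd.DataFrame({
    "Date": ["2012", "2013", "2014"],
    "Household Income": [65000, 70850, 77227],
})

# A dictionary passed to astype() assigns a data type to each named column
converted = frame.astype({"Date": "datetime64[ns]", "Household Income": "float64"})

# pd.to_datetime() with an explicit format spells out how the strings are parsed
explicit = pd.to_datetime(frame["Date"], format="%Y")

print(converted.dtypes)
print((explicit == converted["Date"]).all())
```

Note that frame itself is left unchanged: astype() returns a new object each time.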

Now we’ll add a print command to see the appearance of our data frame.

import pandas as pd

income_vs_expenditure_timeseries = {

    'Date': ['2012', '2013', '2014', '2015', '2016', 
             '2017', '2018', '2019', '2020', '2021'],

    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 
                         109012, 118823, 129517, 141173],

    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 
                              71630, 77360, 83549, 90233, 97451],
}

dataframe = pd.DataFrame(income_vs_expenditure_timeseries)

# Recent Pandas versions require an explicit unit such as [ns] on the datetime64 type
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

dataframe = dataframe.set_index("Date")

print(dataframe)

# Result
'''
            Household Income  Household Expenditure
Date
2012-01-01             65000                  48750
2013-01-01             70850                  52650
2014-01-01             77227                  56862
2015-01-01             84177                  61411
2016-01-01             91753                  66324
2017-01-01            100011                  71630
2018-01-01            109012                  77360
2019-01-01            118823                  83549
2020-01-01            129517                  90233
2021-01-01            141173                  97451
'''

With the data frame completed to our satisfaction, we’ll import the Matplotlib Pyplot module to allow us to configure and show our final plot. Before we create an autocorrelogram, let’s look at a regular plot of our data frame to see what the data shows us.

import pandas as pd
import matplotlib.pyplot as plt

income_vs_expenditure_timeseries = {

    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', 
             '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 
                         100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 
                              71630, 77360, 83549, 90233, 97451],
}

# Preparing the data frame with Pandas
dataframe = pd.DataFrame(income_vs_expenditure_timeseries)

# Recent Pandas versions require an explicit unit such as [ns] on the datetime64 type
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

dataframe = dataframe.set_index("Date")

# Preparing and showing a regular plot with Matplotlib Pyplot
plt.xlabel("Date")

plt.ylabel("Values")

plt.plot(dataframe)

plt.show()

In this code, we imported matplotlib.pyplot as the alias, plt. We then created labels for the x and y axes of our plot, before plotting the result. Here’s what our data frame looks like in graphical format.

This chart shows the apparent correlation between the blue line, which is household income, and the orange household expenditure line. Each year, as the income has increased, so too has our expenditure. While the expenditure is not on a one-for-one basis, there is an obvious positive correlation between the two.
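The visual impression can be backed up with a number. The following sketch, which goes slightly beyond what the article itself does, uses the Pandas Series.corr() method to compute the Pearson correlation between the two columns. Note that this measures correlation between two different series, which is not the same thing as the autocorrelation we plot later.

```python
import pandas as pd

frame = pd.DataFrame({
    "Household Income": [65000, 70850, 77227, 84177, 91753,
                         100011, 109012, 118823, 129517, 141173],
    "Household Expenditure": [48750, 52650, 56862, 61411, 66324,
                              71630, 77360, 83549, 90233, 97451],
})

# Pearson correlation between the income and expenditure columns
pearson = frame["Household Income"].corr(frame["Household Expenditure"])

print(f"Pearson correlation: {pearson:.4f}")
```

A value close to +1 confirms that the two columns rise together almost in lockstep.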

While the correlation in our simple data is evident from a casual glance, the patterns in a large, real-world time-series dataset can be anything but obvious. In complex cases, an autocorrelation graph makes seasonality and correlation much easier to identify. Now that we know there is a correlation in our simple data set, let’s see what an autocorrelation plot will tell us.

Using pandas.plotting.autocorrelation_plot() Function

The Pandas plotting module contains the autocorrelation_plot() function, which accepts three kinds of argument, only the first of which is required. You must pass it series, the time-series data to plot. If you wish, you may also pass ax, a Matplotlib axes object to draw on, and any further keyword arguments, which are passed on to Matplotlib's plotting method.

For more information on the function, here is the link:

https://pandas.pydata.org/docs/reference/api/pandas.plotting.autocorrelation_plot.html#pandas.plotting.autocorrelation_plot
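For completeness, here is a small sketch of how the optional arguments might be used. It plots the income column alone onto an axes object we create ourselves and passes color and linewidth through to Matplotlib. The specific styling values are arbitrary choices, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in an interactive session

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import autocorrelation_plot

years = ["2012", "2013", "2014", "2015", "2016",
         "2017", "2018", "2019", "2020", "2021"]

income = pd.Series(
    [65000, 70850, 77227, 84177, 91753,
     100011, 109012, 118823, 129517, 141173],
    index=pd.to_datetime(years, format="%Y"),
    name="Household Income",
)

fig, ax = plt.subplots()

# ax chooses where to draw; the remaining keyword arguments go to Matplotlib
result_ax = autocorrelation_plot(income, ax=ax, color="tab:green", linewidth=2)
ax.set_title("Autocorrelation of Household Income")

# In an interactive session, call plt.show() here to display the figure
```

Passing your own axes is handy when you want several autocorrelation plots side by side in one figure.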

We won’t use the options; we’ll simply pass the time-series data to the function, and we’ll give Matplotlib a title to add to the final plot.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

income_vs_expenditure_timeseries = {

    'Date': ['2012', '2013', '2014', '2015', '2016', 
             '2017', '2018', '2019', '2020', '2021'],

    'Household Income': [65000, 70850, 77227, 84177, 91753, 
                         100011, 109012, 118823, 129517, 141173],

    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 
                              71630, 77360, 83549, 90233, 97451],
}

# Preparing the data frame with Pandas
dataframe = pd.DataFrame(income_vs_expenditure_timeseries)

# Recent Pandas versions require an explicit unit such as [ns] on the datetime64 type
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

dataframe = dataframe.set_index("Date")

# Giving a plot title, calling the autocorrelation function then showing the plot
plt.title("Income vs Expenditure Time Series Plot")

autocorrelation_plot(dataframe)

plt.show()

So here’s the result of all your work! Your first autocorrelation plot.

So What Does It All Mean?

Let’s pull the plot apart to help you understand what it is you’re seeing. On the x-axis, you have a label of 'Lag'. The lag is the number of time steps by which the data is shifted before being compared with itself. Our data consists of ten annual observations, so each lag step is one year and the axis covers lags of up to ten years.

The y-axis is the correlation axis, and it’s this axis we should understand better. The y-axis ranges from -1 through 0 and up to +1. A value near -1 at a given lag indicates a strong negative correlation: values above the average tend to be followed, that many steps later, by values below the average. A value near +1 indicates a strong positive correlation: high values tend to be followed, that many steps later, by high values, and low values by low values. Bear in mind that autocorrelation compares a time series with a lagged copy of itself, rather than comparing one time series with another.

If the values in a time series are largely random, the plotted line will stay close to zero. The more significant the correlation at a given lag, the further the plot is from the zero line. The horizontal lines on the plot are confidence bands: the solid lines mark the 95% band and the dashed lines mark the 99% band. Values that fall outside a band are statistically significant at that level.
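To see where the plotted values and the confidence bands come from, the sketch below computes them by hand with NumPy for the income column. It uses the standard sample autocorrelation estimator and the usual large-sample bands of z divided by the square root of n; to my understanding this mirrors what the Pandas function draws, but treat that as an assumption rather than a statement about its internals.

```python
import numpy as np

income = np.array([65000, 70850, 77227, 84177, 91753,
                   100011, 109012, 118823, 129517, 141173], dtype=float)

n = len(income)
mean = income.mean()
c0 = np.sum((income - mean) ** 2) / n  # lag-0 autocovariance, i.e. the variance


def sample_autocorrelation(lag: int) -> float:
    """Standard sample autocorrelation of the income data at the given lag."""
    covariance = np.sum((income[: n - lag] - mean) * (income[lag:] - mean)) / n
    return float(covariance / c0)


values = [sample_autocorrelation(lag) for lag in range(1, n)]

# Approximate confidence bands for a purely random series of length n
band_95 = 1.96 / np.sqrt(n)
band_99 = 2.576 / np.sqrt(n)

for lag, value in enumerate(values, start=1):
    print(f"lag {lag}: {value:+.3f}")
print(f"95% band: +/-{band_95:.3f}, 99% band: +/-{band_99:.3f}")
```

With only ten observations the bands are wide, so a correlation has to be large before it counts as significant. This is one reason why short series such as ours make for a limited demonstration.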

Given our overly simple example with few data points over a small time-frame, the plot will make more sense if we show a similar plot with greater randomness between the time series.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

income_vs_expenditure_timeseries = {

    'Date': ['2012', '2013', '2014', '2015', '2016', 
             '2017', '2018', '2019', '2020', '2021'],

    'Household Income': [65000, 70850, 77227, 84177, 91753, 
                         100011, 109012, 118823, 129517, 141173],

    'Household Expenditure': [28750, 174782, 32402, 13513, 102753, 
                              54345, 132987, 12927, 98098, 45409],
}

dataframe = pd.DataFrame(income_vs_expenditure_timeseries)
# Recent Pandas versions require an explicit unit such as [ns] on the datetime64 type
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

dataframe = dataframe.set_index("Date")

# Calling the autocorrelation function then showing the plot
autocorrelation_plot(dataframe)

plt.show()

In this data, the household expenditure is all over the place year by year, and appears to bear no relation to the household income. Here, then, is the plot from these data.

It’s obvious from a cursory examination that this plot shows little to no correlation, bouncing around the zero line, providing a useful comparison to our previous plot.
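Because autocorrelation is a property of a single series, it can also be instructive to look at each column on its own. The sketch below is an addition to the article's approach: it uses the Pandas Series.autocorr() method to compute the lag-1 autocorrelation of the steadily rising income column and of the erratic expenditure column from this second dataset.

```python
import pandas as pd

frame = pd.DataFrame({
    "Household Income": [65000, 70850, 77227, 84177, 91753,
                         100011, 109012, 118823, 129517, 141173],
    "Household Expenditure": [28750, 174782, 32402, 13513, 102753,
                              54345, 132987, 12927, 98098, 45409],
})

# Correlation of each column with a copy of itself delayed by one year
income_r1 = frame["Household Income"].autocorr(lag=1)
expenditure_r1 = frame["Household Expenditure"].autocorr(lag=1)

print(f"Income lag-1 autocorrelation:      {income_r1:+.3f}")
print(f"Expenditure lag-1 autocorrelation: {expenditure_r1:+.3f}")
```

The smoothly trending income column correlates strongly with its own previous year, while the erratic expenditure column does not. To plot either column by itself, you could pass it alone to autocorrelation_plot().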

In Summary

We’ve introduced correlograms and autocorrelograms and spoken of their use in highlighting correlation or randomness in datasets and time-series data, respectively.

We imported Pandas and used it to create a data frame from our dictionary data. We changed the string-type date column to a datetime type and set the date column as the data frame index.

We then used the Matplotlib Pyplot module to configure the plot with a title before creating the autocorrelation chart using the Pandas function, autocorrelation_plot(), found in the Plotting module. Finally, we again used the Pyplot module to show the finished graph.

If you need time-series data to create your own autocorrelation plot, many free datasets are available online.

Thanks for reading, and I hope you found the article useful.