A **correlogram** is a chart used in data analysis to check for randomness, or its opposite, correlation, in a data set; hence the name. The less random the data, the stronger the correlation between the data points. The correlogram highlights any statistically significant relationship between data points.

An **autocorrelogram** checks for the same degree of correlation or randomness, but in data where the points are ordered at discrete time intervals. Such data is known as time-series data. Autocorrelation, also known as serial correlation, compares the data with a delayed copy of itself to identify possible repeating patterns.
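To make the "delayed copy" idea concrete, here is a minimal sketch. The series is a hypothetical one invented for illustration, and the choice of a three-step lag is an assumption made so that the repeat lines up; `shift()`, `corr()`, and `autocorr()` are standard Pandas Series methods.

```python
import pandas as pd

# Hypothetical series that repeats every three steps (not from the article)
series = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3], dtype="float64")

lag = 3
delayed_copy = series.shift(lag)      # the first `lag` values become NaN
manual = series.corr(delayed_copy)    # pairs containing NaN are ignored
builtin = series.autocorr(lag=lag)    # Pandas shortcut for the same comparison

print(manual, builtin)
```

Because the pattern repeats every three steps, the series lines up with its own copy delayed by three steps, which is the kind of repeating pattern autocorrelation is meant to reveal.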

This article investigates the `autocorrelation_plot()` function from the Pandas plotting module and uses it to create an autocorrelation plot for some time-series data.

## Building Our First Autocorrelogram

There are three main steps to creating an autocorrelogram in Python. First, we need to create or access some time-series data. For this first example, we’ll manually create a small dataset showing 10 years of annual household gross income, along with annual household expenditure. Of course, you could use the Pandas `read_csv()` function, or one of the other Pandas input functions, to import data from a separate file; for this example, however, we’ll simply create the data in code. Here’s the data.

```python
income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451],
}
```

So we have a time series in the form of a dictionary, showing the year, annual household income, and annual household expenditure for 10 years. If your household is anything like mine, the more you earn, the more you spend. Therefore, I’ve created these fictitious sums to show a similar trend.
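If your figures live in a file instead, a rough sketch of the `read_csv()` route might look like the following. The CSV text, its column layout, and the use of full dates are assumptions made for illustration; an in-memory string stands in for a real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical file contents; a real script would pass a file path instead
csv_text = """Date,Household Income,Household Expenditure
2012-01-01,65000,48750
2013-01-01,70850,52650
2014-01-01,77227,56862
"""

# parse_dates converts the Date column to datetimes,
# and index_col makes it the data frame index in the same call
from_file = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],
    index_col="Date",
)

print(from_file)
```

Reading the data this way handles the date conversion and the indexing during the import, rather than as separate steps afterwards.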

For the next step, we need to import Pandas and create a data frame from the time series data we have just entered. We’ll also need to turn the date information from a string into a format recognisable as a date. Finally, we will set the index for our plot to be the Date column, as our income and expenditure figures are annual.

```python
import pandas as pd

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451],
}

# Build the data frame from the dictionary
dataframe = pd.DataFrame(income_vs_expenditure_timeseries)

# Convert the year strings to datetimes; recent Pandas versions
# require the unit, so we use "datetime64[ns]" rather than "datetime64"
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")

# Use the dates as the index
dataframe = dataframe.set_index("Date")
```

In the previous code, we imported Pandas with the alias `pd`. The time-series data is as you saw it earlier. Then we used the `pd.DataFrame()` constructor to create a data frame from our time series.

Next, we used the `astype()` method to change the Date column from a string to a datetime type. `astype()` returns a new object with the data cast to the specified type. You can use it to change an entire data frame to one type or, by passing a dictionary that maps column names to types, to allocate a data type to each column. Finally, the `set_index()` method makes the Date column the data frame index.
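As a small illustration of the dictionary form of `astype()`, here is a sketch using a cut-down, two-row version of our figures. The target types (`float64` and `int32`) are arbitrary choices for the example, not something our plot needs.

```python
import pandas as pd

# Cut-down version of the article's data, for illustration only
small_frame = pd.DataFrame({
    "Household Income": [65000, 70850],
    "Household Expenditure": [48750, 52650],
})

# Map each column name to the type it should be cast to
converted = small_frame.astype({
    "Household Income": "float64",
    "Household Expenditure": "int32",
})

print(converted.dtypes)
```

Note that `astype()` returns a new data frame; `small_frame` itself keeps its original types unless you assign the result back.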

Now we’ll add a print command to see the appearance of our data frame.

```python
import pandas as pd

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451],
}

dataframe = pd.DataFrame(income_vs_expenditure_timeseries)
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")
dataframe = dataframe.set_index("Date")

print(dataframe)

# Result
'''
            Household Income  Household Expenditure
Date
2012-01-01             65000                  48750
2013-01-01             70850                  52650
2014-01-01             77227                  56862
2015-01-01             84177                  61411
2016-01-01             91753                  66324
2017-01-01            100011                  71630
2018-01-01            109012                  77360
2019-01-01            118823                  83549
2020-01-01            129517                  90233
2021-01-01            141173                  97451
'''
```

With the data frame completed to our satisfaction, we’ll import the Matplotlib Pyplot module to allow us to configure and show our final plot. Before we create an autocorrelogram, let’s look at a regular plot of our data frame to see what the data shows us.

```python
import pandas as pd
import matplotlib.pyplot as plt

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451],
}

# Preparing the data frame with Pandas
dataframe = pd.DataFrame(income_vs_expenditure_timeseries)
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")
dataframe = dataframe.set_index("Date")

# Preparing and showing a regular plot with Matplotlib Pyplot
plt.xlabel("Date")
plt.ylabel("Values")
plt.plot(dataframe)
plt.show()
```

In this code, we imported `matplotlib.pyplot` with the alias `plt`. We then created labels for the `x` and `y` axes of our plot before plotting the result. Here’s what our data frame looks like in graphical format.

This chart shows the apparent correlation between the blue line, which is household income, and the orange household expenditure line. Each year, as the income has increased, so too has our expenditure. While the expenditure is not on a one-for-one basis, there is an obvious positive correlation between the two.
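If you’d rather put a number on that impression than rely on the chart, one option is the Pearson correlation coefficient between the two columns. This is only a sketch of a side check, not part of the autocorrelation workflow; `Series.corr()` is the Pandas method used, and the variable names are my own.

```python
import pandas as pd

income = pd.Series(
    [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173]
)
expenditure = pd.Series(
    [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451]
)

# Pearson correlation between the two series (the default method)
pearson = income.corr(expenditure)

print(round(pearson, 4))
```

A value close to +1 backs up what the chart suggests: the two series rise together.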

While the correlation in our simple data is evident from a casual glance, the patterns in a comprehensive time-series dataset can be anything but obvious. In complex cases, the autocorrelation graph will better identify seasonality and correlation. Now that we know there is a correlation in our simple data set, let’s see what an autocorrelation plot will tell us.

## Using pandas.plotting.autocorrelation_plot() Function

The Pandas plotting module contains the `autocorrelation_plot()` function, which takes three parameters, two of which are optional. You must pass it the time-series data to plot (`series`) and, if you wish, you may also pass a Matplotlib axes object to draw on (`ax`) and further keyword arguments, which are handed on to Matplotlib for plotting.
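For the curious, here is a rough sketch of what using those optional parameters could look like. The colour, line width, and title are arbitrary choices of mine, and I’ve selected the non-interactive `Agg` backend only so the snippet runs without a display; drop that line and call `plt.show()` to see the chart on screen.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; remove for interactive use

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import autocorrelation_plot

income = pd.Series(
    [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    name="Household Income",
)

# Create our own axes and hand them to the function via `ax`;
# the remaining keyword arguments are passed on to Matplotlib
fig, ax = plt.subplots()
returned_ax = autocorrelation_plot(income, ax=ax, color="tab:green", linewidth=2)
ax.set_title("Household Income Autocorrelation")
```

Passing your own axes is handy when you want the autocorrelation chart to sit alongside other plots in the same figure.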

For more information on the function, here is the link:

We won’t use the options; we’ll simply pass the time-series data to the function, and we’ll give Matplotlib a title to add to the final plot.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451],
}

# Preparing the data frame with Pandas
dataframe = pd.DataFrame(income_vs_expenditure_timeseries)
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")
dataframe = dataframe.set_index("Date")

# Giving a plot title, calling the autocorrelation function, then showing the plot
plt.title("Income vs Expenditure Time Series Plot")
autocorrelation_plot(dataframe)
plt.show()
```

So here’s the result of all your work! Your first autocorrelation plot.

## So What Does It All Mean?

Let’s pull the plot apart to help you understand what you’re seeing. The x-axis is labelled `'Lag'`. The lag is the number of time steps by which the copy of the data is delayed before it is compared with the original. In our case, each step is one year and there are ten of them, given that the data we used was based on annual household income and expenditure.

The y-axis is the correlation axis, and it’s this axis we should understand better. The y-axis ranges from -1 through 0 and up to +1. The -1 line indicates a strong negative correlation at that lag, where an increase in the data is matched by a proportionate decrease in its delayed copy. The +1 line indicates a strong positive correlation, where an increase in the data is matched by a proportionate increase in its delayed copy.

If there is considerable randomness in the data, the plotted line will stay close to zero. The more significant the correlation, the further the plot moves from the zero line. The horizontal lines on either side of zero are confidence bands: according to the Pandas documentation, the solid lines mark the 95% band and the dashed lines the 99% band.
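To show where numbers like these can come from, here is a sketch of the textbook sample autocorrelation estimator, together with the usual large-sample confidence bands of z divided by the square root of the number of points. I’m assuming this matches what the plot draws; treat it as an approximation of the idea rather than a copy of the Pandas internals. The function name `sample_autocorrelation` is my own.

```python
import numpy as np


def sample_autocorrelation(values, lag):
    """Textbook estimator: lagged covariance divided by the variance."""
    data = np.asarray(values, dtype=float)
    n = len(data)
    mean = data.mean()
    variance = np.sum((data - mean) ** 2) / n
    lagged = np.sum((data[: n - lag] - mean) * (data[lag:] - mean)) / n
    return lagged / variance


expenditure = [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451]
n = len(expenditure)

# Approximate confidence bands under the assumption of random data
band_95 = 1.96 / np.sqrt(n)
band_99 = 2.576 / np.sqrt(n)

lag_1 = sample_autocorrelation(expenditure, 1)
print(f"lag 1: {lag_1:.3f}, 95% band: ±{band_95:.3f}, 99% band: ±{band_99:.3f}")
```

With only ten data points the bands are wide, which is one reason a short series like ours rarely produces a convincing autocorrelation plot.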

Given our overly simple example, with few data points over a short time frame, the plot will make more sense if we compare it with a similar plot of data showing greater randomness.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

income_vs_expenditure_timeseries = {
    'Date': ['2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
    'Household Income': [65000, 70850, 77227, 84177, 91753, 100011, 109012, 118823, 129517, 141173],
    'Household Expenditure': [28750, 174782, 32402, 13513, 102753, 54345, 132987, 12927, 98098, 45409],
}

dataframe = pd.DataFrame(income_vs_expenditure_timeseries)
dataframe["Date"] = dataframe["Date"].astype("datetime64[ns]")
dataframe = dataframe.set_index("Date")

# Uncomment these lines to inspect the data frame and a regular plot
# print(dataframe)
# plt.xlabel("Date")
# plt.ylabel("Values")
# plt.title("Income vs Expenditure Time Series Plot")
# plt.plot(dataframe)

autocorrelation_plot(dataframe)
plt.show()
```

In this data, the household expenditure is all over the place year by year, and appears to bear no resemblance to the household income. Here, then, is the plot from these data.

It’s obvious from a cursory examination that this plot shows little to no correlation, bouncing around the zero line, providing a useful comparison to our previous plot.
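If you’d like a quick numerical check to go with the two plots, one possibility is to compare the one-year-lag autocorrelation of each expenditure column using the `Series.autocorr()` method. This is a sketch of my own rather than something the plots depend on, and the variable names are invented for the example.

```python
import pandas as pd

# The trending expenditure figures from the first example
steady_expenditure = pd.Series(
    [48750, 52650, 56862, 61411, 66324, 71630, 77360, 83549, 90233, 97451]
)
# The erratic expenditure figures from the second example
erratic_expenditure = pd.Series(
    [28750, 174782, 32402, 13513, 102753, 54345, 132987, 12927, 98098, 45409]
)

# Correlation of each series with a copy of itself delayed by one year
steady_lag_1 = steady_expenditure.autocorr(lag=1)
erratic_lag_1 = erratic_expenditure.autocorr(lag=1)

print(f"steady: {steady_lag_1:.3f}, erratic: {erratic_lag_1:.3f}")
```

The steadily growing series should score close to +1, since each year’s figure is a good predictor of the next, while the erratic series should fall well short of that.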

## In Summary

We’ve introduced correlograms and autocorrelograms and spoken of their use in highlighting correlation or randomness in datasets and time-series data, respectively.

We imported Pandas and used it to create a data frame from our dictionary data. We changed the string-type date column to a datetime type and set the date column as the data frame index.

We then used the Matplotlib Pyplot module to configure the plot with a title before creating the autocorrelation chart using the `autocorrelation_plot()` function from the Pandas plotting module. Finally, we again used the Pyplot module to show the finished graph.

If you need time-series data to create your own autocorrelation plot, you can find many free datasets at the following sites:

Thanks for reading, and I hope you found the article useful.

David is a Python programmer and a technical writer creating in-depth articles for readers wanting uncomplicated explanations for topics made difficult by industry jargon. Also a woodworker, metalworker, landscape photographer, and pilot, he is freelance after 42 years in the corporate world. He has an MBA in Technology.