Andrews curves are used to identify structure in a multi-dimensional data set. By reducing complex data to a two-dimensional graph, we can more easily identify variables in the data that are associated, form clusters, or are outliers.
We’ll show you how to plot such graphs, but before we get to that, let’s ensure every reader has a basic understanding of what we’re discussing and the tools we’re using to achieve our output.
An Introduction To Andrews Curves
David F. Andrews is a statistician who, in 1972, came up with a method of plotting multi-dimensional data using his own smoothed form of a parallel coordinate plot. Using a truncated version of a mathematical function called a Fourier series, his equation turns each observation (each row of data) into a curve built from sine and cosine terms, and overlays these curves on the same plot. Rows with similar values produce curves that stay close together, so the display allows us to identify where the data correlate and where they may form a cluster. We are also able to identify observations that have little in common with the rest or are outliers.
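To make the idea concrete, here is a minimal sketch of Andrews' function written with only the standard library. It assumes the usual form of the curve, f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + ..., evaluated for t between -π and π. The function name `andrews_curve` is my own, and the sample values are a single Iris-setosa measurement used purely for illustration.

```python
import math

def andrews_curve(row, t):
    """Evaluate Andrews' function for one observation `row` at angle `t`."""
    # First term: x1 / sqrt(2)
    value = row[0] / math.sqrt(2)
    # Remaining terms alternate sin(k*t), cos(k*t) for k = 1, 2, ...
    for i, x in enumerate(row[1:]):
        k = i // 2 + 1
        value += x * (math.sin(k * t) if i % 2 == 0 else math.cos(k * t))
    return value

# One Iris-setosa row: sepal length, sepal width, petal length, petal width
sample = [5.1, 3.5, 1.4, 0.2]

# Evaluate the curve at nine evenly spaced angles from -pi to pi
ts = [-math.pi + 2 * math.pi * j / 8 for j in range(9)]
curve = [andrews_curve(sample, t) for t in ts]

for t, y in zip(ts, curve):
    print(f"t = {t:+.3f}  f(t) = {y:.3f}")
```

Each row of a data set yields one such curve, and plotting them all on the same axes gives the overlaid picture described above.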
Where Are Andrews Curves Used?
Used in many different fields of science, Andrews Curves are standard in biology, quality control, semiconductor manufacture, and sociology. For our purposes, Andrews Curves are helpful in machine learning, and they can also assist when carrying out ETL (Extract, Transform, and Load) tasks, by highlighting where data may require further cleaning before use.
Using Pandas And Matplotlib
For our demonstration, we’re going to use Pandas and Matplotlib. If you’re knowledgeable in both, feel free to jump to the next section.
Pandas is a module designed to be used with Python to carry out data analysis in fields such as finance, economics, and statistics. If you haven’t used Pandas before, it doesn’t come with Python; therefore, you need to install it with the following command.
pip3 install pandas
Matplotlib is Python’s very powerful plotting library, containing functions that create two-dimensional plots using data in a Python list or array. It, too, is installed with the pip package manager using the following command.
pip3 install matplotlib
Within Matplotlib, we’ll be using the submodule Pyplot, which assists us in plotting and visualising data.
Once the Pandas and Matplotlib.Pyplot packages are installed on your system, you need to import them into your code. I’ve used the common aliases ‘pd’ and ‘plt’ in the following example.
# Importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt
Creating Our DataFrame
A DataFrame is simply a two-dimensional data structure storing tabular data. Think of it as Python’s own Excel spreadsheet, held in memory.
When writing your first Python code, it’s traditional that you always begin with ‘Hello World’. When creating your first Andrews Curves, it seems traditional that you start with 'Iris Data'.
The Iris flower dataset is used for beginners in machine learning and stems from the mid-1930s. One of the best-known databases to be found in the literature regarding pattern recognition, the Iris dataset provides a multivariate dataset containing 50 samples each from three different species of Iris. The features measured were the width and length of the Iris’ petals and sepals.
To access Iris Data, use this link, then change the file type to .csv. I then added column headers to the sheet in row #1. The species column is named Class, which is the name the plotting code below relies on.
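If you would rather not edit the file by hand, pandas can attach the headers as it reads the data. The sketch below assumes the downloaded file has no header row, as the step above implies. The four measurement column names are hypothetical; only `Class` matches the name used later in this article. Three real rows from the dataset stand in for the file so the example runs on its own.

```python
import io
import pandas as pd

# Three rows from the Iris data, standing in for the downloaded file
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Hypothetical header names; only 'Class' is required by the later code
columns = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Class"]

# header=None says the file has no header row; names= supplies one
df = pd.read_csv(raw, header=None, names=columns)
print(df)
```

With a real file, replace `raw` with the path to your CSV.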
Now we need to have Pandas create our DataFrame. Remember the location in which you’ve saved the CSV file, as you’ll need it now to replace the pathname I have used below.
# Importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt

# Make a data frame from our csv file
df = pd.read_csv('C:\\Users\\david\\downloads\\iris.csv')
At this point, the CSV has been transformed to a DataFrame and assigned to variable df. Now we need Pandas to create Andrews Curves from the data contained in our DataFrame.
The Pandas Plotting Module
Within the plotting module, there are twelve functions, of which one is for plotting Andrews Curves. The syntax of the function is as follows:
pandas.plotting.andrews_curves(frame, class_column, ax=None, samples=200, color=None, colormap=None, **kwargs)
Here’s the meaning of the parameters:
Argument | Description |
---|---|
frame | The DataFrame holding the data you want plotted. The documentation suggests it is better to normalise the data range to between 0.0 and 1.0 |
class_column | The name of the column that contains the class names |
ax | The matplotlib axes object to draw on, which defaults to None |
samples | The number of points to be plotted for each curve |
color | The colors to use for the different classes, with a default of None |
colormap | The colormap from which the plot colors are selected, with a default of None |
**kwargs | Options you pass to the matplotlib plotting method. |
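The table mentions normalising the data to the 0.0 to 1.0 range. One common way to do that is min-max scaling, sketched below. This is an assumption on my part about how you might normalise, not something the function does for you, and the measurement column names are hypothetical. Note that a column whose values are all identical would divide by zero and produce NaN, so check for that in real data.

```python
import pandas as pd

# A tiny stand-in DataFrame; the measurement column names are hypothetical
df = pd.DataFrame(
    {
        "Sepal_Length": [5.1, 7.0, 6.3],
        "Sepal_Width": [3.5, 3.2, 3.3],
        "Petal_Length": [1.4, 4.7, 6.0],
        "Petal_Width": [0.2, 1.4, 2.5],
        "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

# Every column except the class label is a numeric feature
features = df.columns.drop("Class")

# Min-max scaling: (value - column minimum) / (column maximum - column minimum)
col_min = df[features].min()
col_max = df[features].max()
scaled = df.copy()
scaled[features] = (df[features] - col_min) / (col_max - col_min)

print(scaled)
```

You would then pass `scaled`, rather than `df`, as the `frame` argument.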
In our case, I’m happy to accept the defaults, only specifying the DataFrame, the class_column, and the smoothness of the curve using samples=250. Here’s the code.
# Importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt

# Make a data frame from our csv file
df = pd.read_csv('C:\\Users\\david\\downloads\\iris.csv')

# Creating Andrews curves
x = pd.plotting.andrews_curves(df, 'Class', samples=250)
Now we’re done with Pandas. We’ve used it to read the CSV file and assign it to a variable, and then we’ve called the plotting function for Andrews Curves, remembering to use the ‘pd’ alias in this case. Finally, we’ve assigned the result to variable x. The function draws the curves and returns the Matplotlib Axes object that holds them.
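The optional `ax` and `colormap` parameters from the table can be tried out in the same call. The sketch below is one possible variation, not the approach used in the rest of this article. It builds a three-row stand-in DataFrame with hypothetical measurement column names, draws onto an Axes we create ourselves, and saves the figure under a file name of my choosing. The `Agg` backend line is only there so the script runs without a display; remove it if you want to call `plt.show()` instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A tiny stand-in DataFrame; the measurement column names are hypothetical
df = pd.DataFrame(
    {
        "Sepal_Length": [5.1, 7.0, 6.3],
        "Sepal_Width": [3.5, 3.2, 3.3],
        "Petal_Length": [1.4, 4.7, 6.0],
        "Petal_Width": [0.2, 1.4, 2.5],
        "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

# Create our own figure and axes, then ask pandas to draw on them
fig, ax = plt.subplots(figsize=(8, 5))
returned = pd.plotting.andrews_curves(
    df, "Class", ax=ax, samples=250, colormap="viridis"
)

# The returned object is the same Axes, so it can be styled directly
returned.set_title("Andrews Curves drawn on a supplied Axes")

# Save to a file (hypothetical name) instead of opening a window
fig.savefig("andrews_curves.png")
```

One curve is drawn per row of the DataFrame, each sampled at the number of points given by `samples`.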
Enter The Matplotlib Module, Pyplot
Now we’re ready to output the plotted figure with the matplotlib module, pyplot. At a quick count, pyplot has over 150 functions to create the graph and style we wish. For details on those functions, visit this link. We’ll only need three. First, we’ll create the plot, then add a title, and finally, reveal the Andrews Curves. Here’s the code in its entirety.
# Importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt

# Make a data frame from our CSV file
df = pd.read_csv('C:\\Users\\david\\downloads\\iris.csv')

# Creating Andrews curves
x = pd.plotting.andrews_curves(df, 'Class', samples=250)

# Plot the Curve
x.plot()

# Give the plot a title
plt.title("A Finxters Iris Plant classification using Andrews Curves")

# Display the final output
plt.show()
And the grand reveal of all that work?
This plot illustrates that the human eye is still extremely useful in pattern matching, with one flower linearly separable from the other two. The other two are not linearly separable from each other. It is the Iris-versicolor and the Iris-virginica that share strong similarities, while the Iris-setosa differs.
While not immediately apparent when looking at the CSV data, the Iris-setosa has a petal length that is less than its sepal width. In contrast, the other two varieties have petals that are longer than their sepals are wide.
In Summary
- This article introduced the Pandas plotting module; specifically, one of its functions used to create Andrews Curves.
- We learned that Andrews Curves were introduced in the early 1970s by David F. Andrews as a method of plotting multi-dimensional data allowing us to identify areas where variables correlate and where they may form a cluster. They also enable us to identify those data that have little correlation or are outliers.
- Installing the Pandas and Matplotlib modules, we used Pandas to import a CSV data file and plot the required Andrews Curves, assigning the plot to a variable.
- We then used the Matplotlib submodule, Pyplot, to name, plot, and show the final graphical output.
I hope this article and the accompanying video have been helpful!