Here’s what the end result of this short tutorial will look like. Beautiful, isn’t it?
Let’s have a quick look at the parameters and syntax first.
RadViz Parameters and Syntax
pandas.plotting.radviz(frame, class_column, ax=None, color=None, colormap=None, **kwargs)
|Parameter|Description|
|frame|The data you want plotted. The documentation suggests normalizing the data range to between 0.0 and 1.0.|
|class_column|The name of the column that contains the class names.|
|ax|The matplotlib axes object to draw on, which defaults to None (the current axes are then used).|
|color|A list or tuple of colors, one for each category.|
|colormap|The colormap from which the plot colors are selected, with a default of None.|
|**kwargs|Options passed on to the Matplotlib scatter plotting method.|
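To make the parameters more tangible, here is a minimal sketch that passes each of them explicitly. The DataFrame `toy`, its column names, and the chosen colors are made up for illustration, and the `Agg` backend is selected only so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import radviz

# Hypothetical toy data: three numeric features plus one class column.
toy = pd.DataFrame({
    "a": [0.1, 0.9, 0.5, 0.3],
    "b": [0.8, 0.2, 0.5, 0.7],
    "c": [0.4, 0.6, 0.5, 0.1],
    "label": ["x", "y", "x", "y"],
})

# ax: draw on an axes object we created ourselves.
fig, ax = plt.subplots(figsize=(5, 5))

# color: one color per class; s and alpha are forwarded to the scatter call.
ax = radviz(toy, "label", ax=ax, color=["tab:blue", "tab:orange"], s=40, alpha=0.8)
ax.set_title("RadViz on toy data")
```

Instead of `color`, you could pass `colormap="viridis"` (or another Matplotlib colormap name) and let the colors be picked from that colormap.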
Working with RadViz
pandas is a Python library for working with tabular data. It is often used for ingesting, organizing, and analyzing large data sets, and it provides tools for data wrangling, cleaning, and manipulation. Data plotting is among them too.
RadViz is useful when the data have more than three dimensions. Thanks to RadViz, a data scientist can project an N-dimensional data set onto a 2D plot.
RadViz places the feature dimensions at evenly spaced positions around the circumference of a circle. It then plots each sample inside the circle, at a position determined by the sample's normalized values along the axes that run from the center to each feature's point on the circumference.
💡 That may sound a bit abstract, though. Essentially, RadViz sets up a group of anchor points in a plane. These points are evenly spaced on a unit circle, and each one represents a single attribute. Each sample in the data set is attached to every anchor point by a spring whose stiffness is proportional to the sample's numerical value for that attribute. The point in the plane where the sample comes to rest (i.e., the “equilibrium” of the spring forces) represents our sample.
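The equilibrium can also be written down explicitly. As a sketch, assume $n$ attributes, anchor points $S_i$ on the unit circle, and non-negative, scaled values $K_i$ of one sample that act as spring stiffnesses. These symbols are introduced here for illustration. Setting the sum of the spring forces on the sample position $p$ to zero gives a weighted average of the anchor points:

```latex
S_i = \left(\cos\frac{2\pi i}{n},\ \sin\frac{2\pi i}{n}\right), \qquad i = 0, \dots, n-1

\sum_{i=0}^{n-1} K_i \,(S_i - p) = 0
\quad\Longrightarrow\quad
p = \frac{\sum_{i=0}^{n-1} K_i\, S_i}{\sum_{i=0}^{n-1} K_i}
```

A sample whose values are all equal therefore lands at the center, because the anchor points themselves sum to zero.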
All this can sound hard to imagine, so let’s try a concrete example instead. We need a sample; in this case, we will use one of the famous data sets – the Iris flower data set. It is a simple set looking like this:
The British statistician and biologist Ronald Fisher introduced this data set in 1936. It covers three species of Iris, together with their measurements.
First, we need to import visualization tools. Then, we import the CSV file to Python. The first few lines look like this:
# importing visualization tools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# preparing the data
colnames = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'Species']
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# loading data into a DataFrame (the file has no header row)
iris = pd.read_csv(url, names=colnames, header=None)

# peeking into the data
print(iris.head())

   sepal_length  sepal_width  petal_length  petal_width      Species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
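The file contains the three species mentioned above, with 50 samples each. Counting the values in the Species column (for example with iris['Species'].value_counts()) shows this:
Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: Species, dtype: int64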
To work with RadViz, we first need to import it from pandas.plotting. Then we can display our own visualization:
from pandas.plotting import radviz

# one anchor per numeric column, points colored by the Species column
radviz(iris, "Species")
The picture can be read as follows: the Iris-virginica samples lie close to the center of the circle, which suggests that their scaled measurements are fairly balanced across the four attributes. The Iris-setosa samples, on the other hand, are pulled towards the “sepal width” anchor, because their sepal width is large relative to their other measurements.
There are several principles on how RadViz displays the points. These are mainly:
- Points with equal coordinate values will lie close to the center.
- Points with similar values in opposite dimensions (anchors on opposite sides of the circle) will lie close to the center.
- Points with one or two coordinate values greater than the others lie closer to these dimensions.
Under the hood, RadViz works as follows:
- RadViz is a radial visualization that displays n-dimensional data points in a 2D plot.
- It uses so-called spring constants to represent the relation between a data point and each dimension.
- In the first step, the n dimensions are laid out as anchor points equally spaced around the perimeter of a circle.
- One end of each of the n springs is attached to one of these n perimeter points, whereas the other ends of the springs are connected to a data point.
- The spring constant Ki equals the value of the data point's i-th coordinate. Each data point is displayed where the sum of the spring forces equals 0.
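The spring model in the list above can be sketched in a few lines of plain Python. This is an illustrative reimplementation under stated assumptions, not the pandas source: the helper name `radviz_position` is made up, the input values are assumed to be already scaled to the range 0 to 1, and placing an all-zero sample at the center is a choice made here.

```python
import math

def radviz_position(values):
    """Return the (x, y) spring equilibrium of one sample.

    `values` holds one non-negative, already scaled number per dimension;
    each number acts as the spring constant Ki towards the i-th anchor.
    """
    n = len(values)
    total = sum(values)
    if total == 0:
        # Assumption made for this sketch: no spring pulls, so use the center.
        return (0.0, 0.0)
    # Anchor points, equally spaced around the unit circle.
    anchors = [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    # The forces cancel at the weighted average of the anchor points.
    x = sum(k * ax for k, (ax, _) in zip(values, anchors)) / total
    y = sum(k * ay for k, (_, ay) in zip(values, anchors)) / total
    return (x, y)

equal = radviz_position([0.5, 0.5, 0.5, 0.5])     # all coordinates equal
opposite = radviz_position([0.8, 0.0, 0.8, 0.0])  # similar values, opposite anchors
dominant = radviz_position([1.0, 0.1, 0.1, 0.1])  # one coordinate dominates

print(equal, opposite, dominant)
```

The first two samples end up at the center (up to floating-point rounding), while the third is pulled to roughly (0.69, 0.0), towards the anchor of its dominant coordinate. This matches the three display principles listed earlier.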