RadViz in Pandas Plotting – How It Works

▶ Try It Yourself: You can run all code snippets in this article yourself in our interactive Jupyter notebook.

Here’s what the end result of this short tutorial will look like. Beautiful, isn’t it?

Let’s have a quick look at the parameters and syntax first.

RadViz Parameters and Syntax

pandas.plotting.radviz(frame, class_column, ax=None, color=None, colormap=None, **kwargs)
Parameter       Description
frame           The data you want plotted. The documentation suggests normalizing the data range to between 0.0 and 1.0.
class_column    The name of the column that contains the class names.
ax              The Matplotlib axes object to draw on, which defaults to None.
color           The colors to assign to the categories, one per category (e.g., as a list).
colormap        The colormap from which the plot colors are selected, with a default of None.
**kwargs        Options that are passed on to the Matplotlib scatter plotting method.
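
To make the parameters more tangible, here is a minimal sketch of a call that sets `ax`, `color`, and two scatter options. The toy DataFrame, its column names, and the chosen colors are assumptions made up for illustration; they are not part of the Iris example used later in this article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import radviz

# Hypothetical toy data: three numeric features plus one class column.
toy = pd.DataFrame({
    "a": [0.1, 0.9, 0.5, 0.3],
    "b": [0.8, 0.2, 0.5, 0.7],
    "c": [0.4, 0.6, 0.5, 0.1],
    "label": ["x", "y", "x", "y"],
})

# Draw on an axes object we create ourselves instead of the current one.
fig, ax = plt.subplots(figsize=(5, 5))

# One color per class; s and alpha are forwarded to Matplotlib's scatter().
result_ax = radviz(toy, "label", ax=ax,
                   color=["tab:blue", "tab:orange"],
                   s=40, alpha=0.8)
```

The function returns the axes it drew on, so the plot can be customized further (title, legend position, and so on) with the usual Matplotlib calls.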

Working with RadViz

Pandas is a Python library for working with tabular data. It is often used for ingesting, organizing, and analyzing large data sets, and it provides tools for data wrangling, cleaning, and manipulation. Data plotting is one of these tools, too.

RadViz is useful when the data have more than three dimensions. Thanks to RadViz, a data scientist can visualize an N-dimensional data set in a 2D plot.

RadViz places the feature dimensions uniformly around the circumference of a circle. Then, it plots each sample as a point in the interior of the circle, normalizing the sample's values along the axes that run from the center to each feature's position on the circumference.

💡 That may sound a bit abstract, though. Essentially, RadViz sets up a group of points in a plane. These points are equally spaced on a unit circle, and each of them represents a single attribute. Every sample in the data set is attached to each of these points by a spring whose stiffness is proportional to the sample's numerical value for that attribute. The position in the plane where the sample comes to rest (i.e., the “equilibrium” of these pulls) represents our sample.
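
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the pandas source code: the function name `radviz_point` is made up, and the sketch assumes the values of a sample have already been scaled to the range 0.0 to 1.0.

```python
import math

def radviz_point(values):
    """Project one sample (values assumed to be scaled to [0, 1]) into the unit circle."""
    n = len(values)
    # Anchor i (one per attribute) sits at angle 2*pi*i/n on the unit circle.
    anchors = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n)]
    total = sum(values)
    if total == 0:
        raise ValueError("at least one value must be greater than zero")
    # The sample lands at the average of the anchors, weighted by its values.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return x, y

# Only the first attribute pulls: the sample lands on the first anchor.
print(radviz_point([1.0, 0.0, 0.0, 0.0]))

# All attributes pull equally: the sample lands (up to rounding) at the center.
print(radviz_point([0.5, 0.5, 0.5, 0.5]))
```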

All this can be hard to imagine, so let’s try a concrete example instead. We need some sample data; in this case, we will use one of the most famous data sets – the Iris flower data set. It is a simple set of flower measurements.

The British statistician and biologist Ronald Fisher introduced this data set in 1936. It captures three species of Iris, together with their measurements.

First, we need to import the visualization tools. Then, we load the CSV file into a DataFrame and print its first few rows:

# importing visualization tools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # pyplot, not the top-level matplotlib package

# preparing the data
colnames = ['sepal_length', 
            'sepal_width', 
            'petal_length', 
            'petal_width', 
            'Species']
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# loading data into a DataFrame
iris = pd.read_csv(url, names=colnames, header=None)

# peeking into the data
print(iris.head())

The output:

   sepal_length  sepal_width  petal_length  petal_width      Species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

As mentioned above, the file contains three species:

print(iris['Species'].value_counts())

Output:

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: Species, dtype: int64

To work with RadViz, we first need to import it from pandas.plotting. Then we can display our own visualization:

from pandas.plotting import radviz
radviz(iris, "Species")

The picture can be interpreted as follows: the Iris-virginica samples lie close to the center of the circle, which means no single measurement dominates them. The Iris-setosa samples, on the other hand, are pulled towards the “sepal width” parameter of these flowers.

RadViz follows a few principles when it places the points. These are mainly:

  • Points with equal coordinate values will lie close to the center.
  • Points with similar values on opposite dimensions will lie close to the center.
  • Points with one or two coordinate values greater than the others lie closer to these dimensions.
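
These three principles can be checked numerically with a small sketch. The helpers `project` and `radius` are hypothetical names introduced here for illustration, and the sample values are made up; four attributes are assumed, so attributes 0 and 2 sit on opposite sides of the circle.

```python
import math

def project(values):
    """Weighted average of the unit-circle anchors (hypothetical helper)."""
    n = len(values)
    total = sum(values)
    x = sum(v * math.cos(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    y = sum(v * math.sin(2 * math.pi * i / n) for i, v in enumerate(values)) / total
    return x, y

def radius(values):
    """Distance of the projected sample from the center of the circle."""
    return math.hypot(*project(values))

equal = radius([0.7, 0.7, 0.7, 0.7])     # all coordinate values equal
opposite = radius([0.9, 0.1, 0.9, 0.1])  # similar values on opposite dimensions
dominant = radius([0.9, 0.1, 0.1, 0.1])  # one coordinate value dominates

print(equal, opposite, dominant)
```

In this sketch, the first two samples land at the center (up to floating-point rounding), while the third one is pulled about two thirds of the way towards the anchor of its dominant attribute.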

Summary

  • RadViz belongs to the radial visualizations, which make it possible to display n-dimensional data points in a 2D visualization.
  • It uses so-called spring constants to represent how strongly each dimension pulls on a data point.
  • In the first step, the n dimensions are laid out as points equally spaced around the perimeter of a circle.
  • One end of each of the n springs is attached to one of these n perimeter points, whereas the other ends of the springs are connected to a data point.
  • The spring constant Ki equals the value of the i-th coordinate of the data point. Each data point will be displayed where the sum of spring forces equals 0.
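
The spring picture can be verified with a short sketch, assuming ideal springs of zero rest length, so that spring i pulls with a force of Ki times the vector from the data point to its perimeter point. Under that assumption, the forces cancel exactly at the weighted average of the perimeter points. The function names and the spring constants below are made up for illustration.

```python
import math

def perimeter_points(n):
    """n points equally spaced on the unit circle, one per dimension."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def equilibrium(k):
    """Weighted average of the perimeter points, with the spring constants as weights."""
    pts = perimeter_points(len(k))
    total = sum(k)
    x = sum(ki * px for ki, (px, _) in zip(k, pts)) / total
    y = sum(ki * py for ki, (_, py) in zip(k, pts)) / total
    return x, y

def net_force(k, pos):
    """Sum of the spring forces k_i * (perimeter_point_i - pos) acting on pos."""
    pts = perimeter_points(len(k))
    fx = sum(ki * (px - pos[0]) for ki, (px, _) in zip(k, pts))
    fy = sum(ki * (py - pos[1]) for ki, (_, py) in zip(k, pts))
    return fx, fy

# Hypothetical spring constants, i.e. the normalized coordinates of one data point.
k = [0.2, 0.9, 0.4, 0.6, 0.1]
pos = equilibrium(k)
fx, fy = net_force(k, pos)
print(pos, fx, fy)  # both force components vanish up to floating-point rounding
```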