boxplot() – The Pandas.plotting Module

A box plot is a method used in statistics to graphically show a group, or groups, of numerical data with their quartiles identified. A box plot is often also called a box-and-whisker plot, as the plot may have lines extending from the box to show data outside the upper and lower quartiles.

In this article, we’ll quickly introduce you to the box plot and then show you how to use the function boxplot() from within the Pandas plotting module to create a plot from a .csv file.

What is a Boxplot?

A boxplot is a standard method of showing a dataset, highlighting five of the most important statistical measures. These are the minimum, maximum, median, and the first and third quartiles. The boxplot will also identify any data lying outside the minimum and maximum points, known as outliers.

You may be asking how any data can lie outside the minimum and maximum. In a box plot, these two points are calculated limits rather than the smallest and largest values in the data, and that’s where an understanding of the interquartile range comes in. The interquartile range, also known as the middle 50%, is a statistical measure of data dispersion. If you take the first quartile away from the third quartile, you get the interquartile range.

You use this figure to set the minimum and maximum points for the range of data. Multiply the interquartile range by 1.5 and subtract the result from the first quartile to calculate the minimum figure. Multiply the interquartile range by 1.5 and add the result to the third quartile to calculate the maximum figure. Any data that lie outside these minimum and maximum points are treated as outliers. Here’s a sketch to show you the components of the box plot.

Figure: Boxplot annotated with standard terminology
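
The calculation described above can be sketched in a few lines of pandas. The values below are made up for illustration (they are not taken from the penguin dataset), and the variable names are our own. Note that Matplotlib draws each whisker to the most extreme data point that still lies inside these limits, so the whisker ends may sit slightly inside the calculated figures.

```python
import pandas as pd

# Hypothetical measurements; 260 is deliberately extreme.
values = pd.Series([181, 186, 190, 193, 195, 197, 201, 210, 215, 260])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Minimum and maximum points as defined in the text.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside the limits is treated as an outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print("outliers:", outliers.tolist())
```

With these made-up values, only 260 falls outside the limits, so it is the single point a box plot would mark as an outlier.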

Using The Pandas Plotting Module’s boxplot() Function

The Pandas plotting module provides a set of plotting functions, one of which is boxplot(). It draws box plots directly from the columns of a data frame, which simplifies the analysis and graphical representation of a dataset.

We’ll be using the Palmer Archipelago (Antarctica) penguin dataset. It’s an ideal dataset for our purposes as it isn’t unwieldy yet nicely allows us to demonstrate the workings of this function.

The dataset looks at three different penguin species on three different Antarctic islands and captures the sex of each penguin, the length of its flippers, its weight, and the length and depth of its culmen (the upper ridge of its beak). We’ll run a box plot on the flipper length column and segregate the results by the penguin type and the island it came from.

The boxplot() Function Syntax

You will find information on the boxplot() function in the pandas documentation for pandas.plotting.boxplot. The syntax is straightforward, accepting the following parameters.

pandas.plotting.boxplot(data, column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, **kwargs)

Argument | Description
data | The data frame holding the data to plot. In our example, it is created from a .csv file.
column | Optional. A column name or a list of column names. If omitted, you’ll receive a set of box plots for all numeric columns in the data frame. In our code, we’ll select the flipper_length_mm column.
by | Optional. Allows you to group the box plot by one or more columns. In our case, we’ll want to see flipper length grouped by species and island.
ax | Optional. The Matplotlib axes on which to draw the box plot. We will not use this parameter in our code.
fontsize | Optional. Will accept a float number to indicate the font size in points, or a string, such as 'small', 'medium', or 'large'.
rot | Optional, with a default of 0. Rotates the labels by the given number of degrees, entered as an integer or float.
grid | Optional, with a default of True. True will show a grid on the box plot.
figsize | Optional. Will accept a tuple signifying the plot size in inches (width, height).
layout | Optional. Will accept a tuple signifying the number of rows and columns to use when displaying sub-plots (rows, columns).
return_type | Optional. Identifies the kind of object to return. 'axes' returns the matplotlib axes the boxplot is drawn on and is the default behaviour when no grouping is used. 'dict' returns a dictionary, the values of which are the matplotlib lines of the boxplot, while 'both' returns a named tuple with the axes and dict. When grouping with by, the returned object differs, so check the pandas documentation.
**kwargs | Allows entry of any other keyword arguments you may wish to pass to matplotlib’s boxplot function.
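
To see a few of these parameters working together before we move to the penguin data, here is a minimal sketch on a small made-up data frame. The frame `toy` and its column names are hypothetical. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display; remove it and add plt.show() if you want to view the plot interactively.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to view interactively
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for a real dataset.
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 9.0],
    "width": [2.0, 2.5, 3.0, 1.0, 1.5, 2.0],
})

# Create our own axes so we can pass them in through the ax parameter.
fig, ax = plt.subplots(figsize=(6, 4))

# No grouping here: one box is drawn per requested column.
result = pd.plotting.boxplot(
    toy,
    column=["length", "width"],
    ax=ax,
    rot=45,              # rotate the column labels by 45 degrees
    grid=False,
    fontsize=10,
    return_type="both",  # named tuple holding the axes and the dict of lines
)

print(result.ax is ax)             # the plot was drawn on the axes we supplied
print(len(result.lines["boxes"]))  # one box per column
```

Asking for return_type='both' is handy when you want to restyle individual parts of the plot afterwards, as the dictionary gives you direct access to the Matplotlib line objects.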

Using boxplot()

First, we need to import Pandas and create a data frame from the .CSV file saved to our computer.

We will also use matplotlib.pyplot to display the plot, so we’ll import that too.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\David\\Downloads\\penguins_size.csv')

Now we simply need to call the boxplot() function with the various parameters inserted.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\David\\Downloads\\penguins_size.csv')

pd.plotting.boxplot(df, column=['flipper_length_mm'], by=['island', 'species'],
                    grid=False, figsize=(25, 18), fontsize=15)

plt.show()

We’ll run that and here’s the result.

This plot shows inter-island differences in flipper length among the Adelie penguins and the difference between the Chinstrap and Adelie penguins on Dream Island. Note the outliers on three of the five box plots.
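
If you’d like the numbers behind each box, one option is to group the same column with groupby() and describe(). The sketch below uses a few made-up rows that borrow the dataset’s column names, so the figures it prints are not real penguin measurements. Swap in the df loaded from the .csv file to get the actual values.

```python
import pandas as pd

# Hypothetical stand-in rows using the dataset's column names.
sample = pd.DataFrame({
    "island": ["Dream", "Dream", "Dream", "Dream", "Biscoe", "Biscoe"],
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Adelie", "Adelie"],
    "flipper_length_mm": [185.0, 191.0, 195.0, 201.0, 188.0, None],
})

# One row per (island, species) pair, matching one box on the plot.
# Missing values are skipped when the statistics are calculated.
summary = (
    sample.groupby(["island", "species"])["flipper_length_mm"]
    .describe()[["count", "25%", "50%", "75%"]]
)

print(summary)
```

The 25%, 50%, and 75% columns correspond to the bottom of the box, the median line, and the top of the box for each island and species combination.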

As a final note, there’s a shorthand way of calling boxplot(), which is to use the data frame’s own boxplot() method, as in the following. Both will produce the same plot. I used the longhand method as it aligns with the syntax you’ll see on the pandas.plotting module page.

df.boxplot(column=['flipper_length_mm'], by=['island', 'species'], 
           grid=False, figsize=(25,18), fontsize=15)

In Summary

We talked about the box plot as a method used in statistics to graphically show a group, or groups, of numerical data with their quartiles identified. You may also hear the box plot called a box-and-whisker plot.

Before introducing the Pandas plotting module function, boxplot(), we gave a quick overview of box plots and described their characteristics. Then we wrote some code using boxplot() and matplotlib.pyplot to interrogate the penguin dataset and produced a box plot of the flipper length column, grouped by island and species, allowing analysis.

