A box plot is a method used in statistics to graphically show a group, or groups, of numerical data with their quartiles identified. A box plot is often also called a box-and-whisker plot, as the plot may have lines extending from the box to show data outside the upper and lower quartiles.
In this article, we’ll quickly introduce you to the box plot and then show you how to use the boxplot() function from the Pandas plotting module to create one.
What is a Boxplot?
A boxplot is a standard way of showing a dataset that highlights five of its most important statistical measures: the minimum, the maximum, the median, and the first and third quartiles. The boxplot also identifies any data lying outside the minimum and maximum points, known as outliers.
You may be asking how figures can lie outside the minimum and maximum. That’s where an understanding of the interquartile range comes in. The interquartile range, also known as the middle 50%, is a statistical measure of data dispersion. If you subtract the first quartile from the third quartile, you get the interquartile range.
You use this figure to set the minimum and maximum points for the range of data. Multiply 1.5 by the interquartile range and subtract it from the first quartile to calculate the minimum figure. Multiply 1.5 by the interquartile range and add it to the third quartile to calculate the maximum figure. Any data that lie outside these minimum and maximum points are treated as outliers. Here’s a sketch to show you the components of the box plot.
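The calculation described above can be sketched in a few lines of Pandas. The values below are made up purely for illustration and do not come from the penguin dataset. Note that quantile() uses linear interpolation by default, so other quartile conventions may give slightly different figures.

```python
import pandas as pd

# Made-up sample values for illustration only (not from the penguin dataset).
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 40])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

lower_limit = q1 - 1.5 * iqr  # the "minimum" point described above
upper_limit = q3 + 1.5 * iqr  # the "maximum" point described above

# Anything outside the two limits is treated as an outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print("outliers:", outliers.tolist())
```

When Matplotlib draws the plot, each whisker stops at the most extreme data point that still lies inside these limits, so the whisker ends may sit slightly inside the calculated values.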
Using The Pandas Plotting Module’s boxplot() Function
The Pandas plotting module has a library of statistical functions, one of which is boxplot(). It simplifies the analysis and graphical representation of the columns in a dataset.
We’ll be using the Palmer Archipelago (Antarctica) penguin dataset. It’s an ideal dataset for our purposes as it isn’t unwieldy yet nicely allows us to demonstrate the workings of this function.
The dataset looks at three different penguin species on three different Antarctic islands and captures the sex of each penguin, the length of its flippers, its weight, and the length and depth of its culmen (the upper ridge of its beak). We’ll run a box plot on the flipper length column and segregate the results by the penguin type and the island it came from.
The boxplot() Function Syntax
You will find information on the boxplot() function in the Pandas documentation. The syntax is straightforward, accepting the following parameters.
pandas.plotting.boxplot(data, column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, **kwargs)
|Parameter|Description|
|---|---|
|data|The data frame created from whatever file format in which you stored the original data. In our example, a .CSV file.|
|column|Optional. If omitted, you’ll receive a set of box plots for all numeric columns in the data frame. In our code, we’ll select the flipper_length_mm column.|
|by|Optional. Allows you to group the box plot by one or more columns. In our case, we’ll want to see flipper length grouped by species and island.|
|ax|Optional. Pass an existing Matplotlib axes object if you want the box plot drawn on it. We will not use this parameter in our code.|
|fontsize|Optional. Will accept a float number to indicate the font size in points, or a string, such as 'large'.|
|rot|Optional, with a default of 0. It allows label rotation by the entry of an integer or float, representing degrees.|
|grid|Optional, with a default of True. True will show a grid on the box plot.|
|figsize|Optional. Will accept a tuple signifying the plot size in inches (width, height).|
|layout|Optional. Will accept a tuple signifying the number of rows and columns to use when displaying sub-plots (rows, columns).|
|return_type|Optional. Identifies the kind of object to return: 'axes', 'dict', 'both', or None. The default is 'axes'.|
|**kwargs|Allows entry of any other keyword arguments you may wish to pass to Matplotlib.|
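As a rough sketch of how a few of these parameters combine, the snippet below plots two columns of a small made-up data frame. The column names a and b and their values are placeholders invented for this example, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data invented for this example.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 30],
})

# return_type='dict' hands back the Matplotlib artists that make up the plot.
parts = pd.plotting.boxplot(
    df,
    column=["a", "b"],
    rot=45,           # rotate the x-axis labels by 45 degrees
    grid=False,       # hide the grid
    figsize=(6, 4),   # width and height in inches
    fontsize=12,      # tick label size in points
    return_type="dict",
)

# The dictionary groups the artists by role, such as boxes, medians and fliers.
median_a = parts["medians"][0].get_ydata()[0]
print(sorted(parts.keys()))
print("median of a:", median_a)
```

With the backend line removed, a call to plt.show() after the plot would display the figure as in the rest of this article.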
First, we need to import Pandas and create a data frame from the .CSV file saved to our computer. We will also use matplotlib.pyplot to display the graph, so let’s import that too.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\David\\Downloads\\penguins_size.csv')
Now we simply need to call the boxplot() function with the various parameters inserted.
import pandas as pd
import matplotlib.pyplot as plt

# Load the penguin data into a data frame
df = pd.read_csv('C:\\Users\\David\\Downloads\\penguins_size.csv')

# Plot flipper length, grouped by island and species
pd.plotting.boxplot(df, column=['flipper_length_mm'], by=['island', 'species'], grid=False, figsize=(25,18), fontsize=15)

# Display the figure
plt.show()
We’ll run that and here’s the result.
This plot shows inter-island differences in flipper length between the Adelie penguins and the difference between the Chinstrap and Adelie penguins on Dream island. Note the outliers on three of the five box plots.
As a final note, there’s a shorthand way of calling boxplot() as a method of the data frame itself. Both forms produce the same plot; I used the longhand version as it aligns with the syntax you’ll see on the pandas.plotting module page. The shorthand looks like this:
df.boxplot(column=['flipper_length_mm'], by=['island', 'species'], grid=False, figsize=(25,18), fontsize=15)
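To check that the two call styles behave alike without needing the CSV file, here is a minimal sketch on a tiny made-up data frame. The rows are invented placeholders that only borrow the dataset’s column names; they are not real measurements. Each call draws onto its own Matplotlib axes, passed through the ax parameter, so the two results can be compared.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented placeholder rows; only the column names match the penguin dataset.
df = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Dream", "Dream", "Dream"],
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181.0, 186.0, 190.0, 192.0, 196.0, 201.0],
})

options = dict(
    column=["flipper_length_mm"],
    by=["island", "species"],
    grid=False,
    fontsize=10,
)

fig1, ax1 = plt.subplots()
pd.plotting.boxplot(df, ax=ax1, **options)  # longhand: module-level function

fig2, ax2 = plt.subplots()
df.boxplot(ax=ax2, **options)               # shorthand: data frame method

# Compare how many line artists and tick labels each call produced.
print(len(ax1.lines), len(ax2.lines))
print(len(ax1.get_xticklabels()), len(ax2.get_xticklabels()))
```

If the two forms are equivalent, as described above, both print statements should show matching pairs of numbers.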
We talked about the box plot as a method used in statistics to graphically show a group, or groups, of numerical data with their quartiles identified. You may also hear the box plot called a box-and-whisker plot.
Before introducing the Pandas plotting module function boxplot(), we gave a quick overview of box plots and described their characteristics. Then we wrote some code using Pandas and matplotlib.pyplot to interrogate the penguin dataset and produced a box plot of the flipper length column, grouped by island and species, allowing analysis.
David is a Python programmer and a technical writer creating in-depth articles for readers wanting uncomplicated explanations for topics made difficult by industry jargon. Also a woodworker, metalworker, landscape photographer, and pilot, he is freelance after 42 years in the corporate world. He has an MBA in Technology.