A bootstrap plot is a graphical representation of uncertainty in a characteristic chosen from within a population. While we can usually calculate data confidence levels mathematically, gaining access to the desired characteristics from some populations is impossible or impracticable. In this case, bootstrap sampling and the bootstrap plot come to our aid.
This article will introduce the concept of bootstrap sampling and then investigate the Pandas plotting module function bootstrap_plot(). We'll then use it to create plots for the mean, median, and mid-range statistics of a given dataset.
What Is Bootstrap Sampling?
Suppose we wish to know the average age of the people in a particular football stadium on a specific day. Stopping each person and getting their age would be impracticable, delaying the game and angering many people.
Instead, at each of the four entry gates, we could take a random sample of five people and average their ages. Repeating this 50 times per gate gives us a reasonable estimate of the average age of the attending fans, and does so efficiently.
As a side note, in statistics, this process is called 'sampling with replacement' because there is the possibility that a fan leaves and returns through another gate, and we record their age twice. If we were to use 'sampling without replacement', we'd need a way to identify each individual so that we could exclude them from further sampling.
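The stadium example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ages are randomly generated rather than real data, and the names `crowd_ages` and `bootstrap_means` are hypothetical. It draws 200 samples of five ages each (four gates, 50 repeats per gate) with replacement and averages the sample means.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical crowd: 20,000 made-up ages between 5 and 85 (not real data).
crowd_ages = [random.randint(5, 85) for _ in range(20_000)]


def bootstrap_means(values, size, samples):
    """Return `samples` means, each computed from `size` values drawn with replacement."""
    return [
        statistics.mean(random.choices(values, k=size))  # choices() samples with replacement
        for _ in range(samples)
    ]


# Four gates x 50 repeats per gate = 200 samples of five people each.
means = bootstrap_means(crowd_ages, size=5, samples=200)
estimate = statistics.mean(means)
spread = statistics.stdev(means)

print(f"Estimated average age: {estimate:.1f}")
print(f"Spread (standard deviation) of the sample means: {spread:.1f}")
```

The spread of the sample means is the quantity a bootstrap plot visualizes: the narrower it is, the less uncertain the estimate.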
The bootstrapping technique was developed by Bradley Efron; you can read more about it on this Wikipedia page.
Using a Bootstrap Plot
A bootstrap plot lets us take a large dataset and conduct the required sampling on a particular characteristic. In this article, we will use the Brazilian E-Commerce Public Dataset, which consists of orders made at Olist Stores. The dataset has information on 100k orders made from 2016 to 2018 at multiple marketplaces in Brazil. It is real data that has been anonymized, with references to the companies and partners replaced with the names of Game of Thrones great houses. You can download the dataset here. We'll be using a subset of that data, called olist_order_payments_dataset.csv.
Using bootstrap_plot(), we will look at the payment value characteristic: the value of more than 100,000 purchases made in Brazilian Reais. We choose both the size of each sample and the number of resamples carried out.
Syntax bootstrap_plot()
Information on the bootstrap_plot() function may be found here. The syntax of the function is quite simple, as follows:
pandas.plotting.bootstrap_plot(series, fig=None, size=50, samples=500, **kwds)
Argument | Description |
---|---|
series | Dataset and characteristic you wish to be sampled |
fig | Defaults to None. If given, the plot is drawn on this figure instead of a newly created one. See matplotlib.figure.Figure() for details |
size | Size of each sample. Defaults to 50 |
samples | Number of samples to take. Defaults to 500 |
**kwds | Optional keyword arguments passed on to the matplotlib plotting methods |
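To see how these arguments fit together, here is a minimal sketch that calls bootstrap_plot() with every argument set explicitly. It rests on a few assumptions: the data is a made-up synthetic Series rather than the Olist file, the names `values` and `my_fig` are hypothetical, and the non-interactive "Agg" backend is selected only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, made-up values standing in for a real column (assumption).
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=150.0, size=5_000), name="value")

# Create our own Figure and hand it over through the `fig` argument.
my_fig = plt.figure(figsize=(10, 6))

# `size` is the number of values per sample, `samples` the number of samples.
# `color` is an extra keyword argument forwarded to matplotlib.
result = pd.plotting.bootstrap_plot(
    values, fig=my_fig, size=100, samples=300, color="teal"
)

# The function returns the Figure it drew on.
print(type(result).__name__, "with", len(result.axes), "axes")
```

Passing your own figure is handy when you want to control its size up front or save it later with `my_fig.savefig(...)`.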
Using bootstrap_plot()
First, we need to import Pandas and create a data frame from the .csv file saved to our computer. We will also use matplotlib.pyplot to plot the graph, so that too should be imported.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Make a data frame from our csv file
df = pd.read_csv('C:\\Users\\david\\downloads\\olist_order_payments_dataset.csv')
```
We use the bootstrap_plot() function to sample the data frame, referencing the column we wish to sample. I've set the sample size to 200, with resampling carried out 500 times. I've also specified the color I wish to see used for the plot.
Finally, we ask matplotlib.pyplot to show the plot.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Make a data frame from our csv file
df = pd.read_csv('C:\\Users\\david\\downloads\\olist_order_payments_dataset.csv')

# Sample the payment_value column: 500 samples of 200 values each
x = pd.plotting.bootstrap_plot(df["payment_value"], size=200, samples=500, color="teal")

# Display the resulting figure
plt.show()
```
When we run that, we receive the following output.
This plot allows us to see the sampling distribution of each statistic, estimate the 95% confidence interval, and see which statistic has the sampling distribution with the smallest variance. From these plots, we can read off a mean spend of about 144 Reais, with lower and upper confidence limits of 112 and 241 respectively, and a median of 101.
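Reading limits off a histogram is approximate, so it can help to compute them directly. The sketch below estimates a 95% percentile interval for the mean by resampling with replacement. It is an illustration under assumptions, not a reproduction of the figures above: the values are randomly generated to stand in for the payment_value column, the names `payments` and `boot_means` are hypothetical, and the resampling here is written out by hand rather than taken from bootstrap_plot(), which may draw its samples differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Made-up, right-skewed values standing in for df["payment_value"] (assumption).
payments = pd.Series(rng.lognormal(mean=4.5, sigma=0.9, size=10_000))
data = payments.to_numpy()

size, samples = 200, 500  # same settings as used for the plot

# Draw `samples` resamples of `size` values each, with replacement,
# and record the mean of every resample.
boot_means = np.array([
    rng.choice(data, size=size, replace=True).mean()
    for _ in range(samples)
])

# The 2.5th and 97.5th percentiles bound the middle 95% of the resampled means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"Mean of resampled means: {boot_means.mean():.2f}")
print(f"95% percentile interval: {lower:.2f} to {upper:.2f}")
```

To try this on the real data, replace the synthetic `payments` Series with `df["payment_value"]`.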
In Summary
We talked about the bootstrap plot as a graphical representation of uncertainty in a characteristic chosen from within a population, where gaining access to all the desired characteristics is impossible or impracticable.
Before introducing the Pandas plotting module function bootstrap_plot(), we gave a quick overview of bootstrap sampling. Then we wrote some code using bootstrap_plot() and matplotlib.pyplot to carry out sampling of a large dataset and produce a bootstrap plot allowing analysis.