This article explains how to calculate percentiles. Percentiles are statistical indicators used to describe specific portions of a sample population. The following sections explain what percentiles are, what they are used for, and how to calculate them in Python. As you will see, Python lets you solve this problem in more than one way: by defining a function yourself or by using NumPy.
What Are Percentiles?
Percentiles are statistical indicators that are often used to identify a certain part of a sample population. More precisely, a percentile indicates the value (of the variable under consideration) below which a specific percentage of the sample population falls. For example, consider the height distribution of all the English people living in the UK: saying that a height of 180 cm identifies the 65th percentile means that 65% of all the English people living in the UK are shorter than 180 cm. As you can imagine, percentiles are commonly used in statistical studies and when reporting the results of surveys or measurements on large populations.
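To make the definition concrete, here is a minimal sketch that checks it on simulated data. The heights are hypothetical (the mean of 172 cm and the standard deviation of 9 cm are arbitrary choices), and the percentile is located with the simple rank rule explained in the next section.

```python
import math
import random

random.seed(42)  # fixed seed so the run is repeatable
# Hypothetical sample: 1,000 heights in cm drawn from a normal distribution
heights = [random.gauss(172, 9) for _ in range(1000)]

ordered = sorted(heights)
# 1-based position of the 65th percentile in the sorted sample, rounded up
rank = math.ceil(len(ordered) * 65 / 100)
p65 = ordered[rank - 1]

# Share of the sample at or below that value
share = sum(h <= p65 for h in heights) / len(heights)
print(f"65th percentile: {p65:.1f} cm, share at or below: {share:.0%}")
```

The share printed at the end should come out at 65%, which is what the 65th percentile stands for.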
How to Calculate Percentiles?
Let’s assume we have collected the heights of n = 58 people. To evaluate the percentiles of this distribution, the first step is to sort all the values in ascending order. Now suppose we are asked to calculate the 75th percentile: we first compute the so-called rank k = percentile/100, in this case k = 75/100 = 0.75. We then multiply the rank by the total number of samples in the distribution (n, in this case 58), obtaining k x n = 0.75 x 58 = 43.5. Since the result is not a whole number, we round it up to the next whole number (44 in this case). The next step is to find the height value at the 44th position in the sorted distribution; that value is the 75th percentile. If the result of k x n is already a whole number, we skip the rounding and directly take the value at that position in the sorted distribution; that value is our percentile.
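The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the worked example: the list of heights is made up (58 evenly spaced values) so that the result is easy to check by hand.

```python
import math

n = 58            # number of collected heights
percentile = 75   # percentile we want to locate

k = percentile / 100          # rank: 0.75
position = k * n              # 43.5, not a whole number
rank = math.ceil(position)    # rounded up to position 44 (counting from 1)

# Hypothetical sorted sample: 58 heights starting at 150 cm, in 0.5 cm steps
heights = [150 + 0.5 * i for i in range(n)]

# Lists are indexed from 0, so the 44th position is index 43
value = heights[rank - 1]
print(position, rank, value)
```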
Calculate Percentiles in Python
Now that we know what percentiles are and how they can be calculated, we will see how Python makes this task quick and easy. In the first part, we will solve the problem by defining a function that executes all the steps illustrated in the previous section; in the second part, we will use the NumPy built-in function .percentile().
Importing the Appropriate Libraries
We start our script by importing the libraries that will be used throughout the example.
We need to import:
- math, to round floating-point numbers up to the nearest integer,
- NumPy, for the function .percentile(),
- Matplotlib, for the final part, in which we will plot the values of the percentiles on the distribution.
import numpy as np
import math
import matplotlib.pyplot as plt
Writing a Python Function for Calculating Percentiles
In this first section we will see how to build a function for calculating percentiles. The aim of this section is purely didactic: as you will see later on, Python offers ready-made libraries that solve the task automatically. However, it’s always important to understand how the problem gets solved and how a specific Python function works.
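def my_percentile(data, percentile):
    # Size of the sample
    n = len(data)
    # Position of the percentile in the sorted sample (counted from 1, may be fractional)
    p = n * percentile / 100
    if p.is_integer():
        # Whole number: take the p-th value, i.e. index p - 1 (kept at 0 for percentile 0)
        return sorted(data)[max(int(p) - 1, 0)]
    else:
        # Otherwise round up to the next position, then shift to 0-based indexing
        return sorted(data)[int(math.ceil(p)) - 1]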
We start by defining our function my_percentile, which takes as input parameters the sample distribution and the percentile that we want to calculate. As detailed above, the first step is to evaluate the size of our distribution (n); then we compute the product p of the sample size and the rank.
At this point we need an if statement to separate the case in which k x n is a whole number from the case in which it is not. We use the Python method .is_integer() to evaluate whether p is a whole number; this method returns True in the positive case. If it returns True, we look up the p-th value in our distribution (sorted in ascending order). To sort the distribution in ascending order, we use the function sorted() and pass it the distribution itself. The important thing to remember is to convert p from float (since it comes from the division done in the previous line) to integer; otherwise you will get an error saying that list indices must be integers.
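The index error mentioned above is easy to reproduce. In this small sketch, the list and the numbers in it are hypothetical; the point is only that a whole-number float such as 2.0 is still rejected as a list index until it is converted with int().

```python
values = [150, 160, 170, 180]   # hypothetical sorted sample
p = len(values) * 50 / 100      # 2.0: a whole number, but stored as a float

try:
    values[p]                   # indexing a list with a float is not allowed
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# After the conversion the lookup works; p - 1 is the index of the p-th value
second = values[int(p) - 1]
print(second)
```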
We conclude with an else statement, which covers the case in which the value of p is not a whole number; in this case, by using the function .ceil() (from the math library), we round p up to the next integer. We then convert this number to an integer and subtract 1 in order to match the zero-based indexing used in lists. The function shown above contains all the steps explained in this section.
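A quick way to see both branches at work is to run the same steps on a sample small enough to check by hand. The sketch below restates the procedure under a hypothetical name, nearest_rank, and uses five made-up values; it takes the p-th value when p is a whole number and rounds p up otherwise.

```python
import math

def nearest_rank(data, percentile):
    """Hypothetical restatement of the steps described above."""
    ordered = sorted(data)
    p = len(ordered) * percentile / 100
    if p.is_integer():
        # Whole number: the p-th value sits at index p - 1
        return ordered[max(int(p) - 1, 0)]
    # Fractional: round up to the next position, then shift to 0-based indexing
    return ordered[math.ceil(p) - 1]

scores = [35, 20, 15, 50, 40]   # sorted: [15, 20, 35, 40, 50]
results = {q: nearest_rank(scores, q) for q in (30, 40, 50, 100)}
print(results)
```

For the 30th percentile p is 1.5, which is rounded up to position 2 (value 20); for the 40th percentile p is exactly 2, so the second value (20) is taken directly.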
Calculating Percentiles Using Our Function
To check whether our function works correctly, we first have to define a distribution of values. To do that, we can use the NumPy function .random.randn(), which draws random values from the standard normal distribution; we just have to pass the size of the array as the input parameter. We choose to create an array of 10000 values.
dist = np.random.randn(10000)
Let’s now calculate the values of the 5th, 25th, 50th, 75th and 95th percentiles. We first define a list called “index” that contains the percentiles we are interested in. We then use a list comprehension to call the function my_percentile() for each percentile in “index”, and store the results in a list called “perc_func”.
index = [5, 25, 50, 75, 95]
perc_func = [my_percentile(dist, i) for i in index]
At this point, the list “perc_func” should contain the values corresponding to all the percentiles listed in the list “index”.
Calculating Percentiles Using numpy.percentile()
Now that we know how to calculate the percentiles of a distribution, we can also use the NumPy built-in function to do it more quickly and efficiently.
The .percentile() function takes as input parameters the sample distribution and the percentile that we are interested in. It also lets us decide which method should be used when the percentile falls between two values of the sample; indeed, there is not a single correct way to handle this case. Previously we rounded the position up to the next integer; however, we could also round it to the nearest integer, always take the lower or the higher neighbour, or take the mean of the two neighbouring values.
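All these options can be selected by choosing among these keywords for the option “interpolation”: ['linear', 'lower', 'higher', 'midpoint', 'nearest']. In NumPy 1.22 and later the same option is called “method”, and “interpolation” is deprecated.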
You can find the complete documentation of the .percentile() function in the official NumPy reference.
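To get a feeling for how the five options differ, the following pure-Python sketch mimics them on a tiny sample. It is an illustration, not NumPy's actual code: the function name pick is hypothetical, and the sketch assumes that the percentile q is placed at the fractional index (n - 1) x q / 100 of the sorted data, which is how the NumPy documentation describes these options.

```python
import math

def pick(data, q, option="linear"):
    """Illustrative only: mimics how the five options choose a value."""
    ordered = sorted(data)
    pos = (len(ordered) - 1) * q / 100      # fractional 0-based index
    lo, hi = math.floor(pos), math.ceil(pos)
    if option == "lower":
        return ordered[lo]
    if option == "higher":
        return ordered[hi]
    if option == "nearest":
        return ordered[round(pos)]
    if option == "midpoint":
        return (ordered[lo] + ordered[hi]) / 2
    # "linear": weight the two neighbours by the fractional part of the index
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

data = [10, 20, 30, 40]
options = ("linear", "lower", "higher", "midpoint", "nearest")
results = {opt: pick(data, 40, opt) for opt in options}
print(results)
```

For the 40th percentile of these four values the fractional index is 1.2, so the options return 22 (linear), 20 (lower), 30 (higher), 25 (midpoint) and 20 (nearest). Note that this index differs from the position n x k used in our own function, which is one reason why the two approaches can give slightly different values.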
The different options may lead to slightly different results. We choose the option “nearest”, since it is the closest to the method used in the function “my_percentile”. As we did in the previous section, we create a list called “perc_numpy” in which we store the values of the 5th, 25th, 50th, 75th and 95th percentiles, this time evaluated using NumPy. The following lines of code implement this procedure.
# Using numpy for calculating percentiles
# (NumPy 1.22+ names this keyword method= instead of interpolation=)
perc_numpy = [np.percentile(dist, i, interpolation='nearest') for i in index]
We can now print the two lists and check whether the obtained results are equal.
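One possible way to run this check is sketched below. It restates the function (taking the p-th value when p is a whole number) and rebuilds the two lists so that it can run on its own. The fixed seed, the try/except that picks between the method and interpolation keywords depending on the installed NumPy version, and the max_gap variable are additions made for this sketch. Because our function and NumPy do not place the percentile at exactly the same position, the two lists may differ by one neighbouring sample, so the sketch prints the largest difference instead of testing for strict equality.

```python
import math
import numpy as np

def my_percentile(data, percentile):
    # Same rank-based steps as described earlier in the article
    ordered = sorted(data)
    p = len(ordered) * percentile / 100
    if p.is_integer():
        return ordered[max(int(p) - 1, 0)]
    return ordered[math.ceil(p) - 1]

np.random.seed(0)               # fixed seed (an addition) for repeatable output
dist = np.random.randn(10000)
index = [5, 25, 50, 75, 95]

perc_func = [my_percentile(dist, q) for q in index]
try:
    # NumPy 1.22 and later
    perc_numpy = [np.percentile(dist, q, method="nearest") for q in index]
except TypeError:
    # Older NumPy releases only know the interpolation keyword
    perc_numpy = [np.percentile(dist, q, interpolation="nearest") for q in index]

for q, mine, ref in zip(index, perc_func, perc_numpy):
    print(f"{q:>2}th  function: {mine: .4f}  numpy: {ref: .4f}")

# Largest gap between the two lists
max_gap = max(abs(a - b) for a, b in zip(perc_func, perc_numpy))
print("largest difference:", max_gap)
```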
Plotting the Percentiles on the Distribution
At the beginning of the article, we defined what percentiles represent.
Since statistical definitions can be rather difficult to grasp, we can display our distribution of values and see where the calculated percentiles are located in the distribution.
To do that, we use Matplotlib and the function .axvline(), which draws vertical lines on a plot. We place the function axvline() into a for loop in order to create a vertical line for each percentile contained in the list “perc_func”. To better highlight the percentile lines, we use the color red.
# Plotting
plt.hist(dist, 50)
for value in perc_func:
    plt.axvline(value, color='r')
plt.show()
The final result is displayed in Figure 1. As you can see, the 50th percentile is located right in the middle of the distribution, while the 95th percentile is the last line and marks the value below which 95% of the sample population falls.
Figure 1: Representation of the normal distribution used in the example, with the vertical red lines corresponding (from left to right) to the 5th, 25th, 50th, 75th and 95th percentiles.
In this article we learnt about percentiles, what they are, what they represent and how they can be used to describe a portion of a sample distribution. From their statistical definition, we developed a Python function for calculating the percentiles of a sample distribution.
After that, we explored the NumPy function .percentile(), which calculates the percentiles of a sample distribution quickly and efficiently.
We then compared the results of the two methods and checked how closely they agree.
Finally, we plotted the percentiles we had calculated on the sample distribution, in order to better understand their actual meaning.