In this tutorial, we learn about the Pandas function qcut(). This function performs quantile-based binning: it creates bins of unequal width that each contain (as nearly as possible) the same number of samples.
Here are the parameters from the official documentation:
Parameter | Type | Description |
x | 1d ndarray or Series | |
q | int or list of float values | Number of quantiles. Alternately: array of quantiles. |
labels | array or False, default: None | Used as the labels for the resulting bins. Must be of the same length as the resulting bins. If False: returns only integer indicators of the bins. If True: raises an error. |
retbins | bool, optional | Whether to return the bins/labels. |
precision | int, optional | The precision at which to store and display the bin labels. |
duplicates | {default 'raise', 'drop'}, optional | If the bin edges are not unique: raise ValueError or drop the non-uniques. |
Returns | Type | Description |
out | Categorical or Series or array of integers if labels is set to False | The return type depends on the input: a Series of type Category if the input is a Series, else Categorical. Bins are represented as categories when categorical data is returned. |
bins | ndarray of floats | Only if retbins is set to True. |
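The “duplicates” parameter is not used in the rest of this tutorial, so here is a minimal sketch of what it does. The sample values are made up for illustration and are not the tutorial's data set: because the value 1 repeats so often, several quantile edges coincide.

```python
import pandas as pd

# Hypothetical data with many repeated values (not the tutorial's data set).
values = pd.Series([1, 1, 1, 1, 2, 3])

# With the default duplicates='raise', identical bin edges cause a ValueError.
try:
    pd.qcut(values, q=4)
    raised = False
except ValueError:
    raised = True

# With duplicates='drop', the repeated edges are merged, leaving fewer bins.
binned = pd.qcut(values, q=4, duplicates='drop')
n_bins = len(binned.cat.categories)
print(raised, n_bins)
```

Note that dropping edges means we get fewer bins than requested, so any labels we pass must match the reduced number of bins.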
Basic Example
Let’s create a data frame that we will be using throughout the tutorial:
import pandas as pd

df = pd.DataFrame({
    'Competitor': ['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
    'Score': [1, 6, 11, 2, 9, 16, 5, 2, 19]
})
print(df)
 | Competitor | Score |
0 | Alice | 1 |
1 | Mary | 6 |
2 | John | 11 |
3 | Ann | 2 |
4 | Bob | 9 |
5 | Jane | 16 |
6 | Tom | 5 |
7 | Vincent | 2 |
8 | Ella | 19 |
We import the Pandas library and then create a Pandas data frame, which we assign to the variable “df”. The resulting data frame lists several competitors and the score that each competitor reached.
Now, we apply the qcut() function:
pd.qcut(x = df['Score'], q = 3)
0 | (0.999, 4.0] |
1 | (4.0, 9.667] |
2 | (9.667, 19.0] |
3 | (0.999, 4.0] |
4 | (4.0, 9.667] |
5 | (9.667, 19.0] |
6 | (4.0, 9.667] |
7 | (0.999, 4.0] |
8 | (9.667, 19.0] |
Name: Score, dtype: category |
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]] |
Inside the function, we pass “df['Score']” as the value for the parameter “x” to state that this is the column we want to calculate the bins on. The second argument is “3”, which we assign to the “q” parameter. This is the number of quantiles.
The output assigns each score to an interval. There are a few things to observe here.
First, we can see the intervals in order at the bottom of the output (“(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]”). Each interval starts with a parenthesis and ends with a square bracket. That means that the left value is not included in the interval, but the right one is. For example, “0.999” is not included, whereas “4.0” is included.
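To make the bracket notation concrete, here is a small sketch that rebuilds the first interval by hand with pd.Interval and tests which values it contains. The interval is typed in manually from the output above rather than taken from the qcut() result.

```python
import pandas as pd

# Rebuild the first interval from the output above by hand.
first = pd.Interval(left=0.999, right=4.0, closed='right')

in_left_edge = 0.999 in first   # the left edge is excluded
in_right_edge = 4.0 in first    # the right edge is included
in_lowest_score = 1 in first    # the lowest score, 1, falls inside

print(in_left_edge, in_right_edge, in_lowest_score)
```

This also hints at why the first edge is shown as 0.999 rather than 1: the left edge sits slightly below the minimum so that the lowest score still falls inside the first interval.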
Additionally, we can see that the intervals do not have the same size. The first interval has a size of 3, the second has a size of 5.667 and the third one has a size of 9.333. Why are the intervals these particular sizes?
To answer that, we have to take a look at the number of values in each interval:
pd.qcut(x = df['Score'], q = 3).value_counts()
(0.999, 4.0] | 3 |
(4.0, 9.667] | 3 |
(9.667, 19.0] | 3 |
Name: Score, dtype: int64 |
We use the value_counts() method to achieve that. We can see that each bin contains the same number of values. By assigning “3” to the “q” parameter, we state that we want three intervals, each containing as many values as the others. The interval sizes adjust to make that possible.
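To see where the edges 4.0 and 9.667 come from, we can compute the quantiles of the scores ourselves. This is only a sketch using Series.quantile() with its default linear interpolation; it reproduces the edges shown above but is not meant as a description of how qcut() is implemented internally.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

# The bin edges for q=3 sit at the 0, 1/3, 2/3 and 1 quantiles of the scores.
edges = scores.quantile([0, 1/3, 2/3, 1]).round(3).tolist()
print(edges)
```

One third of the nine scores lies at or below 4.0, and another third lies between 4.0 and 9.667, which is why each bin ends up with three values.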
To make it easier to see which interval each score belongs to, we add a new column to the data frame:
df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)
 | Competitor | Score | Category |
0 | Alice | 1 | (0.999, 4.0] |
1 | Mary | 6 | (4.0, 9.667] |
2 | John | 11 | (9.667, 19.0] |
3 | Ann | 2 | (0.999, 4.0] |
4 | Bob | 9 | (4.0, 9.667] |
5 | Jane | 16 | (9.667, 19.0] |
6 | Tom | 5 | (4.0, 9.667] |
7 | Vincent | 2 | (0.999, 4.0] |
8 | Ella | 19 | (9.667, 19.0] |
We create a new column called “Category”, which contains the intervals, and add it to the existing data frame.
The “q” parameter
In the previous example, we set the “q” parameter equal to “3”. Of course, we can also assign other values here. Apart from an integer value, we can assign this parameter a list of quantiles:
pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.])
Output:
0    (0.999, 2.0]
1    (2.0, 6.0]
2    (6.0, 11.0]
3    (0.999, 2.0]
4    (6.0, 11.0]
5    (11.0, 19.0]
6    (2.0, 6.0]
7    (0.999, 2.0]
8    (11.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 2.0] < (2.0, 6.0] < (6.0, 11.0] < (11.0, 19.0]]
This way, we directly determine what percentage of the values falls into each interval. For example, the first interval (0.999, 2.0] covers the lowest 25% of the score values. Since the intervals we created here each span 25%, we should get roughly the same number of values in each interval.
Let’s see if that’s the case:
pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.]).value_counts()
Output:
(0.999, 2.0]    3
(2.0, 6.0]      2
(6.0, 11.0]     2
(11.0, 19.0]    2
Name: Score, dtype: int64
We make use of the value_counts() method again. As we can see, the first interval contains one value more than the others. That’s because we have nine scores in total and nine is not evenly divisible by four. Consequently, the number of values per interval cannot be the same in all intervals.
The distance between the quantiles in the array does not have to be even:
pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.])
Output:
0    (0.999, 6.0]
1    (0.999, 6.0]
2    (10.2, 15.0]
3    (0.999, 6.0]
4    (6.0, 10.2]
5    (15.0, 19.0]
6    (0.999, 6.0]
7    (0.999, 6.0]
8    (15.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 6.0] < (6.0, 10.2] < (10.2, 15.0] < (15.0, 19.0]]
The first interval covers 50% of the values and is therefore much larger than the others. Thus, the values are not evenly distributed across the intervals:
pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.]).value_counts()
Output:
(0.999, 6.0]    5
(15.0, 19.0]    2
(6.0, 10.2]     1
(10.2, 15.0]    1
Name: Score, dtype: int64
As we can observe, the first interval contains the most score values.
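Note that the output above is ordered by count, not by interval. If we prefer to read the counts from the lowest to the highest interval, one option is to sort by the index afterwards. The variable name “scores” below is just a stand-in for the “Score” column of our data frame.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

# value_counts() orders by frequency; sort_index() restores the interval order.
counts = pd.qcut(scores, q=[0, .5, .7, .85, 1.]).value_counts().sort_index()
print(counts.tolist())
```

The counts now follow the order of the intervals: five values in the first bin, one each in the second and third, and two in the last.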
Determine the Interval Precision
So far, the intervals we created all had a specific precision:
pd.qcut(x = df['Score'], q = 3)
Output:
0    (0.999, 4.0]
1    (4.0, 9.667]
2    (9.667, 19.0]
3    (0.999, 4.0]
4    (4.0, 9.667]
5    (9.667, 19.0]
6    (4.0, 9.667]
7    (0.999, 4.0]
8    (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]
As we can see, the edges are shown with up to three decimal places, while whole-number edges are displayed with only “.0” as the decimal place.
We can change that precision using the “precision” parameter. This parameter expects an integer value that determines how many decimal places we want to get.
Let’s assign “5” here to get five decimal places:
pd.qcut(x = df['Score'], q = 3, precision=5)
Output:
0    (0.99999, 4.0]
1    (4.0, 9.66667]
2    (9.66667, 19.0]
3    (0.99999, 4.0]
4    (4.0, 9.66667]
5    (9.66667, 19.0]
6    (4.0, 9.66667]
7    (0.99999, 4.0]
8    (9.66667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.99999, 4.0] < (4.0, 9.66667] < (9.66667, 19.0]]
In this manner, the interval edges are stored and displayed with more decimal places. How much precision we need depends on the use case.
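A quick way to check what the precision setting changes is to compare the bin each score lands in under two different settings. The sketch below compares the integer category codes rather than the labels; “scores” is again a stand-in for the “Score” column.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

default_bins = pd.qcut(scores, q=3)            # default precision
fine_bins = pd.qcut(scores, q=3, precision=5)  # five decimal places

# Compare the integer bin codes rather than the differently formatted labels.
same_assignment = bool((default_bins.cat.codes == fine_bins.cat.codes).all())
print(same_assignment)
```

For this data set, every score stays in the same bin; only the way the edges are written differs.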
Print out the bins
If we want to print out the bins that we created, we apply the “retbins” parameter and set it to “True”:
pd.qcut(x = df['Score'], q = 3, retbins=True)
Output:
(0    (0.999, 4.0]
 1    (4.0, 9.667]
 2    (9.667, 19.0]
 3    (0.999, 4.0]
 4    (4.0, 9.667]
 5    (9.667, 19.0]
 6    (4.0, 9.667]
 7    (0.999, 4.0]
 8    (9.667, 19.0]
 Name: Score, dtype: category
 Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]],
 array([1., 4., 9.66666667, 19.]))
The only difference compared to the call without the “retbins” parameter is the additional “array” line at the bottom of the output. The function now returns a tuple: the binned data first, followed by an array containing the bin edges.
This can be especially useful when we assign the “q” parameter an integer, as we did here, instead of a list, because the edges are then not known in advance.
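One possible use of the returned edges is to bin further data with exactly the same boundaries. The following sketch unpacks the tuple and passes the edges to pd.cut(); the “new_scores” values are hypothetical and only serve as an illustration.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

# With retbins=True, qcut() returns a tuple: (binned data, bin edges).
categories, edges = pd.qcut(scores, q=3, retbins=True)

# Hypothetical new scores, binned with the same edges via pd.cut().
new_scores = pd.Series([3, 10, 18])
new_codes = pd.cut(new_scores, bins=edges, labels=False, include_lowest=True)
print(new_codes.tolist())
```

Here, include_lowest=True makes sure that a new score equal to the lowest edge would still be assigned to the first bin. Values outside the range of the edges would come back as missing.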
Define labels for the categories
We already saw how to add a new column to our data frame to see which score belongs to which interval:
df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)
Output:
  Competitor  Score       Category
0      Alice      1   (0.999, 4.0]
1       Mary      6   (4.0, 9.667]
2       John     11  (9.667, 19.0]
3        Ann      2   (0.999, 4.0]
4        Bob      9   (4.0, 9.667]
5       Jane     16  (9.667, 19.0]
6        Tom      5   (4.0, 9.667]
7    Vincent      2   (0.999, 4.0]
8       Ella     19  (9.667, 19.0]
This way, we get a great overview of our data. However, assigning the intervals to the scores can be a bit confusing as we do not clearly see what a good score is and what isn’t.
This is where the “labels” parameter comes into play. We can give each interval a label to categorize our data:
df['Category'] = pd.qcut(x = df['Score'], q = 3, labels=['bad', 'good', 'exceptional'])
print(df)
Output:
  Competitor  Score     Category
0      Alice      1          bad
1       Mary      6         good
2       John     11  exceptional
3        Ann      2          bad
4        Bob      9         good
5       Jane     16  exceptional
6        Tom      5         good
7    Vincent      2          bad
8       Ella     19  exceptional
The “labels” parameter expects a list of labels with one entry per interval. We choose the labels "bad", "good", and "exceptional". The labels are assigned in order: the lowest interval gets the label "bad", the middle interval gets the label "good", and the highest interval gets the label "exceptional".
Thus, we can categorize our data in a more user-friendly way.
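The parameter table at the top also mentions that “labels” can be set to False. As a small sketch of that option, the call below returns plain integer indicators instead of intervals or names; “scores” stands in for the “Score” column.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

# labels=False returns the position of each score's bin (0 = lowest bin).
codes = pd.qcut(scores, q=3, labels=False)
print(codes.tolist())
```

The integers follow the same order as the labels above, so 0 corresponds to "bad", 1 to "good", and 2 to "exceptional".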
Comparison with the cut() function
Chances are that when you work with the qcut() function, you have come across the cut() function as well.
In this final section, we will see the difference between the qcut() and the cut() function.
Let’s refer to our initial example of the qcut() function where we assigned the “q” parameter the value “3”:
pd.qcut(x = df['Score'], q = 3)
Output:
0    (0.999, 4.0]
1    (4.0, 9.667]
2    (9.667, 19.0]
3    (0.999, 4.0]
4    (4.0, 9.667]
5    (9.667, 19.0]
6    (4.0, 9.667]
7    (0.999, 4.0]
8    (9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]
We created three quantile-based bins so that each interval contains the same number of score values:
pd.qcut(x = df['Score'], q = 3).value_counts()
Output:
(0.999, 4.0]     3
(4.0, 9.667]     3
(9.667, 19.0]    3
Name: Score, dtype: int64
Now we do essentially the same with the cut() function:
pd.cut(x = df['Score'], bins = 3)
Output:
0    (0.982, 7.0]
1    (0.982, 7.0]
2    (7.0, 13.0]
3    (0.982, 7.0]
4    (7.0, 13.0]
5    (13.0, 19.0]
6    (0.982, 7.0]
7    (0.982, 7.0]
8    (13.0, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.982, 7.0] < (7.0, 13.0] < (13.0, 19.0]]
The cut() function does not provide a “q” parameter. Instead, it has the “bins” parameter, to which we also assign the value “3” to create three bins.
As we can see, the intervals differ from the ones produced by the qcut() function. Unlike the qcut() intervals, these all have the same size: each one is about six units long.
However, the number of values in each interval is different:
pd.cut(x = df['Score'], bins = 3).value_counts()
Output:
(0.982, 7.0]     5
(7.0, 13.0]      2
(13.0, 19.0]     2
Name: Score, dtype: int64
Thus, qcut() creates intervals that are not equally long but that all contain the same number of values, whereas the cut() function creates equal-sized intervals that don’t necessarily have the same number of values in them.
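The difference is easiest to see side by side. The sketch below counts the values per bin for both functions and puts the results into one small table; it uses labels=False so that the bins are simply numbered 0, 1 and 2, and the column names “qcut” and “cut” are chosen here only for illustration.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19])

# Count how many scores land in each numbered bin for both functions.
comparison = pd.DataFrame({
    'qcut': pd.qcut(scores, q=3, labels=False).value_counts().sort_index(),
    'cut': pd.cut(scores, bins=3, labels=False).value_counts().sort_index(),
})
print(comparison)
```

The “qcut” column shows three values in every bin, while the “cut” column shows five values in the first bin and two in each of the others.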
Summary
In this tutorial, we learned about the qcut() function. We saw how to create intervals in several ways, how to set the precision of the intervals, and how to label our categories, and we looked at how qcut() differs from the cut() function.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!