Pandas qcut() - A Simple Guide with Video

In this tutorial, we learn about the Pandas function qcut(). This function performs quantile-based binning: the bins may differ in width, but each one holds (as nearly as possible) the same number of samples.

Here are the parameters from the official documentation:

Parameter	Type	Description
`x`	1d `ndarray` or Series
`q`	`int` or list of float values	Number of quantiles. Alternately: array of quantiles.
`labels`	array or `False`, default: `None`	Used as the labels for the resulting bins. Must be of the same length as the resulting bins. If False: returns only integer indicators of the bins. If True: raises an error.
`retbins`	`bool`, optional	Whether to return the bins/labels.
`precision`	`int`, optional	The precision at which to store and display the bin labels.
`duplicates`	`{default 'raise', 'drop'}`, optional	If the bin edges are not unique: raise `ValueError` or drop the non-uniques.
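
The `duplicates` parameter matters when the data contains many ties. Here is a minimal sketch, assuming a made-up series `tied` that is not part of this tutorial's data: several quartile edges land on the same value, so the default setting raises an error, while `'drop'` merges the repeated edges and returns fewer bins.

```python
import pandas as pd

# Hypothetical data with many tied values (not from the tutorial's data frame).
tied = pd.Series([1, 1, 1, 1, 2, 3])

# With the default duplicates='raise', repeated bin edges cause a ValueError.
try:
    pd.qcut(tied, q=4)
    raised = False
except ValueError:
    raised = True

# duplicates='drop' removes the repeated edges, so fewer bins are produced.
dropped = pd.qcut(tied, q=4, duplicates='drop')
n_bins = len(dropped.cat.categories)

print(raised)   # True
print(n_bins)   # 2
```

Note that with `'drop'` we asked for four bins but only get two, so it is worth checking the number of categories afterwards.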

Returns	Type	Description
`out`	`Categorical` or `Series` or array of integers if labels is set to `False`	The return type depends on the input: a Series of type `Category` if input is a `Series`, else `Categorical`. Bins are represented as categories when categorical data is returned.
`bins`	`ndarray` of floats	Only if `retbins` is set to `True`.
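
To make the return types concrete, the following sketch calls qcut() on a Series, on a NumPy array, and with `labels=False`. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

scores = [1, 6, 11, 2, 9, 16, 5, 2, 19]

# A Series goes in, a Series of dtype category comes out.
from_series = pd.qcut(pd.Series(scores), q=3)

# Any other array-like input yields a Categorical.
from_array = pd.qcut(np.array(scores), q=3)

# labels=False returns plain integer bin indicators instead of intervals.
codes = pd.qcut(np.array(scores), q=3, labels=False)

print(type(from_series).__name__)  # Series
print(type(from_array).__name__)   # Categorical
print(codes.tolist())              # [0, 1, 2, 0, 1, 2, 1, 0, 2]
```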

Basic Example

Let’s create a data frame that we will be using throughout the tutorial:

import pandas as pd

df = pd.DataFrame({'Competitor':['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
                    'Score':[1,6,11,2,9,16,5,2,19]})
print(df)

	Competitor	Score
0	Alice	1
1	Mary	6
2	John	11
3	Ann	2
4	Bob	9
5	Jane	16
6	Tom	5
7	Vincent	2
8	Ella	19

We import the Pandas library and create a data frame, which we assign to the variable “df”. The data frame lists several competitors and the score each of them reached.

Now, we apply the qcut() function:

pd.qcut(x = df['Score'], q = 3)

0	(0.999, 4.0]
1	(4.0, 9.667]
2	(9.667, 19.0]
3	(0.999, 4.0]
4	(4.0, 9.667]
5	(9.667, 19.0]
6	(4.0, 9.667]
7	(0.999, 4.0]
8	(9.667, 19.0]

Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

We pass “df['Score']” as the “x” parameter to state that this is the column we want to bin. The second argument, “3”, goes to the “q” parameter and sets the number of quantiles.

The output assigns each score to an interval. There are a few things to observe here.

First, at the bottom of the output we can see the intervals in order (“(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]”). Each interval starts with a parenthesis and ends with a square bracket. That means the left value is not included in the interval, but the right one is. For example, “0.999” is not included, whereas “4.0” is included.
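
The open and closed ends can also be checked in code. This is a small sketch that pulls the first interval out of the result and tests membership with Python's `in` operator. The names `scores` and `first` are only illustrative.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# The categories of the result are pandas Interval objects.
first = pd.qcut(scores, q=3).cat.categories[0]

print(first)                 # (0.999, 4.0]
print(first.closed)          # right
print(first.right in first)  # True: the right edge belongs to the interval
print(first.left in first)   # False: the left edge does not
```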

Additionally, we can see that the intervals do not have the same size. The first interval has a size of 3, the second has a size of 5.667 and the third one has a size of 9.333. Why are the intervals these particular sizes?

To answer that, we have to take a look at the number of values in each interval:

pd.qcut(x = df['Score'], q = 3).value_counts()

(0.999, 4.0]	3
(4.0, 9.667]	3
(9.667, 19.0]	3
Name: Score, dtype: int64

We use the value_counts() method to achieve that. We can see that each bin holds the same number of values. By assigning “3” to the “q” parameter, we ask for three intervals that each contain as many values as the others, so the interval widths adjust to the data.
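
One way to see where the edges come from is to compute the quantiles of the column directly. The sketch below uses `Series.quantile()` at 0, 1/3, 2/3, and 1. The label of the lowest bin shows 0.999 rather than 1 because pandas nudges the lowest edge down slightly so that the minimum score is included.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# Three equally populated bins need edges at the 0, 1/3, 2/3 and 1 quantiles.
edges = scores.quantile([0, 1/3, 2/3, 1])

print(edges.round(3).tolist())  # [1.0, 4.0, 9.667, 19.0]
```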

To see more clearly which score belongs to which interval, we create a new column for the data frame:

df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)

	Competitor	Score	Category
0	Alice	1	(0.999, 4.0]
1	Mary	6	(4.0, 9.667]
2	John	11	(9.667, 19.0]
3	Ann	2	(0.999, 4.0]
4	Bob	9	(4.0, 9.667]
5	Jane	16	(9.667, 19.0]
6	Tom	5	(4.0, 9.667]
7	Vincent	2	(0.999, 4.0]
8	Ella	19	(9.667, 19.0]

We create a new column called “Category” which contains the intervals and we add it to the existing data frame.
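
Once the intervals live in a column, they can be used like any other categorical column. As a sketch of one possible use (not something the steps above require), we can group by the new column and summarize the scores per bin. The name `summary` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Competitor': ['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
                   'Score': [1, 6, 11, 2, 9, 16, 5, 2, 19]})
df['Category'] = pd.qcut(x=df['Score'], q=3)

# observed=True keeps only the categories that actually occur in the data.
summary = df.groupby('Category', observed=True)['Score'].agg(['min', 'max', 'count'])
print(summary)
```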

The “q” parameter

In the previous example, we set the “q” parameter equal to “3”. Of course, we can also assign other values here. Apart from an integer value, we can assign this parameter a list:

pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.])

Output:

0	(0.999, 2.0]
1	(2.0, 6.0]
2	(6.0, 11.0]
3	(0.999, 2.0]
4	(6.0, 11.0]
5	(11.0, 19.0]
6	(2.0, 6.0]
7	(0.999, 2.0]
8	(11.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 2.0] < (2.0, 6.0] < (6.0, 11.0] < (11.0, 19.0]]

This way, we directly determine what share of the values falls into each interval. For example, the first interval (0.999, 2.0] contains the lowest 25% of the score values. Since each interval here spans 25% of the data, we should get an equal number of values in each interval.

Let’s see if that’s the case:

pd.qcut(x = df['Score'], q = [0, .25, .5, .75, 1.]).value_counts()

Output:

(0.999, 2.0]	3
(2.0, 6.0]	2
(6.0, 11.0]	2
(11.0, 19.0]	2
Name: Score, dtype: int64

We make use of the value_counts() method again. As we can see, the first interval contains one value more than the others. That’s because we have nine scores in total, and nine cannot be divided evenly by four. Consequently, the intervals cannot all hold the same number of values.

The distance between the quantiles in the array does not have to be even:

pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.])

Output:

0	(0.999, 6.0]
1	(0.999, 6.0]
2	(10.2, 15.0]
3	(0.999, 6.0]
4	(6.0, 10.2]
5	(15.0, 19.0]
6	(0.999, 6.0]
7	(0.999, 6.0]
8	(15.0, 19.0]
Name: Score, dtype: category
Categories (4, interval[float64, right]): [(0.999, 6.0] < (6.0, 10.2] < (10.2, 15.0] < (15.0, 19.0]]

The first interval spans the lower 50% of the data, far more than any of the others. Thus, the values are not evenly distributed across the intervals:

pd.qcut(x = df['Score'], q = [0, .5, .7, .85, 1.]).value_counts()

Output:

(0.999, 6.0]	5
(15.0, 19.0]	2
(6.0, 10.2]	1
(10.2, 15.0]	1
Name: Score, dtype: int64

As we can observe, the first interval contains the most score values.
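
To compare the result with the quantile list we passed in, we can express the counts as shares of the whole data set. This is a small sketch; the names `counts` and `shares` are only illustrative. With only nine scores, the shares can only roughly match the requested 50%, 20%, 15%, and 15%.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')
binned = pd.qcut(x=scores, q=[0, .5, .7, .85, 1.])

# sort_index() lists the bins from lowest to highest instead of by frequency.
counts = binned.value_counts().sort_index()
shares = counts / counts.sum()

print(counts.tolist())           # [5, 1, 1, 2]
print(shares.round(2).tolist())  # [0.56, 0.11, 0.11, 0.22]
```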

Determine the Interval Precision

So far, the intervals we created all had the same default precision:

pd.qcut(x = df['Score'], q = 3)

Output:

0	(0.999, 4.0]
1	(4.0, 9.667]
2	(9.667, 19.0]
3	(0.999, 4.0]
4	(4.0, 9.667]
5	(9.667, 19.0]
6	(4.0, 9.667]
7	(0.999, 4.0]
8	(9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

As we can see, the non-integer edges are shown with three decimal places, while the whole-number edges only show “.0”.

We can change that with the “precision” parameter. It expects an integer that determines how many decimal places are used when storing and displaying the bin labels.

Let’s assign “5” here to get five decimal places:

pd.qcut(x = df['Score'], q = 3, precision=5)

Output:

0	(0.99999, 4.0]
1	(4.0, 9.66667]
2	(9.66667, 19.0]
3	(0.99999, 4.0]
4	(4.0, 9.66667]
5	(9.66667, 19.0]
6	(4.0, 9.66667]
7	(0.99999, 4.0]
8	(9.66667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.99999, 4.0] < (4.0, 9.66667] < (9.66667, 19.0]]

In this manner, we create more precise intervals. How precise we should create them depends on the use case.
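
It may help to know that the precision affects how the interval labels are written, not which bin a score lands in. The following sketch compares the default precision with `precision=5` and checks that every score keeps its bin. The names `default`, `fine`, and `same_bins` are only illustrative.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

default = pd.qcut(x=scores, q=3)
fine = pd.qcut(x=scores, q=3, precision=5)

print(default.cat.categories[1])  # (4.0, 9.667]
print(fine.cat.categories[1])     # (4.0, 9.66667]

# The integer codes say which bin each score belongs to.
same_bins = (default.cat.codes == fine.cat.codes).all()
print(same_bins)  # True
```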

Print out the bins

If we also want to get back the bin edges that were computed, we set the “retbins” parameter to “True”:

pd.qcut(x = df['Score'], q = 3, retbins=True)

Output:

0	(0.999, 4.0]
1	(4.0, 9.667]
2	(9.667, 19.0]
3	(0.999, 4.0]
4	(4.0, 9.667]
5	(9.667, 19.0]
6	(4.0, 9.667]
7	(0.999, 4.0]
8	(9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]
array([1., 4., 9.66666667, 19.])

With “retbins=True”, the function returns a tuple: the binned data as before, plus an array holding the bin edges. That array is the only difference from the earlier output.

This can be useful especially when we assign the “q” parameter an integer as we did here instead of a list.
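
Because the result is a tuple, we can unpack it into two variables. One possible use of the edges, sketched here under the assumption that we later receive some new scores (the `new_scores` values are made up), is to bin that new data with exactly the same boundaries using cut():

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

# Unpack the tuple: the binned data and the array of bin edges.
binned, edges = pd.qcut(x=scores, q=3, retbins=True)
print(edges)

# Hypothetical new scores, binned with the edges learned above.
new_scores = pd.Series([3, 10, 18])
new_codes = pd.cut(new_scores, bins=edges, labels=False, include_lowest=True)

print(new_codes.tolist())  # [0, 2, 2]
```

We pass `include_lowest=True` so that a new score equal to the lowest edge would still fall into the first bin.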

Define labels for the categories

We already saw how to add a new column to our data frame to see which score belongs to which interval:

df['Category'] = pd.qcut(x = df['Score'], q = 3)
print(df)

Output:

	Competitor	Score	Category
0	Alice	1	(0.999, 4.0]
1	Mary	6	(4.0, 9.667]
2	John	11	(9.667, 19.0]
3	Ann	2	(0.999, 4.0]
4	Bob	9	(4.0, 9.667]
5	Jane	16	(9.667, 19.0]
6	Tom	5	(4.0, 9.667]
7	Vincent	2	(0.999, 4.0]
8	Ella	19	(9.667, 19.0]

This way, we get a great overview of our data. However, assigning the intervals to the scores can be a bit confusing as we do not clearly see what a good score is and what isn’t.

This is where the “labels” parameter comes into play. We can give each interval a label to categorize our data:

df['Category'] = pd.qcut(x = df['Score'], q = 3, labels=['bad', 'good', 'exceptional'])
print(df)

Output:

	Competitor	Score	Category
0	Alice	1	bad
1	Mary	6	good
2	John	11	exceptional
3	Ann	2	bad
4	Bob	9	good
5	Jane	16	exceptional
6	Tom	5	good
7	Vincent	2	bad
8	Ella	19	exceptional

The “labels” parameter expects a list of labels, one per interval. We choose the labels "bad", "good", and "exceptional". The lowest interval is assigned the label "bad", the middle interval "good", and the highest interval "exceptional".

Thus, we can categorize our data in a more user-friendly way.
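
The labelled result is an ordered categorical, so the labels keep the order of the intervals. As a sketch of what that allows, we can filter for every competitor rated "good" or better with a plain comparison. The name `at_least_good` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Competitor': ['Alice', 'Mary', 'John', 'Ann', 'Bob', 'Jane', 'Tom', 'Vincent', 'Ella'],
                   'Score': [1, 6, 11, 2, 9, 16, 5, 2, 19]})
df['Category'] = pd.qcut(x=df['Score'], q=3, labels=['bad', 'good', 'exceptional'])

print(df['Category'].cat.ordered)  # True

# Ordered categories support comparisons such as >= against one of the labels.
at_least_good = df.loc[df['Category'] >= 'good', 'Competitor'].tolist()
print(at_least_good)  # ['Mary', 'John', 'Bob', 'Jane', 'Tom', 'Ella']
```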

Comparison with the cut() function

Chances are when you work with the qcut() function, you have come across the cut() function as well.

In this final section, we will see the difference between the qcut() and the cut() function.

Let’s refer to our initial example of the qcut() function where we assigned the “q” parameter the value “3”:

pd.qcut(x = df['Score'], q = 3)

Output:

0	(0.999, 4.0]
1	(4.0, 9.667]
2	(9.667, 19.0]
3	(0.999, 4.0]
4	(4.0, 9.667]
5	(9.667, 19.0]
6	(4.0, 9.667]
7	(0.999, 4.0]
8	(9.667, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.999, 4.0] < (4.0, 9.667] < (9.667, 19.0]]

We created three quantile-based intervals so that each one contains the same number of score values:

pd.qcut(x = df['Score'], q = 3).value_counts()

Output:

(0.999, 4.0]	3
(4.0, 9.667]	3
(9.667, 19.0]	3
Name: Score, dtype: int64

Now we do essentially the same with the cut() function:

pd.cut(x = df['Score'], bins = 3)

Output:

0	(0.982, 7.0]
1	(0.982, 7.0]
2	(7.0, 13.0]
3	(0.982, 7.0]
4	(7.0, 13.0]
5	(13.0, 19.0]
6	(0.982, 7.0]
7	(0.982, 7.0]
8	(13.0, 19.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.982, 7.0] < (7.0, 13.0] < (13.0, 19.0]]

The cut() function does not provide a “q” parameter. Instead, it has the “bins” parameter, to which we also assign the value “3” to create three bins.

As we can see, the intervals differ from the ones qcut() produced. Unlike the qcut() intervals, these all have the same width: each one is six units long (the lowest edge is only nudged down slightly, to 0.982, so that the minimum score is included).

However, the number of values in each interval is different:

pd.cut(x = df['Score'], bins = 3).value_counts()

Output:

(0.982, 7.0]	5
(7.0, 13.0]	2
(13.0, 19.0]	2
Name: Score, dtype: int64

Thus, qcut() creates intervals that are not equally wide but all contain the same number of values, whereas cut() creates equal-width intervals that don’t necessarily contain the same number of values.
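
The difference can be put side by side in a few lines. This sketch computes the interval widths and the number of scores per interval for both functions. The helper variables are only illustrative.

```python
import pandas as pd

scores = pd.Series([1, 6, 11, 2, 9, 16, 5, 2, 19], name='Score')

q_binned = pd.qcut(x=scores, q=3)   # quantile-based bins
c_binned = pd.cut(x=scores, bins=3) # equal-width bins

# Interval widths (right edge minus left edge), rounded for display.
q_widths = [round(w, 3) for w in q_binned.cat.categories.length]
c_widths = [round(w, 3) for w in c_binned.cat.categories.length]

# Number of scores per interval, listed from lowest to highest interval.
q_counts = q_binned.value_counts().sort_index().tolist()
c_counts = c_binned.value_counts().sort_index().tolist()

print(q_widths, q_counts)  # [3.001, 5.667, 9.333] [3, 3, 3]
print(c_widths, c_counts)  # [6.018, 6.0, 6.0] [5, 2, 2]
```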

Summary

In this tutorial, we learned about the qcut() function. We saw how to create intervals in several ways, how to set the interval precision, how to label our categories, and how qcut() differs from the cut() function.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!