Pandas cut() – A Simple Guide with Video

In this tutorial, we will learn about the Pandas cut() function. This function sorts the values of a one-dimensional array into separate intervals, also called bins. It is mainly used to turn numeric values into categories.

Syntax and Documentation

Here are the parameters from the official documentation:

Parameters:

  • x (array-like): The one-dimensional input array to be binned.
  • bins (int, sequence of scalars, or IntervalIndex): The criteria to bin by.
      – int: the number of equal-width bins in the range of x. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
      – sequence of scalars: the bin edges, allowing for non-uniform width. The range of x is not extended.
      – IntervalIndex: the exact bins to be used. The intervals must be non-overlapping.
  • right (bool, default True): Whether the bins include the rightmost edge. If right == True (default), bins [1, 2, 3, 4] indicate the intervals (1,2], (2,3], (3,4]. Ignored when bins is an IntervalIndex.
  • labels (array or False, default None): Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). Ignored when bins is an IntervalIndex.
  • retbins (bool, default False): Whether to return the bins or not. Useful if bins is given as a scalar.
  • precision (int, default 3): Precision at which to store and display the bin labels.
  • include_lowest (bool, default False): Whether the first interval should be left-inclusive or not.
  • duplicates ({default ‘raise’, ‘drop’}, optional): If bin edges are not unique, raise a ValueError or drop the non-unique edges.
  • ordered (bool, default True): Whether the labels are ordered or not. Applies to the returned types Categorical and Series (with Categorical dtype).
      – If True, the resulting categorical will be ordered.
      – If False, the resulting categorical will be unordered and labels must be provided.

Returns:

  • out (Categorical, Series, or ndarray): An array-like object representing the respective bin for each value of x. The type depends on the value of labels.
      – None (default): returns a Series for Series x or a Categorical for all other inputs. The values stored within are of Interval dtype.
      – sequence of scalars: returns a Series for Series x or a Categorical for all other inputs. The values stored within are of whatever type the sequence contains.
      – False: returns an ndarray of integers.
  • bins (numpy.ndarray or IntervalIndex): The computed or specified bins. Only returned when retbins=True.
      – For scalar or sequence bins, this is an ndarray with the computed bins.
      – If duplicates='drop' is set, non-unique bin edges are dropped.
      – For IntervalIndex bins, this is equal to bins.
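The return types listed above can be checked with a short sketch. The variable names and the sample scores are chosen only for illustration; they reuse the values from the example that follows in this tutorial.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10], name="Score")

# Default (labels=None): a Series with categorical dtype holding intervals.
as_intervals = pd.cut(scores, bins=3)
print(as_intervals.dtype)

# labels=False: integer bin indicators instead of intervals.
as_codes = pd.cut(scores, bins=3, labels=False)
print(as_codes.tolist())

# retbins=True: a tuple of (binned values, bin edges as an ndarray).
binned, edges = pd.cut(scores, bins=3, retbins=True)
print(type(edges).__name__, len(edges))
```

With three bins there are four edges, which is why the edges array has one more entry than the number of bins.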

Basic Example

To get to know the cut() function, we will start with an introductory example, on which we will build in the following sections:

import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 
                             'John', 'Jane', 'Bob'], 
                   'Score': [1,6,4,8,5,10]})
print(df)
   Diver  Score
0   Dave      1
1  Alice      6
2   Mary      4
3   John      8
4   Jane      5
5    Bob     10

First, we import the Pandas library. Then we create a Pandas data frame with two columns: a “Diver” column with string values and a “Score” column with integer values.

The printed data frame shows a dataset with six different divers and their respective scores.

Now, we apply the cut() function:

pd.cut(x = df['Score'], bins = 3)
0    (0.991, 4.0]
1      (4.0, 7.0]
2    (0.991, 4.0]
3     (7.0, 10.0]
4      (4.0, 7.0]
5     (7.0, 10.0]
Name: Score, dtype: category
Categories (3, interval[float64, right]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]

The cut() function provides lots of parameters. Two of them are required:

  • The first one is the parameter “x”, which expects the one-dimensional, array-like data that we want to bin. In the example, we pass the “Score” column of our data frame.
  • The second required parameter is “bins”. It expects either the number of bins as an integer or a list of bin edges. In the example, we assign “3” to the “bins” parameter to state that we want to create three equal-width intervals.

The output shows the interval for each score. For example, Alice’s score is “6” and is assigned the interval “(4.0, 7.0]” because 6 lies within this range.

But how were these intervals calculated? By assigning the “bins” parameter the value “3” we state that we want three equal-sized intervals. The intervals are calculated like this: we take the maximum value of the scores (which is “10”) and the minimum value (which is “1”). We subtract these values (10 – 1 = 9) and divide that by the number of intervals which we defined as “3” (9 / 3 = 3).

In short: (maximum value – minimum value) / number of intervals.

That way, we get the size of an interval which is 3 in our example. We already looked at Alice’s score which is 6 and lies in the interval “(4.0, 7.0]”. We can see that the difference between 7.0 and 4.0 is indeed 3.

But why does the lowest interval start at “0.991” instead of “1.0”, although the lowest value is 1? The reason is the meaning of the brackets. The intervals here are half-open: “(0.991, 4.0]” contains all values greater than 0.991 and less than or equal to 4.0. If the interval were “(1.0, 4.0]”, the value “1” would not be included. To avoid that, cut() extends the range by 0.1% of its width, here 0.001 * 9 = 0.009, which moves the lowest edge from 1 to 0.991.
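The calculation can be retraced by hand. The following is only a sketch that mirrors the documented behavior, not the internal Pandas code; it assumes that, with the default right=True, the 0.1% adjustment is applied to the lowest edge, as the output above suggests. The names n_bins, manual_edges, and cut_edges are made up for this illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
n_bins = 3

# Width of each bin: (maximum - minimum) / number of bins.
value_range = scores.max() - scores.min()
width = value_range / n_bins

# Equally spaced edges before any adjustment: 1, 4, 7, 10.
manual_edges = np.linspace(scores.min(), scores.max(), n_bins + 1)

# Assumption: move the lowest edge down by 0.1% of the range
# so that the minimum value falls inside the first interval.
manual_edges[0] -= 0.001 * value_range

# Edges that cut() reports via retbins=True.
_, cut_edges = pd.cut(scores, bins=n_bins, retbins=True)

print(width)
print(manual_edges)
print(cut_edges)
```

Both arrays should contain the same four edges, which supports the explanation of where “0.991” comes from.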

The last line of the output lists the intervals in ascending order, which tells us that the result is an ordered categorical.

To see more clearly which interval belongs to which score, we can store the result in a new column of the data frame:

df['Interval'] = pd.cut(x = df['Score'], bins = 3)
   Diver  Score      Interval
0   Dave      1  (0.991, 4.0]
1  Alice      6    (4.0, 7.0]
2   Mary      4  (0.991, 4.0]
3   John      8   (7.0, 10.0]
4   Jane      5    (4.0, 7.0]
5    Bob     10   (7.0, 10.0]

We applied the cut() function the same way as before. But this time, we assigned it to a new column labeled “Interval”. The outputted data frame now shows all divers, scores, and the respective intervals in a clear way.
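Once the intervals are stored in a column, a typical next step is to count how many rows fall into each interval. This is only a small usage sketch; value_counts() and sort_index() are regular Pandas methods, and the variable name counts is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary',
                             'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=3)

# Count the divers per interval; sort_index() keeps the intervals in bin order.
counts = df['Interval'].value_counts().sort_index()
print(counts)
```

In this dataset, each of the three intervals contains two divers.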

Change the Intervals

In the previous section, we applied the cut() function using three intervals by assigning the “bins” parameter the value “3”.

Let’s now assign the “bins” parameter another value, for example, “5”:

df['Interval'] = pd.cut(x = df['Score'], bins = 5)
   Diver  Score      Interval
0   Dave      1  (0.991, 2.8]
1  Alice      6    (4.6, 6.4]
2   Mary      4    (2.8, 4.6]
3   John      8    (6.4, 8.2]
4   Jane      5    (4.6, 6.4]
5    Bob     10   (8.2, 10.0]

As before, we create an “Interval” column and assign it to the initial data frame to see immediately which score is assigned to which interval.

The only thing we change here is that we set the “bins” parameter equal to “5”. That way, we now have five equal-sized intervals. The length of each interval is calculated as follows:

(maximum value – minimum value) / number of intervals => (10 – 1) / 5 = 1.8

As we can see, each interval indeed has the width 1.8, except for the lowest interval “(0.991, 2.8]”. As in the previous section, its left edge is moved down to “0.991” so that the half-open interval includes the value “1”.
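To check the widths, we can ask cut() for the bin edges and take the differences between neighboring edges. This is a minimal sketch; the names edges and widths are chosen for illustration, and the rounding to three decimals is only there to hide floating-point noise.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# retbins=True also returns the edges that were used for binning.
_, edges = pd.cut(scores, bins=5, retbins=True)

# The difference between neighboring edges is the width of each interval.
widths = [round(float(hi - lo), 3) for lo, hi in zip(edges[:-1], edges[1:])]
print(widths)
```

The first width comes out slightly larger than the others because the lowest edge was moved from 1 to 0.991.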

Apart from an integer value, we can also assign the “bins” parameter a list of scalar values. This way, we determine the interval boundaries directly:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,2,4,6,8,10])
   Diver  Score  Interval
0   Dave      1    (0, 2]
1  Alice      6    (4, 6]
2   Mary      4    (2, 4]
3   John      8    (6, 8]
4   Jane      5    (4, 6]
5    Bob     10   (8, 10]

The list “[0,2,4,6,8,10]” creates the intervals: “(0,2]”, “(2,4]”, “(4,6]”, “(6,8]”, and “(8,10]”.

This way, we specify how many intervals we want to get and how long each interval should be.

In this example, we created intervals that all have the same length. However, this does not have to be the case. We can stipulate the interval lengths in any way we want:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6,10])
   Diver  Score  Interval
0   Dave      1    (0, 4]
1  Alice      6    (5, 6]
2   Mary      4    (0, 4]
3   John      8   (6, 10]
4   Jane      5    (4, 5]
5    Bob     10   (6, 10]

Here, we assigned the “bins” parameter a different list. The resulting intervals do not all have the same length.
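Besides an integer or a list, the “bins” parameter also accepts an IntervalIndex, as listed in the parameter overview. The following sketch assumes the same edges as above and builds the intervals with pd.IntervalIndex.from_breaks(), which creates right-closed intervals by default. The variable names are chosen for illustration.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10], name="Score")

# Right-closed intervals (0, 4], (4, 5], (5, 6], (6, 10] built from the edges.
intervals = pd.IntervalIndex.from_breaks([0, 4, 5, 6, 10])

from_list = pd.cut(scores, bins=[0, 4, 5, 6, 10])
from_index = pd.cut(scores, bins=intervals)

# Position of the bin each score was assigned to, for both variants.
print(from_list.cat.codes.tolist())
print(from_index.cat.codes.tolist())
```

Both variants assign every score to the same bin position, so for this case the list is simply the shorter way to write it.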

If the intervals we define with the “bins” parameter do not cover all values, some values from the data frame will not fall into any interval:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6])
   Diver  Score    Interval
0   Dave      1  (0.0, 4.0]
1  Alice      6  (5.0, 6.0]
2   Mary      4  (0.0, 4.0]
3   John      8         NaN
4   Jane      5  (4.0, 5.0]
5    Bob     10         NaN

The scores “8” and “10” do not lie within any of the given intervals. Pandas handles these cases by assigning “NaN” instead of an interval. When that happens, we know that our intervals do not cover the whole range of the data.
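The uncovered rows can be found by checking the result for missing values. The sketch below also shows one possible way to close the gap, namely an open-ended last bin with infinity as its upper edge. That fix is an assumption about what we want here, not something the example above requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary',
                             'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})

binned = pd.cut(x=df['Score'], bins=[0, 4, 5, 6])

# Divers whose score fell outside every interval.
uncovered = df.loc[binned.isna(), 'Diver'].tolist()
print(uncovered)

# One possible fix: an additional last bin that reaches up to infinity.
covered = pd.cut(x=df['Score'], bins=[0, 4, 5, 6, np.inf])
n_missing = int(covered.isna().sum())
print(n_missing)
```

After adding the open-ended bin, no score is left without an interval.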

Include the Leftmost or the Rightmost Edge

In the examples we have seen so far, the intervals always looked like this: “(x, y]”. That is, the right edge of each interval is included. That’s because the “right” parameter of the cut() function is set to “True” by default.

If we change that parameter and set it to “False”, this is what happens:

df['Interval'] = pd.cut(x = df['Score'], bins = 3, right=False)
   Diver  Score       Interval
0   Dave      1     [1.0, 4.0)
1  Alice      6     [4.0, 7.0)
2   Mary      4     [4.0, 7.0)
3   John      8  [7.0, 10.009)
4   Jane      5     [4.0, 7.0)
5    Bob     10  [7.0, 10.009)

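We set the “bins” parameter to “3” like in the first example, so we get three equal-sized intervals. But now, the intervals are structured like this: “[x, y)”. The left edge of each interval is included and the right edge is not. This also changes the assignment of values that sit exactly on an edge: Mary’s score “4” now belongs to “[4.0, 7.0)” instead of “(0.991, 4.0]”.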

Thus, the lowest interval now looks like this: “[1.0, 4.0)” instead of “(0.991, 4.0]”. The value “1” is included because the left edge is closed.

For the same reason, the highest interval now looks like this: “[7.0, 10.009)”. Its right edge is open, so it is extended slightly beyond 10 to include the value “10”.
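The effect of the “right” parameter is easiest to see when we compare the bin positions and the edges of both variants side by side. This is a small sketch that uses labels=False to get plain integer positions; the variable names are chosen for illustration.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# Default: right edge included, lowest edge moved down.
right_codes, right_edges = pd.cut(scores, bins=3, labels=False, retbins=True)

# right=False: left edge included, highest edge moved up.
left_codes, left_edges = pd.cut(scores, bins=3, labels=False,
                                right=False, retbins=True)

print(right_codes.tolist(), right_edges)
print(left_codes.tolist(), left_edges)
```

Only the score “4” changes its bin position, because it is the only value that lies exactly on an inner edge.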

Label the Intervals

We can label the intervals using the “labels” parameter of the cut() function. This way, we can categorize each score:

df['Interval'] = pd.cut(x = df['Score'], bins = 3, 
                        labels=['bad', 'good', 'exceptional'])
   Diver  Score     Interval
0   Dave      1          bad
1  Alice      6         good
2   Mary      4          bad
3   John      8  exceptional
4   Jane      5         good
5    Bob     10  exceptional

Again, we created three equal-sized intervals. But this time, we labeled each interval. The lowest interval is labeled “bad”, the middle interval “good”, and the highest interval “exceptional”. The number of labels must match the number of intervals.

By doing that, we categorize and evaluate each score.
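Two properties of labeled bins are worth trying out: the result is an ordered categorical by default, so labels can be compared by rank, and a label list of the wrong length is rejected. The sketch below illustrates both; the names rated, at_least_good, and mismatch_error are made up for this example.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
labels = ['bad', 'good', 'exceptional']
rated = pd.cut(scores, bins=3, labels=labels)

# Ordered categorical: 'bad' < 'good' < 'exceptional'.
at_least_good = (rated >= 'good').tolist()
print(at_least_good)

# Two labels for three bins do not fit, so cut() raises a ValueError.
try:
    pd.cut(scores, bins=3, labels=['bad', 'good'])
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
print(mismatch_error)
```

The comparison only works because “ordered” is “True” by default; with ordered=False, the labels would have no rank.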

Include the Lowest Value

Imagine we create the following intervals:

df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11])
   Diver  Score     Interval
0   Dave      1          NaN
1  Alice      6   (5.0, 7.0]
2   Mary      4   (3.0, 5.0]
3   John      8   (7.0, 9.0]
4   Jane      5   (3.0, 5.0]
5    Bob     10  (9.0, 11.0]

We can see that Dave’s score is not included in any interval. That’s because the “right” parameter is set to “True” by default, so the left edge of each interval is excluded. Thus, the score “1” is not included in the interval “(1.0, 3.0]”.

How can we include the score “1” without changing the “right” parameter, so that the right edge of each interval stays included?

We achieve that with the “include_lowest” parameter. By assigning that parameter the value “True”, we include the lowest value:

df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11], 
                        include_lowest=True)
   Diver  Score      Interval
0   Dave      1  (0.999, 3.0]
1  Alice      6    (5.0, 7.0]
2   Mary      4    (3.0, 5.0]
3   John      8    (7.0, 9.0]
4   Jane      5    (3.0, 5.0]
5    Bob     10   (9.0, 11.0]

Now, the value “1” is included in an interval. The first interval is displayed as “(0.999, 3.0]”, which signals that it now covers the value “1”.
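A quick way to confirm the effect is to count the missing values with and without “include_lowest”. This is a minimal sketch with illustrative variable names.

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
edges = [1, 3, 5, 7, 9, 11]

without_lowest = pd.cut(scores, bins=edges)
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

# Number of scores that did not fall into any interval.
missing_before = int(without_lowest.isna().sum())
missing_after = int(with_lowest.isna().sum())
print(missing_before, missing_after)
```

Only the score that equals the lowest edge is affected; all other scores keep their interval.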

Summary

All in all, the cut() function provides us with a lot of possibilities. We can create various intervals, choose which edge of the intervals is included, and label the intervals to categorize our data.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!