Pandas factorize() - A Simple Guide with Video

In this tutorial, we will learn how to apply the Pandas function factorize(). This function encodes a sequence of values as integer codes (an enumerated type) and returns the unique values that those codes refer to.

Here are the parameters from the official documentation:

Parameter	Type	Description
`values`	Sequence	A one-dimensional sequence. Sequences that aren’t Pandas objects are coerced to `ndarrays` before the factorization.
`sort`	`bool`, default: `False`	Sort the uniques and shuffle the codes to maintain the relationship.
`na_sentinel`	`int` or `None`, default: -1	Value used in `codes` to mark missing (`NaN`) values. If set to `None`, `NaN` is not dropped and is kept in `uniques`. Deprecated in pandas 1.5 and removed in pandas 2.0 in favor of `use_na_sentinel`.
`size_hint`	`int`, optional	Hint to the hash table sizer.

Returns	Type	Description
`codes`	`ndarray`	An integer `ndarray` that’s an indexer into `uniques`.
`uniques`	`ndarray`, `Index`, or `Categorical`	The unique values. When the values are Categorical, `uniques` is a Categorical. When `values` is another Pandas object, an `Index` is returned. Otherwise, a one-dimensional `ndarray` is returned.

The Basic Functionality of factorize()

To get started, let’s look at a coding example that shows how the factorize() function works:

import pandas as pd
codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'])

First, we import the Pandas library. Then, we call the factorize() function and pass it a list of characters. We assign the result to the two variables “codes” and “uniques” because the function returns two values.

This is what the return values look like:

>>> codes
array([0, 0, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)

The variable codes is an array that contains one integer code for each element of the initial list.

The best way to see what these numeric values represent is when we put the numeric array below the initial list:

['c', 'c', 'b', 'd', 'a', 'c', 'a']
[0, 0, 1, 2, 3, 0, 3]

We observe that the numeric values are assigned to each unique character in the original list. Since "c" is the first value from the original list, it is assigned the numeric value “0” and so on.

💡 Remember: a computer program starts counting at “0”.
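
To make the numbering rule concrete, here is a minimal pure-Python sketch of the same idea. The helper name factorize_sketch is hypothetical and this is not how Pandas implements factorize() internally; it only illustrates that each distinct value receives the next free integer the first time it appears.

```python
def factorize_sketch(values):
    """Assign each distinct value an integer code in order of first appearance."""
    code_of = {}  # maps value -> integer code
    codes = []
    for value in values:
        if value not in code_of:
            # First time we see this value: give it the next free code.
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    # Dictionaries preserve insertion order, so the keys are the uniques.
    uniques = list(code_of)
    return codes, uniques


codes, uniques = factorize_sketch(['c', 'c', 'b', 'd', 'a', 'c', 'a'])
print(codes)    # [0, 0, 1, 2, 3, 0, 3]
print(uniques)  # ['c', 'b', 'd', 'a']
```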

The codes array always has an integer data type (NumPy’s platform integer, which is displayed as “int64” here) because the codes are positions in the “uniques” array.

Variable “uniques” shows the unique values from the initial list which are "c", "b", "d", and "a".

The unique values are presented in that order because they occur in that order in the initial list.
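
Because codes is an indexer into uniques, the original values can be rebuilt from the two return values. The short sketch below shows this round trip. It wraps the list in a NumPy object array, on the assumption that recent Pandas versions may warn about or reject plain lists passed to factorize().

```python
import numpy as np
import pandas as pd

values = np.array(['c', 'c', 'b', 'd', 'a', 'c', 'a'], dtype=object)
codes, uniques = pd.factorize(values)

# Each code is a position in uniques, so indexing uniques with codes
# reproduces the original sequence.
restored = uniques[codes]
print(restored.tolist())  # ['c', 'c', 'b', 'd', 'a', 'c', 'a']
```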

The “sort” Parameter

The list we put in the factorize() function in the previous section (['c', 'c', 'b', 'd', 'a', 'c', 'a']) represents some letters from the alphabet. However, the letters here are not ordered alphabetically.

When we set the sort parameter to True, factorize() still returns one code per element in the original order, but it numbers the unique characters in sorted order:

>>> codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'], sort=True)
>>> codes
array([2, 2, 1, 3, 0, 2, 0], dtype=int64)

We perform the same factorize() function as before, but this time, we use the sort parameter and set it equal to True.

The codes array now numbers the unique characters in alphabetical order rather than in order of first appearance.

For example, "c" is assigned the numeric value 2 because it is the third of the sorted unique values "a", "b", "c", and "d".

💡 Remember: computer programs start counting at 0, so 2 is the third value and not the second one.

The variable uniques now shows the unique values in an alphabetically sorted way:

>>> uniques
array(['a', 'b', 'c', 'd'], dtype=object)
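
The following sketch compares both calls side by side. It illustrates that sort=True changes the numbering and the order of uniques, while the relationship between codes and uniques stays intact. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

values = np.array(['c', 'c', 'b', 'd', 'a', 'c', 'a'], dtype=object)

codes_unsorted, uniques_unsorted = pd.factorize(values)
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

print(codes_unsorted.tolist())  # [0, 0, 1, 2, 3, 0, 3]
print(codes_sorted.tolist())    # [2, 2, 1, 3, 0, 2, 0]

# Both pairs decode to the same original sequence.
same = (uniques_unsorted[codes_unsorted] == uniques_sorted[codes_sorted]).all()
print(same)  # True
```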

Handling Missing Values

The list that we want to factorize might contain missing values.

We will change our initial list by replacing one character with a None value. Let’s see how the factorize() function handles this case:

>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'])
>>> codes
array([ 0, -1,  1,  2,  3,  0,  3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)

The second value in the initial list is None.

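In the codes array, we can see that the None value is assigned the numeric value -1.

Missing values are handled by the function’s na_sentinel parameter. Since we do not specify this parameter here, the function uses its default value, which is -1. Note that na_sentinel was deprecated in pandas 1.5 and removed in pandas 2.0 in favor of the boolean use_na_sentinel parameter, so the na_sentinel examples in this section require an older pandas version.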

However, we can change this value by applying the na_sentinel parameter and assigning it a custom value:

>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=-10)
>>> codes
array([  0, -10,   1,   2,   3,   0,   3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)

Here, the None value from the initial list was assigned the numeric value -10 because we set na_sentinel equal to -10.

In both examples, the uniques array was the same ['c', 'b', 'd', 'a'] because the None value does not count as a unique value.
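
The default sentinel can be used to locate the missing entries. The sketch below relies only on the default -1 behavior described above. It also shows a pitfall: indexing uniques directly with a code of -1 silently picks the last unique value, so the sentinel has to be handled explicitly when decoding.

```python
import numpy as np
import pandas as pd

values = np.array(['c', None, 'b', 'd', 'a', 'c', 'a'], dtype=object)
codes, uniques = pd.factorize(values)

# Positions where the input was missing carry the sentinel -1.
missing_positions = np.flatnonzero(codes == -1).tolist()
print(missing_positions)  # [1]

# Decode manually, mapping the sentinel back to None instead of
# letting -1 index the last element of uniques.
decoded = [uniques[code] if code != -1 else None for code in codes]
print(decoded)  # ['c', None, 'b', 'd', 'a', 'c', 'a']
```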

We can also set the na_sentinel parameter equal to None:

>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=None)
>>> codes
array([0, 4, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a', nan], dtype=object)

Doing so, the None value in the initial list gets assigned the numeric value 4 in the codes array.

That’s because by setting the na_sentinel parameter equal to None we do not drop the None value, but we count it in.

Since the other characters "c", "b", "d", and "a" get the numeric values 0, 1, 2, and 3 respectively, the None value gets the next numeric value which is 4. Thus, in the uniques array, we can find the value nan after the other characters.
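
In newer Pandas versions, the same effect is requested with the boolean use_na_sentinel parameter. The following sketch assumes pandas 1.5 or later. The position of the missing value inside uniques may differ between versions, so the example only checks that it is included and does not rely on a specific code.

```python
import numpy as np
import pandas as pd

values = np.array(['c', None, 'b', 'd', 'a', 'c', 'a'], dtype=object)

# use_na_sentinel=False keeps the missing value as an ordinary unique
# instead of marking it with -1 (assumes pandas >= 1.5).
codes, uniques = pd.factorize(values, use_na_sentinel=False)

print(len(uniques))                # 5
print((codes == -1).any())         # False
print(pd.isna(uniques[codes[1]]))  # True
```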

Factorizing Other Pandas Objects

So far, we have only factorized lists. When we factorize Pandas objects, we get a different type for uniques:

>>> series = pd.Series(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(series)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
Index(['a', 'b', 'd'], dtype='object')

Here, we factorize a Pandas series.

The resulting codes array has the same structure as in the earlier examples: a NumPy array with one integer code per element.

However, the type of uniques has changed: it is now an Index instead of an array.
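
A Series also offers factorize() as a method, which returns the same pair of values. The sketch below uses that method and decodes the result with Index.take(). It is meant as an illustration of the Index return type, not as the only way to do this.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'a', 'd'])

# Series.factorize() is the method form of pd.factorize(series).
codes, uniques = series.factorize()

print(codes.tolist())               # [0, 1, 0, 2]
print(isinstance(uniques, pd.Index))  # True

# An Index can be decoded with take(), which selects by position.
decoded = uniques.take(codes)
print(decoded.tolist())             # ['a', 'b', 'a', 'd']
```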

We can also factorize a Categorical object:

>>> category = pd.Categorical(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (3, object): ['a', 'b', 'd']

Again, codes is a NumPy array just like before. But uniques is now of type Categorical.

Something special happens with a Categorical when we pass it the categories parameter:

>>> category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']

We take the same characters for the factorization as in the two examples before.

But this time, we pass the categories parameter the list ['a', 'b', 'c', 'd'] to define which categories are allowed.

As we can see, the categories list contains a "c". However, there is no "c" in the list that gets factorized, ['a', 'b', 'a', 'd'].

The codes array remains unchanged, and the values in uniques are still "a", "b", and "d". However, the Categories listing of uniques now includes "c", although there is no "c" to be factorized.
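
The difference between the values in uniques and its categories can be inspected directly. The sketch below also contrasts the factorize() codes with the Categorical’s own codes attribute, which numbers values by their position in the categories list, and uses remove_unused_categories() as one possible way to drop the unused "c".

```python
import pandas as pd

category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
codes, uniques = pd.factorize(category)

print(list(uniques))             # ['a', 'b', 'd']
print(list(uniques.categories))  # ['a', 'b', 'c', 'd']

# factorize() numbers only the values that occur, whereas the
# Categorical's own codes refer to positions in the categories list.
print(codes.tolist())            # [0, 1, 0, 2]
print(category.codes.tolist())   # [0, 1, 0, 3]

# Drop categories that never occur in the data.
trimmed = uniques.remove_unused_categories()
print(list(trimmed.categories))  # ['a', 'b', 'd']
```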

Summary

In this tutorial, we learned how to use the Pandas function factorize(). We covered its basic functionality, how to sort the values, how to handle missing values, and how to factorize different kinds of Pandas objects.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter email academy.

Happy Coding!