In this tutorial, we will learn how to apply the Pandas function `factorize()`. This function encodes an object as an **enumerated type** and determines the unique values.

Here are the parameters from the official documentation:

| Parameter | Type | Description |
| --- | --- | --- |
| `values` | Sequence | A one-dimensional sequence. Sequences that aren’t Pandas objects are coerced to `ndarrays` before the factorization. |
| `sort` | `bool`, default: `False` | Sort the uniques and shuffle the codes to maintain the relationship. |
| `na_sentinel` | `int` or `None`, default: `-1` | Value to mark `NaN` values. If set to `None`, it will not drop the `NaN` from the `uniques` of the values. |
| `size_hint` | `int`, optional | Hint to the hash table sizer. |

| Returns | Type | Description |
| --- | --- | --- |
| `codes` | `ndarray` | An integer `ndarray` that’s an indexer into `uniques`. |
| `uniques` | `ndarray`, `Index`, or `Categorical` | The unique values. When `values` is Categorical, `uniques` is a Categorical. When `values` is another Pandas object, an `Index` is returned. Otherwise, a one-dimensional `ndarray` is returned. |

## The Basic Functionality of factorize()

To get started, we will look at a coding example that shows how the `factorize()` function works:

```python
import pandas as pd

codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'])
```

We first import the Pandas library. Then we call the `factorize()` function with a list of characters. We assign the result to two variables, `codes` and `uniques`, because the function returns two values.

This is what the return values look like:

```python
>>> codes
array([0, 0, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
```

The variable `codes` is an array that contains one numeric value for each element of the initial list.

The best way to see what these numeric values represent is to put the numeric array below the initial list:

```python
['c', 'c', 'b', 'd', 'a', 'c', 'a']
[ 0,   0,   1,   2,   3,   0,   3 ]
```

We observe that a numeric value is assigned to each unique character in the original list. Since `"c"` is the first value in the original list, it is assigned the numeric value `0`, and so on.

💡 **Remember**: a computer program starts counting at “0”.

The data type of the `codes` array is an integer type (`int64` in this output) because the codes are always integers.

The variable `uniques` shows the unique values from the initial list, which are `"c"`, `"b"`, `"d"`, and `"a"`.

The unique values are presented in that order because they occur in that order in the initial list.
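One way to check the relationship between the two return values is to index `uniques` with `codes`, which should rebuild the original list. The following is a minimal sketch; the names `values` and `reconstructed` are chosen only for this illustration.

```python
import pandas as pd

values = ['c', 'c', 'b', 'd', 'a', 'c', 'a']
codes, uniques = pd.factorize(values)

# Each code is a position in uniques, so indexing uniques with
# codes looks up the original value for every element.
reconstructed = list(uniques[codes])

print(reconstructed == values)  # True
```

This round trip is what the returns table means when it describes `codes` as an indexer into `uniques`.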

## The “sort” Parameter

The list we put in the `factorize()` function in the previous section (`['c', 'c', 'b', 'd', 'a', 'c', 'a']`) represents some letters from the alphabet. However, the letters here are not ordered alphabetically.

When we apply the `sort` parameter, `factorize()` keeps the elements in their original order but numbers the characters according to their sorted order:

```python
>>> codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'], sort=True)
>>> codes
array([2, 2, 1, 3, 0, 2, 0], dtype=int64)
```

We call the same `factorize()` function as before, but this time we set the `sort` parameter to `True`.

The variable `codes` now contains the codes that correspond to the alphabetically ordered unique characters.

For example, `"c"` is assigned the numeric value `2` because it is the third value among the sorted unique characters.

💡 **Remember**: computer programs start counting at 0, so 2 is the third value and not the second one.

The variable `uniques` now shows the unique values in alphabetical order:

```python
>>> uniques
array(['a', 'b', 'c', 'd'], dtype=object)
```
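To illustrate that sorting only changes the numbering and not the encoded data, the sketch below factorizes the same list with and without `sort=True` and decodes both results. The variable names are assumptions made for this example.

```python
import pandas as pd

values = ['c', 'c', 'b', 'd', 'a', 'c', 'a']

codes_unsorted, uniques_unsorted = pd.factorize(values)
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# The codes differ, but both pairs decode to the same values.
decoded_unsorted = list(uniques_unsorted[codes_unsorted])
decoded_sorted = list(uniques_sorted[codes_sorted])

print(codes_unsorted.tolist())  # [0, 0, 1, 2, 3, 0, 3]
print(codes_sorted.tolist())    # [2, 2, 1, 3, 0, 2, 0]
print(decoded_unsorted == decoded_sorted == values)  # True
```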

## Handling Missing Values

The list that we want to factorize might contain missing values.

We will change our initial list by replacing one character with a `None` value. Let’s see how the `factorize()` function handles this case:

```python
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'])
>>> codes
array([ 0, -1,  1,  2,  3,  0,  3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
```

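The second value in the initial list is `None`.

In the resulting `codes` array, we can see that the `None` value is assigned the numeric value `-1`.

The function’s `na_sentinel` parameter controls how missing values are marked. Since we do not specify this parameter here, the function uses the parameter’s default value, which is `-1`. Note that `na_sentinel` was deprecated in pandas 1.5 and removed in pandas 2.0 in favor of the boolean `use_na_sentinel` parameter, so the `na_sentinel` examples in this section require an older pandas version.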

However, we can change this value by applying the `na_sentinel` parameter and assigning it a custom value:

```python
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=-10)
>>> codes
array([  0, -10,   1,   2,   3,   0,   3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
```

Here, the `None` value from the initial list was assigned the numeric value `-10` because we set `na_sentinel` equal to `-10`.

In both examples, the `uniques` array was the same, `['c', 'b', 'd', 'a']`, because the `None` value does not count as a unique value.
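The default sentinel can also be handled after the call, which does not depend on the `na_sentinel` parameter. The sketch below builds a mask of the missing positions and swaps `-1` for a custom code with NumPy; the names `is_missing` and `custom_codes` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

values = ['c', None, 'b', 'd', 'a', 'c', 'a']
codes, uniques = pd.factorize(values)

# With the default settings, missing values receive the code -1.
is_missing = codes == -1
print(is_missing.tolist())
# [False, True, False, False, False, False, False]

# Replace the default sentinel with a custom one after the fact.
custom_codes = np.where(is_missing, -10, codes)
print(custom_codes.tolist())
# [0, -10, 1, 2, 3, 0, 3]
```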

We can also set the `na_sentinel` parameter equal to `None`:

```python
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=None)
>>> codes
array([0, 4, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a', nan], dtype=object)
```

In this case, the `None` value in the initial list is assigned the numeric value `4` in the `codes` array.

That’s because setting the `na_sentinel` parameter equal to `None` does not drop the `None` value but counts it as a value of its own.

Since the other characters `"c"`, `"b"`, `"d"`, and `"a"` get the numeric values 0, 1, 2, and 3 respectively, the `None` value gets the next numeric value, which is 4. Thus, in the `uniques` array, we can find the value `nan` after the other characters.
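Because pandas 1.5 and later provide `use_na_sentinel=False` as the counterpart of `na_sentinel=None`, a version-tolerant call could look like the sketch below. The helper `factorize_keep_na` is hypothetical and not part of pandas, and the position of the missing-value entry in `uniques` may differ between pandas versions, so the example only checks properties that do not depend on that position.

```python
import pandas as pd


def factorize_keep_na(values):
    """Hypothetical helper: factorize while keeping missing values in uniques."""
    try:
        # pandas 1.5 and later accept the boolean use_na_sentinel parameter.
        return pd.factorize(values, use_na_sentinel=False)
    except TypeError:
        # Older pandas versions only know the na_sentinel parameter.
        return pd.factorize(values, na_sentinel=None)


codes, uniques = factorize_keep_na(['c', None, 'b', 'd', 'a', 'c', 'a'])

print((codes >= 0).all())      # True: no negative sentinel code is used
print(len(uniques))            # 5: four letters plus one missing-value entry
print(pd.isna(uniques).sum())  # 1: exactly one entry is a missing value
```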

## Factorizing Other Pandas Objects

So far, we have only factorized lists. When we factorize Pandas objects, we get a different type for `uniques`:

```python
>>> series = pd.Series(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(series)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
Index(['a', 'b', 'd'], dtype='object')
```

Here, we factorize a Pandas series.

The resulting `codes` array is structured the same way as in the previous examples: it is an array of numeric representations of our characters.

However, the `uniques` output has changed because its type is now `Index` instead of an array as in the examples above.
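A Series also offers `factorize()` as a method, which returns the same pair of values. The following minimal sketch uses the method form and checks the type of `uniques`; it assumes the same example data as above.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'a', 'd'])

# Series.factorize() is the method form of pd.factorize(series).
codes, uniques = series.factorize()

print(isinstance(uniques, pd.Index))  # True
print(codes.tolist())                 # [0, 1, 0, 2]
print(uniques.tolist())               # ['a', 'b', 'd']
```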

We can also factorize a `Categorical` object:

```python
>>> category = pd.Categorical(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (3, object): ['a', 'b', 'd']
```

Again, `codes` is an array just like before. But `uniques` is now a `Categorical`.

Something special happens when we pass the `categories` parameter to `Categorical`:

```python
>>> category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
```

We take the same characters for the factorization as in the two examples before.

But this time, we apply the `categories` parameter and assign it the list `['a', 'b', 'c', 'd']` to determine which categories we want to get.

As we can see, the category list contains a `"c"`. However, there is no `"c"` in the list that gets factorized, `['a', 'b', 'a', 'd']`.

The variable `codes` remains unchanged, and the values in `uniques` are still `a`, `b`, and `d`. However, the `Categories` list of `uniques` now includes `c`, although there is no `c` to be factorized.
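The difference between the values and the declared categories can be inspected directly. The sketch below prints both for `uniques`, and also contrasts the codes returned by `factorize()` with the `codes` attribute of the `Categorical` itself, which refers to positions in the category list.

```python
import pandas as pd

category = pd.Categorical(['a', 'b', 'a', 'd'],
                          categories=['a', 'b', 'c', 'd'])
codes, uniques = pd.factorize(category)

print(list(uniques))             # ['a', 'b', 'd']
print(list(uniques.categories))  # ['a', 'b', 'c', 'd']

# factorize() numbers values by first appearance, while
# Categorical.codes holds positions in the categories list.
print(codes.tolist())            # [0, 1, 0, 2]
print(category.codes.tolist())   # [0, 1, 0, 3]
```

Here `"d"` gets the code `2` from `factorize()` because it is the third distinct value encountered, but `3` in `category.codes` because it is the fourth declared category.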

## Summary

In this tutorial, we learned about the Pandas function `factorize()`. We covered the basic functionality of this function, how to sort the values, how to handle missing values, and how to factorize different kinds of Pandas objects.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter email academy.

Happy Coding!