In this tutorial, we will learn how to apply the Pandas function factorize(). This function encodes an object as an enumerated type and determines its unique values.
Here are the parameters from the official documentation:
Parameter | Type | Description |
values | sequence | A one-dimensional sequence. Sequences that aren’t Pandas objects are coerced to ndarrays before the factorization. |
sort | bool, default: False | Sort the uniques and shuffle the codes to maintain the relationship. |
na_sentinel | int or None, default: -1 | Value used to mark NaN values. If set to None, NaN is not dropped from the uniques. Deprecated in pandas 1.5 in favor of use_na_sentinel and removed in pandas 2.0. |
size_hint | int, optional | Hint to the hash table sizer. |
Returns | Type | Description |
codes | ndarray | An integer ndarray that’s an indexer into uniques. |
uniques | ndarray, Index, or Categorical | The unique values. When values is a Categorical, uniques is a Categorical. When values is another Pandas object, an Index is returned. Otherwise, a one-dimensional ndarray is returned. |
The Basic Functionality of factorize()
Let’s start with a coding example that shows how the factorize() function works:
import pandas as pd
codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'])
First, we import the Pandas library. Then, we call the factorize() function and pass it a list of characters. We assign the result to the two variables “codes” and “uniques” because the function returns two values.
This is what the return values look like:
>>> codes
array([0, 0, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
The variable codes is an array that contains one numeric value for each element of the initial list.
The best way to see what these numeric values represent is to put the numeric array below the initial list:
['c', 'c', 'b', 'd', 'a', 'c', 'a']
[0, 0, 1, 2, 3, 0, 3]
We observe that each unique character in the original list is assigned its own numeric value. Since "c" is the first value in the original list, it is assigned the numeric value “0”. The next new value, "b", is assigned “1”, and so on.
💡 Remember: Python starts counting at “0”.
The data type of the “codes” array is “int64” here because the codes are always integers. The exact integer width can depend on the platform.
The variable “uniques” holds the unique values from the initial list, which are "c", "b", "d", and "a".
The unique values appear in that order because that is the order of their first occurrence in the initial list.
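The returns table describes codes as an indexer into uniques. A minimal sketch of what that means in practice, reusing the list from above (the variable names values and restored are only illustrative):

```python
import pandas as pd

values = ['c', 'c', 'b', 'd', 'a', 'c', 'a']
codes, uniques = pd.factorize(values)

# Pair every original value with the code it was assigned.
for value, code in zip(values, codes):
    print(f"{value!r} -> {code}")

# Because codes indexes into uniques, uniques[codes] rebuilds the input.
restored = uniques[codes]
print(list(restored) == values)
```

This round trip is what makes the encoding lossless as long as the input has no missing values.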
The “sort” Parameter
The list we passed to the factorize() function in the previous section (['c', 'c', 'b', 'd', 'a', 'c', 'a']) contains some letters of the alphabet. However, the letters are not ordered alphabetically.
When we apply the sort parameter, factorize() keeps the positions of the input but numbers the characters according to their sorted order:
>>> codes, uniques = pd.factorize(['c', 'c', 'b', 'd', 'a', 'c', 'a'], sort=True)
>>> codes
array([2, 2, 1, 3, 0, 2, 0], dtype=int64)
We call the same factorize() function as before, but this time, we use the sort parameter and set it to True.
The variable codes now holds the numbers that result from ordering the unique characters alphabetically.
For example, "c" is assigned the numeric value 2 because it is the third value among the sorted unique values "a", "b", "c", and "d".
💡 Remember: Python starts counting at 0, so 2 marks the third value and not the second one.
The variable uniques now shows the unique values in alphabetical order:
>>> uniques
array(['a', 'b', 'c', 'd'], dtype=object)
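To make the effect of sort=True more tangible, here is a small sketch that compares both variants. The NumPy cross-check at the end is not part of the original tutorial. It only assumes that, for data without missing values, np.unique with return_inverse=True yields the same sorted encoding.

```python
import numpy as np
import pandas as pd

values = ['c', 'c', 'b', 'd', 'a', 'c', 'a']

codes_default, uniques_default = pd.factorize(values)
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# With sort=True, each code is the position of the value in the sorted uniques.
positions = {value: index for index, value in enumerate(uniques_sorted)}
print(positions)

# The codes differ between the two variants, but both rebuild the same input.
print(list(uniques_default[codes_default]) == list(uniques_sorted[codes_sorted]))

# Cross-check (assumption: no missing values in the input).
np_uniques, np_inverse = np.unique(values, return_inverse=True)
print(list(np_inverse) == list(codes_sorted))
```

Sorting changes only the numbering, not the information that is stored.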
Handling Missing Values
The list that we want to factorize might contain missing values.
We will change our initial list by replacing one character with a None value. Let’s see how the factorize() function handles this case:
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'])
>>> codes
array([ 0, -1,  1,  2,  3,  0,  3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
The second value in the initial list is None.
In the resulting codes array, we can see that the None value is assigned the numeric value -1.
The function’s parameter na_sentinel is used to handle missing values. Since we do not specify this parameter here, the function uses the parameter’s default value, which is -1.
However, we can change this value by passing the na_sentinel parameter with a custom value:
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=-10)
>>> codes
array([  0, -10,   1,   2,   3,   0,   3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a'], dtype=object)
Here, the None value from the initial list was assigned the numeric value -10 because we set na_sentinel equal to -10.
In both examples, the uniques array was the same, ['c', 'b', 'd', 'a'], because the None value does not count as a unique value.
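One practical consequence of the default sentinel is worth a short sketch: -1 is also a valid NumPy index, so indexing uniques with the raw codes does not bring the missing value back. The names missing, naive, and restored below are only illustrative.

```python
import pandas as pd

values = ['c', None, 'b', 'd', 'a', 'c', 'a']
codes, uniques = pd.factorize(values)

# The sentinel marks the positions of the missing values.
missing = codes == -1
print(missing.sum())

# Pitfall: -1 selects the LAST unique value instead of a missing value.
naive = list(uniques[codes])
print(naive[1])

# Handle the sentinel explicitly to rebuild the original list.
restored = [None if code == -1 else uniques[code] for code in codes]
print(restored == values)
```

Checking for the sentinel before indexing avoids silently replacing a missing value with a real one.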
We can also set the na_sentinel parameter equal to None:
>>> codes, uniques = pd.factorize(['c', None, 'b', 'd', 'a', 'c', 'a'], na_sentinel=None)
>>> codes
array([0, 4, 1, 2, 3, 0, 3], dtype=int64)
>>> uniques
array(['c', 'b', 'd', 'a', nan], dtype=object)
Doing so, the None value in the initial list is assigned the numeric value 4 in the codes array.
That’s because by setting the na_sentinel parameter equal to None, we do not drop the None value but count it as a unique value.
Since the other characters "c", "b", "d", and "a" get the numeric values 0, 1, 2, and 3 respectively, the None value gets the next numeric value, which is 4. Thus, in the uniques array, we can find the value nan after the other characters.
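The na_sentinel parameter was deprecated in pandas 1.5 and removed in pandas 2.0. Its replacement is the boolean parameter use_na_sentinel. The following sketch tries the newer spelling first and falls back to the older one. The try/except wrapper is only an illustrative way to support both versions, and the position of the missing value in uniques is deliberately not assumed, since it may differ between releases.

```python
import pandas as pd

values = ['c', None, 'b', 'd', 'a', 'c', 'a']

try:
    # pandas 1.5 and later: keep missing values among the uniques.
    codes, uniques = pd.factorize(values, use_na_sentinel=False)
except TypeError:
    # Older pandas releases, as used in the examples above.
    codes, uniques = pd.factorize(values, na_sentinel=None)

print(codes)
print(uniques)

# Whatever code the missing value received, it points at a missing entry.
na_code = codes[1]
print(pd.isna(uniques[na_code]))
```

With this setting, no code is negative, so every code is a valid position in uniques.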
Factorizing Other Pandas Objects
So far, we have only factorized lists. When we factorize Pandas objects, we get a different type for uniques:
>>> series = pd.Series(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(series)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
Index(['a', 'b', 'd'], dtype='object')
Here, we factorize a Pandas Series.
The resulting codes array is structured the same way as in the examples before: it is an array that holds the numeric representations of our characters.
However, the uniques output has changed because its type is now Index instead of “array” like in the examples above.
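As a small sketch of how this might be used, a Series also offers factorize() as a method, and the returned Index can rebuild the values with take(). The DataFrame named encoded is only an illustration of how the codes could be stored next to the original values.

```python
import pandas as pd

series = pd.Series(['a', 'b', 'a', 'd'])

# Method form of pd.factorize(series).
codes, uniques = series.factorize()

# An Index supports take(), which rebuilds the values from the codes.
restored = uniques.take(codes)
print(list(restored))

# Keep the codes next to the original values.
encoded = pd.DataFrame({'value': series, 'code': codes})
print(encoded)
```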
We can also factorize a Categorical object:
>>> category = pd.Categorical(['a', 'b', 'a', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (3, object): ['a', 'b', 'd']
Again, codes is a NumPy array just like before. But uniques is now a Categorical.
Something special happens when we pass the categories parameter to Categorical:
>>> category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
>>> codes, uniques = pd.factorize(category)
>>> codes
array([0, 1, 0, 2], dtype=int64)
>>> uniques
['a', 'b', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
We take the same characters for the factorization as in the two examples before.
But this time, we pass the categories parameter and assign it the list ['a', 'b', 'c', 'd'] to determine which categories are allowed.
As we can see, the category list contains a "c". However, there is no "c" in the list that gets factorized, ['a', 'b', 'a', 'd'].
The variable codes remains unchanged, and the values of uniques are still "a", "b", and "d". But the Categories list of uniques now includes "c" although there is no "c" to be factorized.
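The difference between the observed values and the declared categories can be inspected directly. The sketch below uses the categories attribute and the remove_unused_categories() method of Categorical. The names observed, declared, and trimmed are only illustrative.

```python
import pandas as pd

category = pd.Categorical(['a', 'b', 'a', 'd'], categories=['a', 'b', 'c', 'd'])
codes, uniques = pd.factorize(category)

# Values that actually occur in the data.
observed = list(uniques)
print(observed)

# Categories that were declared, including the unused "c".
declared = list(uniques.categories)
print(declared)

# Drop the categories that never occur in the data.
trimmed = uniques.remove_unused_categories()
print(list(trimmed.categories))
```

Note that codes indexes into the values of uniques, not into the declared categories, which is why "d" keeps the code 2.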
Summary
In this tutorial, we learned how to use the Pandas function factorize(). We covered its basic functionality, how to sort the values, how to handle missing values, and how to factorize different kinds of Pandas objects.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter email academy.
Happy Coding!