Pandas get_dummies() – A Simple Guide

In this tutorial, we will learn about the Pandas function get_dummies(). This function converts categorical data into dummy (indicator) variables, a technique also known as one-hot encoding.

Here are the parameters from the official documentation:

| Parameter | Type | Description |
|---|---|---|
| data | array-like, Series, or DataFrame | Data of which to get the dummy indicators. |
| prefix | str, list of str, or dict of str, default None | String to prepend to the new column names. Pass a list with a length equal to the number of columns being encoded when calling get_dummies() on a DataFrame. Alternatively, prefix can be a dictionary mapping the column names to the prefixes. |
| prefix_sep | str, default '_' | Separator/delimiter placed between the prefix and the value. Or pass a list or dictionary as with the prefix. |
| dummy_na | bool, default False | Add a column to indicate the NaN values. If False, NaN values are ignored. |
| columns | list-like, default None | Column names in the DataFrame to be encoded. If columns is None, all the columns with object or category dtype will be converted. |
| sparse | bool, default False | Whether the dummy-encoded columns should be backed by a SparseArray (True) or by a regular NumPy array (False). |
| drop_first | bool, default False | Whether to get k-1 dummies out of k categorical levels by removing the first level. |
| dtype | dtype, default np.uint8 (bool since pandas 2.0) | Data type for the new columns. Only a single dtype is allowed. |

| Returns | Description |
|---|---|
| DataFrame | Dummy-coded data |
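
The drop_first parameter from the table is not used in the examples in this guide. As a minimal sketch, this is how it could be tried out. The variable names full and reduced are our own, and dtype=int is passed only so that the result holds the integers 0 and 1 on every pandas version:

```python
import pandas as pd

x = ['a', 'b', 'c', 'a', 'c']

# All k = 3 levels get their own column.
full = pd.get_dummies(x, dtype=int)

# drop_first=True keeps k - 1 = 2 columns by dropping the first level ("a").
reduced = pd.get_dummies(x, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

With drop_first=True, a row that contains only zeros stands for the dropped level "a".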

The Basic Functionality of get_dummies()

We will start with a simple example to understand how get_dummies() works and where we can apply it:

import pandas as pd
x = ['a', 'b', 'c', 'a', 'c']
pd.get_dummies(x)

Output:

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  0  1

First, we import the Pandas library to be able to use the method.

Second, we create a simple Python list that contains several characters and assign this list to the variable "x".

Third, we call the get_dummies() function and pass the list "x" as the argument.

The output is a Pandas data frame.

The data frame consists of the columns "a", "b", and "c" and the rows "0", "1", "2", "3", and "4". The cell entries are either "0" or "1".

So, what exactly is happening here?

The column labels "a", "b", and "c" are the unique characters from the list that we applied (['a', 'b', 'c', 'a', 'c']).

The number of rows in the data frame equals the length of the list. There are five rows and five characters. The ones and zeros in the data frame are the actual dummy variables.

When we have a look at the first entry (column: "a", row: "0"), we observe that this value is a "1". That means that the first entry of the list is the character "a", because the "1" is in row "0" (remember: Python starts counting at 0) and in column "a".

Another example is the data frame entry in row "2" and column "c": This entry is also "1" because there is a "c" in third place in the list.
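
To check this reading of the table, here is a small sketch. The names dummies and recovered are our own. We pass dtype=int explicitly because the default dtype of the dummy columns depends on the pandas version:

```python
import pandas as pd

x = ['a', 'b', 'c', 'a', 'c']

# dtype=int gives integer 0/1 values regardless of the version's default dtype.
dummies = pd.get_dummies(x, dtype=int)

# Every row contains exactly one 1.
ones_per_row = dummies.sum(axis=1).tolist()

# The column label holding the 1 in each row is the original list entry.
recovered = dummies.idxmax(axis=1).tolist()

print(ones_per_row)
print(recovered)
```

If the encoding works as described, recovered is equal to the original list x.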

Handling NaN Values

In this section, we will find out how the get_dummies() function handles NaN values.

To do so, we create another Python list. This list contains the same values as the one from the first example, except that the last character is replaced with a NaN value:

import numpy as np
y = ['a', 'b', 'c', 'a', np.nan]
pd.get_dummies(y)

Output:

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  0  0

The new list is assigned to the variable y.

As we can see, the list contains the values "a", "b", "c", and np.nan. The latter is a NaN value that we created using the NumPy library, which is why we had to import that library here.

The get_dummies() function creates a data frame just like in the first example.

Again, we get three columns "a", "b", and "c" and five rows. The only difference compared to the first example is the last row. Here, we have zeros exclusively. That’s because the last value from the list is a NaN value which we can’t assign to either "a", "b", or "c".

However, we can make the NaN value visible in the resulting data frame by applying the dummy_na parameter:

pd.get_dummies(y, dummy_na=True)

Output:

   a  b  c  NaN
0  1  0  0    0
1  0  1  0    0
2  0  0  1    0
3  1  0  0    0
4  0  0  0    1

We set this parameter to True. That way, we add another column with the label NaN.

In the resulting data frame, the last row’s NaN entry is now 1 because of the NaN value in the list.
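
The two behaviors can be compared side by side. The following is a minimal sketch with variable names of our own choosing. Again, dtype=int is passed only to get 0/1 integers on every pandas version:

```python
import numpy as np
import pandas as pd

y = ['a', 'b', 'c', 'a', np.nan]

without_nan = pd.get_dummies(y, dtype=int)
with_nan = pd.get_dummies(y, dummy_na=True, dtype=int)

# Without dummy_na, a missing value shows up as a row of zeros only.
missing_rows = without_nan.index[without_nan.sum(axis=1) == 0].tolist()

# With dummy_na=True, an extra last column flags the missing values.
# Its label is NaN, so we select it by position instead of by name.
nan_flags = with_nan.iloc[:, -1].tolist()

print(missing_rows)
print(nan_flags)
```

Selecting the NaN column by position avoids having to look up a NaN label, which can be awkward because NaN does not compare equal to itself.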

Apply get_dummies() to a DataFrame

By now, we have seen how to apply the get_dummies() function on lists.

However, we can also apply this function to DataFrames. So, let’s create a simple data frame:

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'], 'C': [1,2,3], 'D': [4,5,6]})
print(df)

Output:

   A  B  C  D
0  a  a  1  4
1  b  c  2  5
2  b  b  3  6

We get four columns "A", "B", "C", and "D" and three rows "0", "1", and "2". The columns "A" and "B" contain characters, whereas columns "C" and "D" contain integer values.

Now, we apply get_dummies() with this DataFrame:

pd.get_dummies(df)

Result:

   C  D  A_a  A_b  B_a  B_b  B_c
0  1  4    1    0    1    0    0
1  2  5    0    1    0    0    1
2  3  6    0    1    0    1    0

The columns "C" and "D" remain unchanged because only columns with either “object” or “category” data type will be converted.

We also get two "A_" columns and three "B_" columns. That’s because in the initial data frame there are only two unique values in column "A" and three unique values in column "B".

The ones and zeros in the resulting data frame are the dummy variables, just as in the examples above where we applied the get_dummies() function on lists.

For example, the "1" in the first row of the "A_a" column means that the first value from the "A" column in the initial data frame is the character "a".
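
As a quick check of the column layout, here is a short sketch. The names encoded and new_columns are our own, and dtype=int is an addition so that the dummy columns hold 0/1 integers on every pandas version:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

encoded = pd.get_dummies(df, dtype=int)

# The numeric columns come first and are passed through unchanged.
numeric_unchanged = encoded[['C', 'D']].equals(df[['C', 'D']])

# The new columns are named <original column>_<value>.
new_columns = [c for c in encoded.columns if c not in df.columns]

print(numeric_unchanged)
print(new_columns)
```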

The “columns” parameter

Especially with large data frames, we might want to convert only specific columns instead of every eligible column. For this, we use the “columns” parameter and pass it the labels of the columns that we want to convert.

We use the data frame again that we created in the previous section:

   A  B  C  D
0  a  a  1  4
1  b  c  2  5
2  b  b  3  6

But now, when applying the get_dummies() function, we add the “columns” parameter and assign it a list with the list entry "B" to state that we only want to get the dummy variables of this column:

pd.get_dummies(df, columns=['B'])

Result:

   A  C  D  B_a  B_b  B_c
0  a  1  4    1    0    0
1  b  2  5    0    0    1
2  b  3  6    0    1    0

The first three columns of the resulting data frame, "A", "C", and "D", are unchanged. They are the same as in the initial data frame.

When the “columns” parameter is set, only the listed columns are encoded. The columns "A", "C", and "D" remain unchanged because we did not add them to our “columns” parameter’s list, regardless of their data type.

The last three columns in the resulting data frame are the encoded variables from column "B".

By default, the columns parameter is set to None. This way, all columns with either “object” or “category” data type will be converted. We saw that in the previous examples where we did not set the columns parameter.
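
One detail worth trying out: a column that is listed in the columns parameter gets encoded even if it holds numbers. The following is a minimal sketch that reuses the data frame from above. The names only_b and with_c are our own, and dtype=int is passed only to get 0/1 integers on every pandas version:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# Encode only column "B"; "A" stays as it is although it holds strings.
only_b = pd.get_dummies(df, columns=['B'], dtype=int)

# Encode only the numeric column "C"; each distinct number becomes a level.
with_c = pd.get_dummies(df, columns=['C'], dtype=int)

print(only_b.columns.tolist())
print(with_c.columns.tolist())
```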

Changing the Prefixes

We can change the prefixes for our new columns in the resulting data frames by adding the prefix parameter.

Again, we use the data frame df for this purpose:

   A  B  C  D
0  a  a  1  4
1  b  c  2  5
2  b  b  3  6

Now, we perform the get_dummies() operation on this data frame and add the prefix parameter, to which we pass a list with the prefix labels for the converted columns.

This list must have the same length as the number of columns that get converted:

pd.get_dummies(df, prefix=['column1', 'column2'])

Result:

   C  D  column1_a  column1_b  column2_a  column2_b  column2_c
0  1  4          1          0          1          0          0
1  2  5          0          1          0          0          1
2  3  6          0          1          0          1          0

Since two columns get encoded ("A" and "B"), we pass two prefixes to the prefix parameter, "column1" and "column2".

The resulting data frame shows the new prefixes for the encoded columns.

If we want to, we can also change the prefix separator by adding the prefix_sep parameter:

pd.get_dummies(df, prefix=['column1', 'column2'], prefix_sep=':')

Result:

   C  D  column1:a  column1:b  column2:a  column2:b  column2:c
0  1  4          1          0          1          0          0
1  2  5          0          1          0          0          1
2  3  6          0          1          0          1          0

We perform the same get_dummies() operation as before, but we add the prefix_sep parameter and set it to ":".

By default, the separator is "_", but we can change it to whatever we want.
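
As the parameter table notes, prefix also accepts a dictionary that maps column names to prefixes. Here is a minimal sketch of that variant. The prefix labels "first" and "second" and the variable name renamed are our own, and dtype=int is passed only to get 0/1 integers on every pandas version:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# The dictionary states explicitly which column gets which prefix.
renamed = pd.get_dummies(
    df,
    prefix={'A': 'first', 'B': 'second'},
    prefix_sep=':',
    dtype=int,
)

print(renamed.columns.tolist())
```

Compared with a list, the dictionary does not depend on the order of the columns in the data frame.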

Summary

In this tutorial, we covered the Pandas function get_dummies().

We learned the basic functionality of this method, how to handle NaN values, how to perform the function on data frames as well as lists, how to only encode specific columns, and how to set different prefixes.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter email academy.

Happy Coding!