Pandas get_dummies() - A Simple Guide with Video

In this tutorial, we will learn about the Pandas function get_dummies(). This function converts categorical data into dummy (indicator) variables, a technique also known as one-hot encoding.

Here are the parameters from the official documentation:

Parameter	Type	Description
`data`	array-like, Series, or DataFrame	Data of which to get the dummy indicators.
`prefix`	`str`, list of `str`, or `dict` of `str`, default `None`	String to prepend to the names of the new dummy columns. When calling `get_dummies()` on a DataFrame, pass a list whose length equals the number of columns being encoded. Alternatively, `prefix` can be a dictionary mapping the column names to the prefixes.
`prefix_sep`	`str`, default ‘_’	Separator/delimiter placed between the prefix and the category value. A list or dictionary can be passed, as with `prefix`.
`dummy_na`	`bool`, default `False`	Whether to add a column that indicates `NaN` values. If `False`, `NaN` values are ignored.
`columns`	list-like, default `None`	Column names in the DataFrame to be encoded. If columns is `None`: all the columns with object or category `dtype` will be converted.
`sparse`	`bool`, default `False`	Whether the dummy-encoded columns should be backed by a `SparseArray` (`True`) or by a regular NumPy array (`False`).
`drop_first`	`bool`, default `False`	Whether to get k-1 dummies out of k categorical levels by removing the first level.
`dtype`	dtype, default `bool` (`np.uint8` before pandas 2.0)	Data type for the new columns. Only a single `dtype` is allowed.

Returns	Type	Description
	DataFrame	Dummy-coded data
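
The `drop_first` and `dtype` parameters are not used in the examples further down, so here is a minimal sketch of how they might be applied. The Series `colors` and its values are made up for illustration; `dtype=int` is passed explicitly so that the result contains 0/1 integers regardless of the installed pandas version.

```python
import pandas as pd

# Hypothetical sample data, not part of the original tutorial.
colors = pd.Series(["red", "green", "blue", "green"])

# One column per category (k = 3), sorted alphabetically.
full = pd.get_dummies(colors, dtype=int)

# drop_first=True removes the first level ("blue"), leaving k-1 = 2 columns.
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(full)
print(reduced)
```

A row of all zeros in `reduced` then stands for the dropped level, which is a common way to avoid redundant columns in regression models.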

The Basic Functionality of get_dummies()

We start with a simple example to see how get_dummies() works and where we can apply it:

import pandas as pd
x = ['a', 'b', 'c', 'a', 'c']
pd.get_dummies(x)

Output:

	a	b	c
0	1	0	0
1	0	1	0
2	0	0	1
3	1	0	0
4	0	0	1

First, we import the Pandas library to be able to use the method.

Second, we create a simple Python list that contains several characters and assign this list to the variable "x".

Third, we call the get_dummies() function and pass the list "x" as the argument inside the parentheses.

The output is a Pandas DataFrame. Note that in pandas 2.0 and later the default dtype of the dummy columns is bool, so the cells are displayed as True and False instead of 1 and 0 unless we pass dtype=int.

The data frame consists of the columns "a", "b", and "c" and the rows "0", "1", "2", "3", and "4". The cell entries are either "0" or "1".

So, what exactly is happening here?

The column labels "a", "b", and "c" are the unique characters from the list that we passed in (['a', 'b', 'c', 'a', 'c']).

The number of rows in the data frame equals the length of the list. There are five rows and five characters. The ones and zeros in the data frame are the actual dummy variables.

When we have a look at the first entry (column: "a", row: "0"), we observe that this value is a "1". That means that the first entry of the list is the character "a", because the "1" sits in row "0" (remember: Python starts counting at 0) and in column "a".

Another example is the data frame entry in row "2" and column "c": this entry is also "1" because the third element of the list is a "c".
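
To make the mapping between the list and the dummy columns more tangible, the following sketch checks that every row contains exactly one "1" and then recovers the original list from the column labels. The variable names `dummies`, `row_sums`, and `recovered` are chosen for illustration only.

```python
import pandas as pd

x = ['a', 'b', 'c', 'a', 'c']

# dtype=int gives 0/1 integers on every pandas version.
dummies = pd.get_dummies(x, dtype=int)

# Each row should contain exactly one 1.
row_sums = dummies.sum(axis=1).tolist()
print(row_sums)  # [1, 1, 1, 1, 1]

# The label of the column holding that 1 is the original list value.
recovered = dummies.idxmax(axis=1).tolist()
print(recovered)  # ['a', 'b', 'c', 'a', 'c']
```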

Handling NaN Values

In this section, we will find out how the get_dummies() function handles NaN values.

To do so, we create another Python list. This list contains the same values as the one from the first example, except that the last character is replaced with a NaN value:

import numpy as np
y = ['a', 'b', 'c', 'a', np.nan]
pd.get_dummies(y)

Output:

	a	b	c
0	1	0	0
1	0	1	0
2	0	0	1
3	1	0	0
4	0	0	0

The new list is assigned to the variable y.

As we can see, the list contains the values "a", "b", "c", and np.nan. The latter is a NaN value provided by the NumPy library, which is why we had to import that library here.

The get_dummies() function creates a data frame just like in the first example.

Again, we get three columns "a", "b", and "c" and five rows. The only difference compared to the first example is the last row. Here, we have zeros exclusively. That’s because the last value from the list is a NaN value which we can’t assign to either "a", "b", or "c".

However, we can make the NaN value visible in the resulting data frame by applying the dummy_na parameter:

pd.get_dummies(y, dummy_na=True)

Output:

	a	b	c	NaN
0	1	0	0	0
1	0	1	0	0
2	0	0	1	0
3	1	0	0	0
4	0	0	0	1

We set this parameter to True. That way, we add another column with the label NaN.

In the resulting data frame, the last row’s NaN entry is now 1 because of the NaN value in the list.
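
As a small sketch of the two behaviors side by side: without `dummy_na`, a missing value shows up as a row of zeros, which we can use to locate it; with `dummy_na=True`, every row sums to 1 again. The names `without_na`, `with_na`, and `missing_rows` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

y = ['a', 'b', 'c', 'a', np.nan]

without_na = pd.get_dummies(y, dtype=int)
with_na = pd.get_dummies(y, dummy_na=True, dtype=int)

# Rows whose dummies are all zero correspond to the missing values.
missing_rows = without_na.index[without_na.sum(axis=1) == 0].tolist()
print(missing_rows)  # [4]

# With the extra NaN column, every row contains exactly one 1.
print(with_na.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```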

Apply get_dummies() to a DataFrame

By now, we have seen how to apply the get_dummies() function on lists.

However, we can also apply this function to DataFrames. So, let’s create a simple data frame:

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'], 'C': [1,2,3], 'D': [4,5,6]})
print(df)

	A	B	C	D
0	a	a	1	4
1	b	c	2	5
2	b	b	3	6

We get four columns "A", "B", "C", and "D" and three rows "0", "1", and "2". The columns "A" and "B" contain characters, whereas columns "C" and "D" contain integer values.

Now, we apply get_dummies() to this DataFrame:

pd.get_dummies(df)

Result:

	C	D	A_a	A_b	B_a	B_b	B_c
0	1	4	1	0	1	0	0
1	2	5	0	1	0	0	1
2	3	6	0	1	0	1	0

The columns "C" and "D" remain unchanged because only columns with either “object” or “category” data type will be converted.

We also get two "A_" columns and three "B_" columns. That’s because in the initial data frame there are only two unique values in column "A" and three unique values in column "B".

The ones and zeros in the resulting data frame are the dummy variables, just as in the examples above where we applied the get_dummies() function on lists.

For example, the "1" in the first row of the "A_a" column means that the first value from the "A" column in the initial data frame is the character "a".
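
The example above only contains string columns and integer columns. As a hedged illustration of the "category" case, the sketch below converts the integer column "C" to the category dtype first, after which get_dummies() encodes it as well. The copy `df2` and the variable `encoded` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# Work on a copy so the original data frame stays untouched.
df2 = df.copy()
df2['C'] = df2['C'].astype('category')

# "C" is now categorical and gets encoded; "D" is still numeric and is kept.
encoded = pd.get_dummies(df2, dtype=int)
print(encoded.columns.tolist())
# ['D', 'A_a', 'A_b', 'B_a', 'B_b', 'B_c', 'C_1', 'C_2', 'C_3']
```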

The “columns” parameter

Especially in large data frames, we might want to convert only specific columns instead of every eligible column. For this purpose, we use the “columns” parameter and pass it the labels of the columns that we want to convert.

We use the data frame again that we created in the previous section:

	A	B	C	D
0	a	a	1	4
1	b	c	2	5
2	b	b	3	6

But now, when applying the get_dummies() function, we add the “columns” parameter and pass it a list containing only "B" to state that we want the dummy variables of this column alone:

pd.get_dummies(df, columns=['B'])

Result:

	A	C	D	B_a	B_b	B_c
0	a	1	4	1	0	0
1	b	2	5	0	0	1
2	b	3	6	0	1	0

The first three columns of the resulting data frame are the unchanged columns. They are the same as in the initial data frame.

The columns "A", "C", and "D" remain unchanged because we did not add them to our “columns” parameter’s list. When “columns” is set, only the listed columns are encoded, regardless of their data type.

The last three columns in the resulting data frame are the encoded variables from column "B".

By default, the columns parameter is set to None. This way, all columns with either “object” or “category” data type will be converted. We saw that in the previous examples where we did not set the columns parameter.
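
One consequence worth sketching: because an explicit “columns” list takes precedence over the data type check, it can also be used to encode a numeric column. The example below, with the illustrative variable name `encoded_c`, encodes the integer column "C" and leaves the string columns alone.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# Only "C" is listed, so only "C" is encoded, even though it holds integers.
encoded_c = pd.get_dummies(df, columns=['C'], dtype=int)
print(encoded_c.columns.tolist())
# ['A', 'B', 'D', 'C_1', 'C_2', 'C_3']
```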

Changing the Prefixes

We can change the prefixes for our new columns in the resulting data frames by adding the prefix parameter.

Again, we use the data frame df for this purpose:

	A	B	C	D
0	a	a	1	4
1	b	c	2	5
2	b	b	3	6

Now, we apply get_dummies() to this data frame and add the prefix parameter, passing it a list with the prefix labels for the converted columns.

This list must have the same length as the number of columns that get converted:

pd.get_dummies(df, prefix=['column1', 'column2'])

Result:

	C	D	column1_a	column1_b	column2_a	column2_b	column2_c
0	1	4	1	0	1	0	0
1	2	5	0	1	0	0	1
2	3	6	0	1	0	1	0

Since two columns get encoded ("A" and "B"), we pass two prefixes to the prefix parameter, "column1" and "column2".

The resulting data frame shows the new prefixes for the encoded columns.
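
As the parameter table mentions, the prefix can also be given as a dictionary that maps column names to prefixes, which avoids relying on the order of a list. A minimal sketch, where the prefixes "first" and "second" and the variable `renamed` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# Map each encoded column to its own prefix.
renamed = pd.get_dummies(df, prefix={'A': 'first', 'B': 'second'}, dtype=int)
print(renamed.columns.tolist())
# ['C', 'D', 'first_a', 'first_b', 'second_a', 'second_b', 'second_c']
```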

If we want to, we can also change the prefix separator by adding the prefix_sep parameter:

pd.get_dummies(df, prefix=['column1', 'column2'], prefix_sep=':')

Result:

	C	D	column1:a	column1:b	column2:a	column2:b	column2:c
0	1	4	1	0	1	0	0
1	2	5	0	1	0	0	1
2	3	6	0	1	0	1	0

We perform the same get_dummies() operation as before, but we add the prefix_sep parameter and set it to ":".

By default, the separator is "_", but we can change it to whatever we want.
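
One possible reason to pick a custom separator is to make the dummy column names easy to split back into the original column and the category value. The sketch below assumes that ":" does not occur in any of the original column names; `dummy_cols` and `pairs` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'], 'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

encoded = pd.get_dummies(df, prefix_sep=':', dtype=int)

# Dummy columns are the ones containing the separator
# (assumption: no original column name contains ":").
dummy_cols = [col for col in encoded.columns if ':' in col]

# Split each name into (original column, category value).
pairs = [tuple(col.split(':', 1)) for col in dummy_cols]
print(pairs)
# [('A', 'a'), ('A', 'b'), ('B', 'a'), ('B', 'b'), ('B', 'c')]
```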

Summary

In this tutorial, we explored the Pandas function get_dummies().

We covered the basic functionality of this function, how it handles NaN values, how to apply it to lists as well as data frames, how to encode only specific columns, and how to set different prefixes.


Happy Coding!