In this tutorial, we will learn all about the Pandas function get_dummies(). This function converts categorical data into dummy or indicator variables.
Here are the parameters from the official documentation:
Parameter | Type | Description |
data | array-like, Series, or DataFrame | Data of which to get the dummy indicators. |
prefix | str, list of str, or dict of str, default None | String to append to DataFrame column names. Pass a list with a length equal to the number of columns when calling get_dummies() on a DataFrame. Alternatively, prefix can be a dictionary mapping the column names to the prefixes. |
prefix_sep | str, default '_' | Separator/delimiter to use if a prefix is appended. Or pass a list or dictionary as with prefix. |
dummy_na | bool, default False | Add a column to indicate the NaN values. If False, NaN values are ignored. |
columns | list-like, default None | Column names in the DataFrame to be encoded. If columns is None, all the columns with object or category dtype will be converted. |
sparse | bool, default False | Whether the dummy-encoded columns should be backed by a SparseArray (True) or by a regular NumPy array (False). |
drop_first | bool, default False | Whether to get k-1 dummies out of k categorical levels by removing the first level. |
dtype | dtype, default np.uint8 (bool since pandas 2.0) | Data type for the new columns. Only a single dtype is allowed. |
Return type | Description |
DataFrame | Dummy-coded data |
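The drop_first and dtype parameters from the table are easiest to understand with a small example. The following is a minimal sketch; the list colors is hypothetical sample data, and dtype=int is passed so that the cells are 0/1 integers regardless of the installed pandas version.

```python
import pandas as pd

# Hypothetical sample data with k = 3 distinct categories.
colors = ["red", "green", "blue", "green"]

# dtype=int yields 0/1 integer columns instead of the version-dependent default.
full = pd.get_dummies(colors, dtype=int)

# drop_first=True removes the first level (here "blue", the first in sorted
# order), leaving k-1 = 2 dummy columns.
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

A row of all zeros in the reduced frame then stands for the dropped level.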
The Basic Functionality of get_dummies()
We will start with a simple example to understand how the get_dummies() function works and where we can apply it:
import pandas as pd
x = ['a', 'b', 'c', 'a', 'c']
pd.get_dummies(x)
Output:
  | a | b | c |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
First, we import the Pandas library to be able to use the function.
Second, we create a simple Python list that contains several characters and assign this list to the variable x.
Third, we call the get_dummies() function and pass the list x as the argument.
The output is a Pandas data frame.
The data frame consists of the columns "a", "b", and "c" and the rows 0, 1, 2, 3, and 4. The cell entries are either 0 or 1. (Since pandas 2.0, the default dtype is bool, so the cells are displayed as False and True unless we pass dtype=int.)
So, what exactly is happening here?
The column labels "a", "b", and "c" are the unique characters from the list that we passed in (['a', 'b', 'c', 'a', 'c']).
The number of rows in the data frame equals the length of the list. There are five rows and five characters. The ones and zeros in the data frame are the actual dummy variables.
When we look at the first entry (column "a", row 0), we observe that this value is 1. That means that the first entry of the list is the character "a", because the 1 sits in row 0 (remember: Python starts counting at 0) and in column "a".
Another example is the data frame entry in row 2 and column "c": this entry is also 1 because the list has a "c" in third place.
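To double-check this reading of the table, we can verify that every row contains exactly one 1 and that the position of that 1 recovers the original list. This is a minimal sketch; dtype=int is an assumption made here so the cells are integers on every pandas version.

```python
import pandas as pd

x = ['a', 'b', 'c', 'a', 'c']
dummies = pd.get_dummies(x, dtype=int)

# Each row sums to 1 because every list entry belongs to exactly one category.
row_sums = dummies.sum(axis=1).tolist()

# idxmax(axis=1) returns, per row, the label of the column holding the 1.
recovered = dummies.idxmax(axis=1).tolist()

print(row_sums)
print(recovered)
```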
Handling NaN Values
In this section, we will find out how the get_dummies() function handles NaN values.
To do so, we create another Python list. This list contains the same values as the one from the first example, except that the last character is replaced with a NaN value:
import numpy as np
y = ['a', 'b', 'c', 'a', np.nan]
pd.get_dummies(y)
Output:
  | a | b | c |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 0 |
The new list is assigned to the variable y.
As we can see, the list contains the unique values "a", "b", "c", and np.nan. The latter is a NaN value that we created using the NumPy library, which is why we had to import that library here.
The get_dummies() function creates a data frame just like in the first example.
Again, we get three columns "a", "b", and "c" and five rows. The only difference compared to the first example is the last row. Here, we have zeros exclusively. That's because the last value from the list is a NaN value, which we can't assign to "a", "b", or "c".
However, we can make the NaN value visible in the resulting data frame by using the dummy_na parameter:
pd.get_dummies(y, dummy_na=True)
Output:
  | a | b | c | NaN |
0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 1 |
We set this parameter to True. That way, we add another column with the label NaN.
In the resulting data frame, the last row's NaN entry is now 1 because of the NaN value in the list.
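The two behaviours can be compared side by side. The sketch below assumes dtype=int for version-independent 0/1 cells and selects the NaN column by position, since its label is itself a NaN value.

```python
import numpy as np
import pandas as pd

y = ['a', 'b', 'c', 'a', np.nan]

# Default: the NaN entry becomes a row of zeros.
without_na = pd.get_dummies(y, dtype=int)

# dummy_na=True: an extra column is appended that marks the NaN entry.
with_na = pd.get_dummies(y, dummy_na=True, dtype=int)

last_row_default = without_na.iloc[-1].tolist()
nan_column = with_na.iloc[:, -1].tolist()  # the appended NaN column

print(last_row_default)
print(nan_column)
```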
Apply get_dummies() to a DataFrame
By now, we have seen how to apply the get_dummies() function to lists.
However, we can also apply this function to DataFrames. So, let’s create a simple data frame:
df = pd.DataFrame({'A': ['a', 'b', 'b'],
                   'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})
print(df)
  | A | B | C | D |
0 | a | a | 1 | 4 |
1 | b | c | 2 | 5 |
2 | b | b | 3 | 6 |
We get four columns "A", "B", "C", and "D" and three rows 0, 1, and 2. The columns "A" and "B" contain characters, whereas columns "C" and "D" contain integer values.
Now, we apply get_dummies() to this DataFrame:
pd.get_dummies(df)
Result:
  | C | D | A_a | A_b | B_a | B_b | B_c |
0 | 1 | 4 | 1 | 0 | 1 | 0 | 0 |
1 | 2 | 5 | 0 | 1 | 0 | 0 | 1 |
2 | 3 | 6 | 0 | 1 | 0 | 1 | 0 |
The columns "C" and "D" remain unchanged because only columns with either "object" or "category" data type will be converted.
We also get two "A_" columns and three "B_" columns. That's because in the initial data frame there are only two unique values in column "A" and three unique values in column "B".
The ones and zeros in the resulting data frame are the dummy variables, just as in the examples above where we applied the get_dummies() function to lists.
For example, the 1 in the first row of the "A_a" column means that the first value from the "A" column in the initial data frame is the character "a".
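The following minimal sketch reproduces this result and checks the column layout programmatically. It assumes dtype=int so that the dummy cells are 0/1 integers on every pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'],
                   'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

encoded = pd.get_dummies(df, dtype=int)

# Numeric columns come first and are untouched; the dummy columns follow,
# named <original column><separator><value>.
print(list(encoded.columns))

# The 1 in row 0 of "A_a" reflects that df["A"][0] is "a".
first_a = int(encoded.loc[0, 'A_a'])
print(first_a)
```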
The “columns” Parameter
Especially in large data frames, we might only want to convert specific columns instead of every eligible column. To do that, we use the columns parameter, to which we assign the labels of the columns that we want to convert.
We use the data frame again that we created in the previous section:
  | A | B | C | D |
0 | a | a | 1 | 4 |
1 | b | c | 2 | 5 |
2 | b | b | 3 | 6 |
But now, when applying the get_dummies() function, we add the columns parameter and assign it a list containing the single entry "B" to state that we only want the dummy variables of this column:
pd.get_dummies(df, columns=['B'])
Result:
  | A | C | D | B_a | B_b | B_c |
0 | a | 1 | 4 | 1 | 0 | 0 |
1 | b | 2 | 5 | 0 | 0 | 1 |
2 | b | 3 | 6 | 0 | 1 | 0 |
The first three columns of the resulting data frame are the unchanged columns. They are the same as in the initial data frame.
The columns "A", "C", and "D" remain unchanged because we did not add them to the list we passed to the columns parameter. Once columns is set, only the listed columns are encoded, regardless of their data type.
The last three columns in the resulting data frame are the encoded variables from column "B".
By default, the columns parameter is set to None. This way, all columns with either "object" or "category" data type will be converted. We saw that in the previous examples where we did not set the columns parameter.
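As a minimal sketch of both cases, the code below encodes only column "B" and then, as a second illustration, passes the integer column "C" explicitly. The second call shows that a column named in columns is encoded even when it is numeric. dtype=int is assumed for version-independent 0/1 cells.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'],
                   'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

# Only "B" is encoded; "A" stays as a plain text column.
only_b = pd.get_dummies(df, columns=['B'], dtype=int)
print(list(only_b.columns))

# A numeric column is encoded too when it is listed explicitly.
only_c = pd.get_dummies(df, columns=['C'], dtype=int)
print(list(only_c.columns))
```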
Changing the Prefixes
We can change the prefixes for our new columns in the resulting data frames by adding the prefix parameter.
Again, we use the data frame df for this purpose:
  | A | B | C | D |
0 | a | a | 1 | 4 |
1 | b | c | 2 | 5 |
2 | b | b | 3 | 6 |
Now, we perform the get_dummies() operation on this data frame and add the prefix parameter, to which we assign a list with the prefix labels for the converted columns.
This list must have the same length as the number of columns that get converted:
pd.get_dummies(df, prefix=['column1', 'column2'])
  | C | D | column1_a | column1_b | column2_a | column2_b | column2_c |
0 | 1 | 4 | 1 | 0 | 1 | 0 | 0 |
1 | 2 | 5 | 0 | 1 | 0 | 0 | 1 |
2 | 3 | 6 | 0 | 1 | 0 | 1 | 0 |
Since two columns get encoded ("A" and "B"), we pass two prefixes to the prefix parameter, "column1" and "column2".
The resulting data frame shows the new prefixes for the encoded columns.
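As the parameter table notes, prefix also accepts a dictionary that maps column names to prefixes, which avoids relying on the order of a list. A minimal sketch follows; the prefix names "first" and "second" are arbitrary choices for illustration, and dtype=int is assumed for 0/1 cells.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'],
                   'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

# Map each encoded column to its own prefix (names chosen for illustration).
renamed = pd.get_dummies(df, prefix={'A': 'first', 'B': 'second'}, dtype=int)

print(list(renamed.columns))
```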
If we want to, we can also change the prefix separator by adding the prefix_sep parameter:
pd.get_dummies(df, prefix=['column1', 'column2'], prefix_sep=':')
Result:
  | C | D | column1:a | column1:b | column2:a | column2:b | column2:c |
0 | 1 | 4 | 1 | 0 | 1 | 0 | 0 |
1 | 2 | 5 | 0 | 1 | 0 | 0 | 1 |
2 | 3 | 6 | 0 | 1 | 0 | 1 | 0 |
We perform the same get_dummies() operation as before, but we add the prefix_sep parameter and set it to ":".
By default, the separator is "_", but we can change it to whatever we want.
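One possible reason to pick a custom separator is to make the dummy column names easy to split back into the original column and value, especially when column names already contain underscores. The sketch below illustrates that idea and is not the only way to do it; dtype=int is assumed for 0/1 cells.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b'],
                   'B': ['a', 'c', 'b'],
                   'C': [1, 2, 3],
                   'D': [4, 5, 6]})

encoded = pd.get_dummies(df, prefix_sep=':', dtype=int)

# Dummy columns are the ones containing the separator; split each name
# once into (original column, category value).
pairs = [tuple(name.split(':', 1)) for name in encoded.columns if ':' in name]

print(pairs)
```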
Summary
In this tutorial, we covered the Pandas function get_dummies().
We learned the basic functionality of this function, how it handles NaN values, how to apply it to lists as well as data frames, how to encode only specific columns, and how to set different prefixes.
Happy Coding!

Hi! I’m Luis, an Information Systems student and freelance writer and programmer from Germany. I love coding and creating educational content about computer science. For the articles I’m writing, I combine the knowledge I gained at the university with the insights I get from constantly reading and learning about new technologies. Making education more accessible for everyone is my passion and I hope you like the content I’m creating!