In this tutorial, we will learn to use the Pandas function pivot_table()
. This function is used to create a pivot table as a data frame. It allows for lots of customization possibilities to provide informative insights into our data.
Syntax Table
Here are the parameters from the official documentation:
Parameter | Type | Description |
data | DataFrame | |
values | column to aggregate, optional | |
index | column, Grouper, array, or list of the previous | – If array, must be the same length as data . It’s used the same way as column values.– The list can contain any of the other types but no list. – Keys to group by on the pivot table index. |
columns | column, Grouper, array, or list of the previous | – If array, must be the same length as the data. It’s used the same way as column values. – The list can contain any of the other types but no list. – Keys to group by on the pivot table column. |
aggfunc | function, list of functions, dict, default numpy.mean | – If list of functions, the resulting pivot table will have hierarchical columns whose top level are the function names. Those are inferred from the function objects themselves. – If dictionary, the key is the column to aggregate and the value is function or list of functions. |
fill_value | scalar, default None | Value to replace missing values with in the resulting pivot table, after aggregation. |
margins | bool, default False | Add all row/columns. Example: subtotal/grand totals. |
dropna | bool, default True | Do not include columns whose entries are all NaN. |
margins_name | str, default 'All' | Name of the row/column that will contain the totals when margins is True . |
observed | bool, default False | This only applies if any of the groupers are Categoricals. – If True : only show observed values for categorical groupers. – If False : show all values for categorical groupers. |
sort | bool, default True | Sort the result? |
Returns | Type | Description |
DataFrame | An Excel-style pivot table. |
Basic Example
To get started, we will have a look at an introductory example. We create a data frame that we will be using throughout the rest of the article:
import pandas as pd import numpy as np df = pd.DataFrame( {'Team': ['Reds', 'Reds', 'Reds', 'Reds', 'Reds', 'Reds', 'Reds', 'Blues', 'Blues', 'Blues', 'Blues', 'Blues', 'Blues', 'Blues'], 'Position': ['Pitcher', 'Pitcher', 'Pitcher', 'Center Field', 'Center Field', 'Center Field', 'Center Field', 'Pitcher', 'Pitcher', 'Pitcher', 'Center Field', 'Center Field', 'Center Field', 'Center Field'], 'Throw Arm': ['Left', 'Right', 'Right', 'Left', 'Left', 'Right', 'Left', 'Left', 'Right', 'Right', 'Right', 'Left', 'Left', 'Left',], 'Throwing Speed': [80, 75, 69, 71, 68, 59, 80, 81, 74, 68, 70, 79, 76, 73], 'Running Speed': [19, 20, 18, 17, 19, 20, 16, 16, 19, 20, 19, 20, 21, 17]}) print(df)
The resulting DataFrame:
Team | Position | Throw Arm | Throwing Speed | Running Speed | |
0 | Reds | Pitcher | Left | 80 | 19 |
1 | Reds | Pitcher | Right | 75 | 20 |
2 | Reds | Pitcher | Right | 69 | 18 |
3 | Reds | Center Field | Left | 71 | 17 |
4 | Reds | Center Field | Left | 68 | 19 |
5 | Reds | Center Field | Right | 59 | 20 |
6 | Reds | Center Field | Left | 80 | 16 |
7 | Blues | Pitcher | Left | 81 | 16 |
8 | Blues | Pitcher | Right | 74 | 19 |
9 | Blues | Pitcher | Right | 68 | 20 |
10 | Blues | Center Field | Right | 70 | 19 |
11 | Blues | Center Field | Left | 79 | 20 |
12 | Blues | Center Field | Left | 76 | 21 |
13 | Blues | Center Field | Left | 73 | 17 |
First, we import the necessary libraries, namely Pandas and Numpy. Then, we create a Pandas data frame and assign it to a variable called “df
“. Finally, we output the data frame.
The data frame contains information about players of two Baseball teams. For each player, we get the player’s team, position, throw arm, throw speed, and running speed.
Now, we apply the pivot_table()
function:
pd.pivot_table(df, values='Throwing Speed', index='Team', aggfunc=np.mean)
Output:
Inside the function, we pass in the data frame as the first argument.
The second argument is the “values
” parameter. This parameter expects the column or columns that we want to aggregate, and we assign it the column “Throwing Speed”.
The next parameter is the “index
” parameter which we assign the column to be the index of the new data frame we create.
The final parameter here is the “aggfunc
” parameter. This parameter expects the function or functions that we want to apply on the aggregated column. We set it equal to “np.mean
” which is a Numpy function for calculating the mean value. We would not have to apply the “aggfunc
” parameter to calculate the mean since the default value for this parameter is “np.mean
“. But we do it anyway to make it easier to understand what is happening here.
The output shows a new data frame with the “Team” column as the index. For each team, we get the average throwing speed because the “Throwing Speed” column is the aggregated column and we assigned the “aggfunc
” parameter the np.mean()
function.
Apply Multiple Indexes
In the previous example, we saw how to calculate the average throwing speed of each team.
Now, we will go more into detail and calculate the average throwing speed for each position of each team. We achieve that by assigning the “index
” parameter multiple columns:
pd.pivot_table(df, values='Throwing Speed', index=['Team', 'Position'], aggfunc=np.mean)
Output:
We keep the rest of the function as it was before, but this time, we assign the “index
” parameter a list of columns, namely the “Team” and the “Position” column.
This way, we created a multi-index data frame.
Add Columns to Group by
When we have a look at the initial data frame, we observe that we haven’t used the “Throw Arm” column in any way yet.
We could add it to the “index
” parameter which creates this new data frame:
pd.pivot_table(df, values='Throwing Speed', index=['Team', 'Position', 'Throw Arm'], aggfunc=np.mean)
Here’s the output table:
This way, we get the average throwing speed per throw arm per position per team. However, it looks a bit confusing here as there are too many index columns.
An alternative way to apply the “Throw Arm” column within our pivot_table()
function, is to assign it to the “columns
” parameter:
pd.pivot_table(df, values='Throwing Speed', index=['Team', 'Position'], columns = ['Throw Arm'], aggfunc=np.mean)
Here’s the output:
When we compare the two pivot tables here, we see that they contain the same information. The first column, as well as the second column, both provide information about the average throwing speed for each throw arm of each position of each team. Nevertheless, the second approach is way clearer as we have half as many rows here.
Thus, combining the “index
” parameter with the “columns
” parameter allows us to crave lots of information out of our data in a comprehensible way.
Other Aggregate Functions
By now, we only applied the np.mean()
function to our pivot table. Of course, there are multiple other functions to make use of. One example is the np.max()
function:
pd.pivot_table(df, values='Throwing Speed', index=['Team', 'Position'], columns = ['Throw Arm'], aggfunc=np.max)
The output:
So, here we calculate the maximum value of the throwing speed of each throwing arm, position, and team. For example, the maximum value for throwing speed of the left-handed, Center Field positioned Blue’s Team players is 79.
This is just one example. There are all sorts of other functions to apply here.
Apply Multiple Functions
The “aggfunc
” parameter allows us to perform multiple functions at once. So, the resulting data frame provides us with even more information about our data.
But to achieve that, we first have to assign the “values
” parameter multiple columns to execute different functions on different columns:
pd.pivot_table(df, values=['Throwing Speed', 'Running Speed'], index=['Team', 'Position'], columns = ['Throw Arm'], aggfunc={'Throwing Speed': np.mean, 'Running Speed': np.median})
Output:
We now assign the “values
” parameter the “Throwing Speed” column, as well as the “Running Speed” column. The “aggfunc
” parameter gets assigned a dictionary. The dictionary’s keys are the column names to perform the function on and the dictionary’s values are the functions. We calculate the mean for the “Throwing Speed” and the median for the “Running Speed”.
We can extend that even further by deploying multiple functions to a column. We do that by applying a list of functions within the dictionary:
pd.pivot_table(df, values=['Throwing Speed', 'Running Speed'], index=['Team', 'Position'], columns = ['Throw Arm'], aggfunc={'Throwing Speed': np.mean, 'Running Speed': [np.min, np.max, np.median]})
Output:
We assign the “Running Speed” column in our “aggfunc
” parameter a list of functions, namely the np.min()
function, the np.max()
function, and the np.median()
function. The outputted data frame shows the name of each function at the top.
Summary
All in all, the pivot_table()
function provides us with lots of opportunities to create a pivot table that puts out a lot of information about our data. We learned how to use the function’s various parameters to analyze our data in several ways. We saw how to create a pivot table, how to do so with multiple indexes, how to apply key columns to group by, and how to add multiple functions for aggregation.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!