In this tutorial, we will see what the Pandas groupby()
method is and how we can use it on our datasets. Described in one sentence, the groupby()
method is used to group our data and execute a function on the determined groups. It is especially useful to group a large amount of data and to perform operations on these groups.
An Introductory Example
To get a better understanding of the groupby()
method, let’s have a look at a simple example:
import pandas as pd data = {'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea', 'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'], 'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2], 'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa', 'North America', 'Asia', 'Asia', 'North America'] } # population in million df = pd.DataFrame(data) df
Here’s the output:
country | population | continent | |
0 | Canada | 37.59 | North America |
1 | South Africa | 58.56 | Africa |
2 | Tanzania | 58.01 | Africa |
3 | Papua New Guinea | 8.78 | Asia |
4 | Namibia | 2.49 | Africa |
5 | Mexico | 127.60 | North America |
6 | India | 1366.00 | Asia |
7 | Malaysia | 31.95 | Asia |
8 | USA | 328.20 | North America |
First, we import the necessary libraries which is only Pandas in this case. Then we paste in the data and assign it to the variable “data
”. Next, we create a Pandas DataFrame from the data and assign it to the variable “df
”. Finally, we output “df
”.
This DataFrame shows some countries, the countries’ respective populations, and the continent the countries belong to. To calculate the overall mean population, for example, we would do this:
df.population.mean() # 224.35333333333335
This line calculates the mean population for all countries in the DataFrame. But what if we wanted to get the mean population per continent? This is where the groupby()
method comes into play. Applying this method looks like this:
df.groupby(['continent']).mean()
The output is this DataFrame:
continent | population |
Africa | 39.686.667 |
Asia | 468.910.000 |
North America | 164.463.333 |
Here, we group the DataFrame by the “continent
” column and calculate the mean values per continent for every numeric column. Since the population column is the only column with a numeric datatype, the output shows a DataFrame with the unique continents in the left column and their related mean populations in the right column. For example, the mean population for Africa was calculated from the mean population of all African countries from the DataFrame (South Africa, Tanzania, Namibia).
If the DataFrame contained more numeric columns, but we only wanted to use one numeric column for the calculation of the mean (in this instance: the “population
” column), we could write:
df.groupby(['continent'])['population'].mean()
Here’s the output of this code snippet:
continent Africa 39.686667 Asia 468.910000 North America 164.463333 Name: population, dtype: float64
This output contains the same information as before, it just adds the “population
” column’s data type.
Methods to Execute on the Groups
The mean()
method is only one example of a function that can be executed on a group. One more example is the sum()
method:
df.groupby(['continent']).sum()
continent | population |
Africa | 119.06 |
Asia | 1406.73 |
North America | 493.39 |
Here, the only difference to the example before is that we use the sum()
method instead of the mean()
method at the end of the line. So, we group the data by continent and calculate the sum of each continent’s population. Similarly, there are tons of other methods we can apply to our groups.
- Some popular functions are the
max()
function which computes the maximum value of each group. - The opposite to that is the
min()
function which, as the name suggests, calculates each group’s minimum value. - The
median()
function determines the median of each group.
The possibilities are nearly unlimited.
An elegant way to compute some descriptive statistics on our groups with a very low amount of code is to use the describe()
method:
df.groupby(['continent']).describe()
Here’s the resulting DataFrame:
population | ||||||||
count | mean | std | min | 25% | 50% | 75% | max | |
continent | ||||||||
Africa | 3.0 | 39.686.667 | 32.214.432 | 2.49 | 30.250 | 58.01 | 58.285 | 58.56 |
Asia | 3.0 | 468.910.000 | 776.989.101 | 8.78 | 20.365 | 31.95 | 698.975 | 1366.00 |
North America | 3.0 | 164.463.333 | 148.770.703 | 37.59 | 82.595 | 127.60 | 227.900 | 328.20 |
This method provides us with a lot of information about our groups. It counts the values (in this case how many countries are assigned to each continent), computes the mean, the standard deviation, the minimum, and maximum values, as well as the 25th, 50th, and 75th percentile. This is very useful to get a statistical overview of our groups.
Computing Multiple Methods with agg()
As we’ve seen before, the describe()
method computes multiple functions on our groups. However, when using the describe()
method, we are not able to choose which methods to use. To achieve that, we use the agg()
method. Let’s have a look at another code example with another dataset:
import pandas as pd data = { 'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues', 'Reds', 'Reds', 'Reds', 'Reds', 'Reds'], 'Position': ['Non Forward', 'Forward', 'Non Forward', 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Forward'], 'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24], 'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 2.14] } df = pd.DataFrame(data) df
Here’s the output:
Team | Position | Age | Height | |
0 | Blues | Non Forward | 23 | 1.98 |
1 | Blues | Forward | 19 | 2.12 |
2 | Blues | Non Forward | 31 | 1.97 |
3 | Blues | Non Forward | 25 | 2.01 |
4 | Blues | Forward | 27 | 2.21 |
5 | Reds | Non Forward | 18 | 1.99 |
6 | Reds | Forward | 41 | 2.05 |
7 | Reds | Non Forward | 28 | 2.01 |
8 | Reds | Forward | 23 | 2.12 |
9 | Reds | Forward | 24 | 2.14 |
First, we import the Pandas library. Then we assign the data as a dictionary of lists to a variable called “data
”. After that, we create a Pandas DataFrame from the data and assign it to a variable called “df
”. Finally, we output the DataFrame. The DataFrame is made of two imaginary basketball teams and it contains the player’s team, if they play in the forward position or not, their age, and their height.
Afterward, we make use of the agg()
method:
df.groupby('Team').agg(['median', 'mean', 'std'])
This results in the following DataFrame:
Age | Height | |||||
median | mean | std | median | mean | std | |
Team | ||||||
Blues | 25.0 | 25.0 | 4.472.136 | 02.01 | 2.058 | 0.103779 |
Reds | 24.0 | 26.8 | 8.700.575 | 02.05 | 2.062 | 0.066106 |
We group the DataFrame by the ‘Team
’ column and aggregate the median()
, mean()
, and std()
method to perform them on the groups. The output shows the median, mean, and standard deviation of the player’s age and height respectively for the ‘Blues
’ and the ‘Reds
’ team. So essentially, the agg()
method collects one or more methods and performs them on a group.
In some use cases, we might want to perform different aggregations for different columns on our groups. That approach looks like this:
df.groupby('Team').agg({'Age': ['mean', 'median'], 'Height': 'std'})
The output:
Age | Height | ||
mean | median | std | |
Team | |||
Blues | 25.0 | 25.0 | 0.103779 |
Reds | 26. Aug | 24.0 | 0.066106 |
This time, we pass the agg()
method a dictionary. The dictionary’s keys contain the column’s names and the values contain the methods that we want to compute on the groups as a list of strings or just a string if only one method is performed on a group.
As you can see, combining the groupby()
method with the agg()
method is extremely useful. This way, we can perform multiple methods on a group and even individualize the methods for different columns with just one line of code.
Grouping by Multiple Columns
By now, we have learned how we can group a DataFrame by one column and perform methods on this group. However, it is possible and in a lot of use cases vastly convenient to group the DataFrame by multiple columns. To see how that works, let’s refer to the previous example with the basketball teams:
Team | Position | Age | Height | |
0 | Blues | Non Forward | 23 | 1.98 |
1 | Blues | Forward | 19 | 2.12 |
2 | Blues | Non Forward | 31 | 1.97 |
3 | Blues | Non Forward | 25 | 2.01 |
4 | Blues | Forward | 27 | 2.21 |
5 | Reds | Non Forward | 18 | 1.99 |
6 | Reds | Forward | 41 | 2.05 |
7 | Reds | Non Forward | 28 | 2.01 |
8 | Reds | Forward | 23 | 2.12 |
9 | Reds | Forward | 24 | 2.14 |
We now apply the groupby()
method for multiple columns:
df.groupby(['Team', 'Position']).mean()
The output is the following DataFrame:
Age | Height | ||
Team | Position | ||
Blues | Forward | 23.000 | 2.165 |
Non Forward | 26.333 | 1.986 | |
Reds | Forward | 29.333 | 2.103 |
Non Forward | 23.000 | 2.000 |
Here, we pass in a list of columns into the groupby()
method to determine by which columns we want to group the DataFrame. In this case, we pass in the “Team
” and the “Position
” column. The mean()
at the end of the line means we want to compute the mean value for every numeric column of the DataFrame grouped by the team and position. The first line for example says that a forward from the Blues team is on average 23 years old and 2.165 m tall.
As you can see, grouping by multiple columns serves a useful purpose. Thus, we can compare our data even further and get even more information out of it. Grouping by just one column allows us to only compare the teams or only the positions among each other. Whereas grouping by multiple columns allows us to compare one team’s positions among each other.
Summary
The groupby()
method is exceedingly powerful. It allows us to group our data by one or more columns and compute all sorts of functions on these groups. This way, we can compare the groups very smoothly and get a nice overview of our data. All this, with a small amount of code.
If you want to extend your knowledge of this mighty Pandas tool, I recommend you read the official documentation. If you wish to learn more about Pandas, other Python libraries, basic Python, or other computer science topics, you’ll find more tutorials and interesting articles on the Finxter Blog page ***10 Minutes to Pandas***.
Happy Coding!