The Pandas groupby() Method - Be on the Right Side of Change

In this tutorial, we will see what the Pandas groupby() method is and how we can use it on our datasets. Described in one sentence, the groupby() method is used to group our data and execute a function on the determined groups. It is especially useful to group a large amount of data and to perform operations on these groups.

An Introductory Example

To get a better understanding of the groupby() method, let’s have a look at a simple example:

import pandas as pd

data = {'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea', 
                    'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
        'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
        'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa', 
                      'North America', 'Asia', 'Asia', 'North America']
} # population in million

df = pd.DataFrame(data)
df

Here’s the output:

	country	population	continent
0	Canada	37.59	North America
1	South Africa	58.56	Africa
2	Tanzania	58.01	Africa
3	Papua New Guinea	8.78	Asia
4	Namibia	2.49	Africa
5	Mexico	127.60	North America
6	India	1366.00	Asia
7	Malaysia	31.95	Asia
8	USA	328.20	North America

First, we import the necessary libraries which is only Pandas in this case. Then we paste in the data and assign it to the variable “data”. Next, we create a Pandas DataFrame from the data and assign it to the variable “df”. Finally, we output “df”.

This DataFrame shows some countries, the countries’ respective populations, and the continent the countries belong to. To calculate the overall mean population, for example, we would do this:

df.population.mean()
# 224.35333333333335

This line calculates the mean population for all countries in the DataFrame. But what if we wanted to get the mean population per continent? This is where the groupby() method comes into play. Applying this method looks like this:

df.groupby(['continent']).mean()

The output is this DataFrame:

continent	population
Africa	39.686.667
Asia	468.910.000
North America	164.463.333

Here, we group the DataFrame by the “continent” column and calculate the mean values per continent for every numeric column. Since the population column is the only column with a numeric datatype, the output shows a DataFrame with the unique continents in the left column and their related mean populations in the right column. For example, the mean population for Africa was calculated from the mean population of all African countries from the DataFrame (South Africa, Tanzania, Namibia).

If the DataFrame contained more numeric columns, but we only wanted to use one numeric column for the calculation of the mean (in this instance: the “population” column), we could write:

df.groupby(['continent'])['population'].mean()

Here’s the output of this code snippet:

continent
Africa            39.686667
Asia             468.910000
North America    164.463333
Name: population, dtype: float64

This output contains the same information as before, it just adds the “population” column’s data type.

Methods to Execute on the Groups

The mean() method is only one example of a function that can be executed on a group. One more example is the sum() method:

df.groupby(['continent']).sum()

continent	population
Africa	119.06
Asia	1406.73
North America	493.39

Here, the only difference to the example before is that we use the sum() method instead of the mean() method at the end of the line. So, we group the data by continent and calculate the sum of each continent’s population. Similarly, there are tons of other methods we can apply to our groups.

Some popular functions are the max() function which computes the maximum value of each group.
The opposite to that is the min() function which, as the name suggests, calculates each group’s minimum value.
The median() function determines the median of each group.

The possibilities are nearly unlimited.

An elegant way to compute some descriptive statistics on our groups with a very low amount of code is to use the describe() method:

df.groupby(['continent']).describe()

Here’s the resulting DataFrame:

	population
	count	mean	std	min	25%	50%	75%	max
continent
Africa	3.0	39.686.667	32.214.432	2.49	30.250	58.01	58.285	58.56
Asia	3.0	468.910.000	776.989.101	8.78	20.365	31.95	698.975	1366.00
North America	3.0	164.463.333	148.770.703	37.59	82.595	127.60	227.900	328.20

This method provides us with a lot of information about our groups. It counts the values (in this case how many countries are assigned to each continent), computes the mean, the standard deviation, the minimum, and maximum values, as well as the 25th, 50th, and 75th percentile. This is very useful to get a statistical overview of our groups.

Computing Multiple Methods with agg()

As we’ve seen before, the describe() method computes multiple functions on our groups. However, when using the describe() method, we are not able to choose which methods to use. To achieve that, we use the agg() method. Let’s have a look at another code example with another dataset:

import pandas as pd

data = {
    'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues', 'Reds', 
'Reds', 'Reds', 'Reds', 'Reds'],
    'Position': ['Non Forward', 'Forward', 'Non Forward', 
'Non Forward', 'Forward', 'Non Forward', 'Forward', 
'Non Forward', 'Forward', 'Forward'],
    'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24],
    'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 
2.14]
}

df = pd.DataFrame(data)

df

Here’s the output:

	Team	Position	Age	Height
0	Blues	Non Forward	23	1.98
1	Blues	Forward	19	2.12
2	Blues	Non Forward	31	1.97
3	Blues	Non Forward	25	2.01
4	Blues	Forward	27	2.21
5	Reds	Non Forward	18	1.99
6	Reds	Forward	41	2.05
7	Reds	Non Forward	28	2.01
8	Reds	Forward	23	2.12
9	Reds	Forward	24	2.14

First, we import the Pandas library. Then we assign the data as a dictionary of lists to a variable called “data”. After that, we create a Pandas DataFrame from the data and assign it to a variable called “df”. Finally, we output the DataFrame. The DataFrame is made of two imaginary basketball teams and it contains the player’s team, if they play in the forward position or not, their age, and their height.

Afterward, we make use of the agg() method:

df.groupby('Team').agg(['median', 'mean', 'std'])

This results in the following DataFrame:

	Age			Height
	median	mean	std	median	mean	std
Team
Blues	25.0	25.0	4.472.136	02.01	2.058	0.103779
Reds	24.0	26.8	8.700.575	02.05	2.062	0.066106

We group the DataFrame by the ‘Team’ column and aggregate the median(), mean(), and std() method to perform them on the groups. The output shows the median, mean, and standard deviation of the player’s age and height respectively for the ‘Blues’ and the ‘Reds’ team. So essentially, the agg() method collects one or more methods and performs them on a group.

In some use cases, we might want to perform different aggregations for different columns on our groups. That approach looks like this:

df.groupby('Team').agg({'Age': ['mean', 'median'], 'Height': 'std'})

The output:

	Age		Height
	mean	median	std
Team
Blues	25.0	25.0	0.103779
Reds	26. Aug	24.0	0.066106

This time, we pass the agg() method a dictionary. The dictionary’s keys contain the column’s names and the values contain the methods that we want to compute on the groups as a list of strings or just a string if only one method is performed on a group.

As you can see, combining the groupby() method with the agg() method is extremely useful. This way, we can perform multiple methods on a group and even individualize the methods for different columns with just one line of code.

Grouping by Multiple Columns

By now, we have learned how we can group a DataFrame by one column and perform methods on this group. However, it is possible and in a lot of use cases vastly convenient to group the DataFrame by multiple columns. To see how that works, let’s refer to the previous example with the basketball teams:

	Team	Position	Age	Height
0	Blues	Non Forward	23	1.98
1	Blues	Forward	19	2.12
2	Blues	Non Forward	31	1.97
3	Blues	Non Forward	25	2.01
4	Blues	Forward	27	2.21
5	Reds	Non Forward	18	1.99
6	Reds	Forward	41	2.05
7	Reds	Non Forward	28	2.01
8	Reds	Forward	23	2.12
9	Reds	Forward	24	2.14

We now apply the groupby() method for multiple columns:

df.groupby(['Team', 'Position']).mean()

The output is the following DataFrame:

		Age	Height
Team	Position
Blues	Forward	23.000	2.165
Non Forward	26.333	1.986
Reds	Forward	29.333	2.103
Non Forward	23.000	2.000

Here, we pass in a list of columns into the groupby() method to determine by which columns we want to group the DataFrame. In this case, we pass in the “Team” and the “Position” column. The mean() at the end of the line means we want to compute the mean value for every numeric column of the DataFrame grouped by the team and position. The first line for example says that a forward from the Blues team is on average 23 years old and 2.165 m tall.

As you can see, grouping by multiple columns serves a useful purpose. Thus, we can compare our data even further and get even more information out of it. Grouping by just one column allows us to only compare the teams or only the positions among each other. Whereas grouping by multiple columns allows us to compare one team’s positions among each other.

Summary

The groupby() method is exceedingly powerful. It allows us to group our data by one or more columns and compute all sorts of functions on these groups. This way, we can compare the groups very smoothly and get a nice overview of our data. All this, with a small amount of code.

If you want to extend your knowledge of this mighty Pandas tool, I recommend you read the official documentation. If you wish to learn more about Pandas, other Python libraries, basic Python, or other computer science topics, you’ll find more tutorials and interesting articles on the Finxter Blog page ***10 Minutes to Pandas***.

Happy Coding!