The Pandas groupby() Method

In this tutorial, we will see what the Pandas groupby() method is and how we can use it on our datasets. Described in one sentence, the groupby() method is used to group our data and execute a function on the determined groups. It is especially useful to group a large amount of data and to perform operations on these groups.

An Introductory Example

To get a better understanding of the groupby() method, let’s have a look at a simple example:

import pandas as pd

data = {'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea', 
                    'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
        'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
        'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa', 
                      'North America', 'Asia', 'Asia', 'North America']
} # population in million

df = pd.DataFrame(data)
df

Here’s the output:

            country  population      continent
0            Canada       37.59  North America
1      South Africa       58.56         Africa
2          Tanzania       58.01         Africa
3  Papua New Guinea        8.78           Asia
4           Namibia        2.49         Africa
5            Mexico      127.60  North America
6             India     1366.00           Asia
7          Malaysia       31.95           Asia
8               USA      328.20  North America

First, we import the only library we need, which is Pandas. Then we define the data as a dictionary of lists and assign it to the variable “data”. Next, we create a Pandas DataFrame from the data and assign it to the variable “df”. Finally, we output “df”.

This DataFrame shows some countries, the countries’ respective populations, and the continent the countries belong to. To calculate the overall mean population, for example, we would do this:

df.population.mean()
# 224.35333333333335

This line calculates the mean population for all countries in the DataFrame. But what if we wanted to get the mean population per continent? This is where the groupby() method comes into play. Applying this method looks like this:

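# numeric_only=True restricts the mean to numeric columns; pandas 2.0 and
# later raise a TypeError for the text column 'country' without it
df.groupby(['continent']).mean(numeric_only=True)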

The output is this DataFrame:

               population
continent
Africa          39.686667
Asia           468.910000
North America  164.463333

Here, we group the DataFrame by the “continent” column and calculate the mean per continent for every numeric column. Since the population column is the only column with a numeric data type, the output shows a DataFrame with the unique continents in the left column and their mean populations in the right column. For example, the mean population for Africa was calculated from the populations of all African countries in the DataFrame (South Africa, Tanzania, Namibia).
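To make the grouping step more tangible, here is a minimal sketch that computes the same per-continent means twice: once by filtering the rows manually and once with groupby(). The variable names manual_means and grouped_means are only illustrative and do not appear in the original example.

```python
import pandas as pd

# Same data as in the tutorial (population in millions)
df = pd.DataFrame({
    'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea',
                'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
    'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
    'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa',
                  'North America', 'Asia', 'Asia', 'North America'],
})

# Manual version: filter the rows of each continent, then average that subset
manual_means = {}
for continent in df['continent'].unique():
    subset = df[df['continent'] == continent]
    manual_means[continent] = subset['population'].mean()

# groupby version: the same split-and-average in a single line
grouped_means = df.groupby('continent')['population'].mean()

# Print both results side by side for comparison
for continent, value in manual_means.items():
    print(continent, round(value, 6), round(grouped_means[continent], 6))
```

Both approaches should give the same numbers; groupby() simply saves us from writing the loop ourselves.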

If the DataFrame contained more numeric columns, but we only wanted to use one numeric column for the calculation of the mean (in this instance: the “population” column), we could write:

df.groupby(['continent'])['population'].mean()

Here’s the output of this code snippet:

continent
Africa            39.686667
Asia             468.910000
North America    164.463333
Name: population, dtype: float64

This output contains the same values as before, but it is a Pandas Series rather than a DataFrame, and it additionally shows the name and data type of the “population” column.
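The type of the result depends on how the column is selected. The following sketch, which assumes the same DataFrame as above, shows that single brackets yield a Series while double brackets keep a DataFrame. The names series_result and frame_result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea',
                'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
    'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
    'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa',
                  'North America', 'Asia', 'Asia', 'North America'],
})

# Single brackets select one column, so the aggregation returns a Series
series_result = df.groupby('continent')['population'].mean()

# Double brackets select a list of columns, so the result stays a DataFrame
frame_result = df.groupby('continent')[['population']].mean()

print(type(series_result).__name__)
print(type(frame_result).__name__)
```

The double-bracket form can be handy when later steps expect a DataFrame, for example when more columns are added to the selection.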

Methods to Execute on the Groups

The mean() method is only one example of a function that can be executed on a group. One more example is the sum() method:

df.groupby(['continent']).sum(numeric_only=True)

Here’s the output:

               population
continent
Africa             119.06
Asia              1406.73
North America      493.39

Here, the only difference from the previous example is that we use the sum() method instead of the mean() method at the end of the line. So, we group the data by continent and calculate the sum of each continent’s population. Similarly, there are many other methods we can apply to our groups:

  • The max() method computes the maximum value of each group.
  • Its counterpart, the min() method, calculates each group’s minimum value.
  • The median() method determines the median of each group.

Many other aggregation methods, such as count() and std(), work the same way.
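As a small illustration of the methods listed above, the following sketch applies min(), median(), and max() to the population data and collects the results in one table. The variable names (population_by_continent, overview) are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea',
                'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
    'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
    'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa',
                  'North America', 'Asia', 'Asia', 'North America'],
})

# Build the grouped object once and reuse it for several aggregations
population_by_continent = df.groupby('continent')['population']

smallest = population_by_continent.min()
middle = population_by_continent.median()
largest = population_by_continent.max()

# Combine the three Series into one DataFrame, aligned on the continent index
overview = pd.DataFrame({'min': smallest, 'median': middle, 'max': largest})
print(overview)
```

Storing the grouped object in a variable avoids repeating the groupby() call for every statistic.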

An elegant way to compute several descriptive statistics on our groups with very little code is to use the describe() method:

df.groupby(['continent']).describe()

Here’s the resulting DataFrame:

              population
                   count        mean         std    min     25%     50%      75%      max
continent
Africa               3.0   39.686667   32.214432   2.49  30.250   58.01   58.285    58.56
Asia                 3.0  468.910000  776.989101   8.78  20.365   31.95  698.975  1366.00
North America        3.0  164.463333  148.770703  37.59  82.595  127.60  227.900   328.20

This method provides us with a lot of information about our groups. It counts the values (in this case, how many countries are assigned to each continent) and computes the mean, the standard deviation, the minimum and maximum values, as well as the 25th, 50th, and 75th percentiles. This is very useful to get a statistical overview of our groups.
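Since describe() returns an ordinary DataFrame, individual statistics can be picked from it afterwards. Here is a minimal sketch, assuming we select the “population” column before calling describe() so that the resulting column labels are flat. The names stats and selected are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Canada', 'South Africa', 'Tanzania', 'Papua New Guinea',
                'Namibia', 'Mexico', 'India', 'Malaysia', 'USA'],
    'population': [37.59, 58.56, 58.01, 8.78, 2.49, 127.6, 1366, 31.95, 328.2],
    'continent': ['North America', 'Africa', 'Africa', 'Asia', 'Africa',
                  'North America', 'Asia', 'Asia', 'North America'],
})

# describe() on a single grouped column gives one row per group
# and one column per statistic
stats = df.groupby('continent')['population'].describe()

# Keep only the statistics we are interested in
selected = stats[['count', 'mean', 'max']]
print(selected)

# A single value can be read with .loc[row_label, column_label]
print(stats.loc['Asia', '50%'])
```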

Computing Multiple Methods with agg()

As we’ve seen before, the describe() method computes several statistics on our groups at once. However, when using the describe() method, we are not able to choose which statistics are computed. To do that, we use the agg() method. Let’s have a look at another code example with another dataset:

import pandas as pd

data = {
    'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues',
             'Reds', 'Reds', 'Reds', 'Reds', 'Reds'],
    'Position': ['Non Forward', 'Forward', 'Non Forward', 'Non Forward', 'Forward',
                 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Forward'],
    'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24],
    'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 2.14]  # in meters
}

df = pd.DataFrame(data)

df

Here’s the output:

    Team     Position  Age  Height
0  Blues  Non Forward   23    1.98
1  Blues      Forward   19    2.12
2  Blues  Non Forward   31    1.97
3  Blues  Non Forward   25    2.01
4  Blues      Forward   27    2.21
5   Reds  Non Forward   18    1.99
6   Reds      Forward   41    2.05
7   Reds  Non Forward   28    2.01
8   Reds      Forward   23    2.12
9   Reds      Forward   24    2.14

First, we import the Pandas library. Then we assign the data as a dictionary of lists to a variable called “data”. After that, we create a Pandas DataFrame from the data and assign it to a variable called “df”. Finally, we output the DataFrame. The DataFrame describes the players of two imaginary basketball teams. For each player, it contains the team, whether they play in the forward position or not, their age, and their height.

Afterward, we make use of the agg() method:

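# Select the numeric columns first; recent pandas versions raise a
# TypeError when 'median' or 'mean' is applied to the text column 'Position'
df.groupby('Team')[['Age', 'Height']].agg(['median', 'mean', 'std'])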

This results in the following DataFrame:

         Age                   Height
      median  mean       std  median   mean       std
Team
Blues   25.0  25.0  4.472136    2.01  2.058  0.103779
Reds    24.0  26.8  8.700575    2.05  2.062  0.066106

We group the DataFrame by the ‘Team’ column and pass the names of the median(), mean(), and std() methods to agg(), which applies each of them to the numeric “Age” and “Height” columns of every group. The output shows the median, mean, and standard deviation of the players’ age and height for the ‘Blues’ and the ‘Reds’ team respectively. So essentially, the agg() method collects one or more methods and performs them on each group.
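Besides method names given as strings, agg() also accepts our own functions. The sketch below mixes the built-in 'mean' with a hypothetical helper called value_range, which is not part of Pandas and was written just for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues',
             'Reds', 'Reds', 'Reds', 'Reds', 'Reds'],
    'Position': ['Non Forward', 'Forward', 'Non Forward', 'Non Forward', 'Forward',
                 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Forward'],
    'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24],
    'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 2.14],
})


def value_range(values):
    """Hypothetical helper: largest minus smallest value of a group."""
    return values.max() - values.min()


# Strings refer to built-in aggregations; callables are applied per group
# and show up in the result under their function name
result = df.groupby('Team')[['Age', 'Height']].agg(['mean', value_range])
print(result)
```

In this sketch, the custom function receives the values of one column for one group at a time, so it can use any Series method internally.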

In some use cases, we might want to perform different aggregations for different columns on our groups. That approach looks like this:

df.groupby('Team').agg({'Age': ['mean', 'median'], 'Height': 'std'})

The output:

        Age           Height
       mean median       std
Team
Blues  25.0   25.0  0.103779
Reds   26.8   24.0  0.066106


This time, we pass the agg() method a dictionary. The dictionary’s keys are the column names, and the values are the methods that we want to compute on the groups, given as a list of strings or as a single string if only one method is applied to a column.

As you can see, combining the groupby() method with the agg() method is extremely useful. This way, we can apply multiple methods to a group and even tailor the methods to different columns with just one line of code.
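If the two-level column header of the dictionary approach is inconvenient, Pandas also offers named aggregation, where each output column gets its own name. The following is a minimal sketch; the output column names mean_age, median_age, and height_std are assumptions chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues',
             'Reds', 'Reds', 'Reds', 'Reds', 'Reds'],
    'Position': ['Non Forward', 'Forward', 'Non Forward', 'Non Forward', 'Forward',
                 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Forward'],
    'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24],
    'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 2.14],
})

# Named aggregation: new_column_name=(source_column, aggregation)
summary = df.groupby('Team').agg(
    mean_age=('Age', 'mean'),
    median_age=('Age', 'median'),
    height_std=('Height', 'std'),
)
print(summary)
```

The values should match the dictionary version above; only the column labels differ, which can make the result easier to use in later steps.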

Grouping by Multiple Columns

By now, we have learned how we can group a DataFrame by one column and perform methods on these groups. However, it is also possible, and in many use cases very convenient, to group the DataFrame by multiple columns. To see how that works, let’s refer to the previous example with the basketball teams:

    Team     Position  Age  Height
0  Blues  Non Forward   23    1.98
1  Blues      Forward   19    2.12
2  Blues  Non Forward   31    1.97
3  Blues  Non Forward   25    2.01
4  Blues      Forward   27    2.21
5   Reds  Non Forward   18    1.99
6   Reds      Forward   41    2.05
7   Reds  Non Forward   28    2.01
8   Reds      Forward   23    2.12
9   Reds      Forward   24    2.14

We now apply the groupby() method for multiple columns:

df.groupby(['Team', 'Position']).mean()

The output is the following DataFrame:

                         Age    Height
Team  Position
Blues Forward      23.000000  2.165000
      Non Forward  26.333333  1.986667
Reds  Forward      29.333333  2.103333
      Non Forward  23.000000  2.000000

Here, we pass a list of columns to the groupby() method to determine by which columns we want to group the DataFrame. In this case, we pass in the “Team” and the “Position” column. The mean() at the end of the line means we want to compute the mean value for every numeric column of the DataFrame grouped by team and position. The first row, for example, shows that the forwards of the Blues team are on average 23 years old and 2.165 m tall.

As you can see, grouping by multiple columns lets us break our data down further and get more information out of it. Grouping by just one column only allows us to compare the teams or the positions with each other, whereas grouping by multiple columns allows us to compare the positions within each team.
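The two-level index of such a result can be reshaped to make comparisons easier. Here is a minimal sketch, assuming the same basketball DataFrame, that uses reset_index() to get a flat table and unstack() to place the positions side by side. The names flat and wide are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Blues', 'Blues', 'Blues', 'Blues', 'Blues',
             'Reds', 'Reds', 'Reds', 'Reds', 'Reds'],
    'Position': ['Non Forward', 'Forward', 'Non Forward', 'Non Forward', 'Forward',
                 'Non Forward', 'Forward', 'Non Forward', 'Forward', 'Forward'],
    'Age': [23, 19, 31, 25, 27, 18, 41, 28, 23, 24],
    'Height': [1.98, 2.12, 1.97, 2.01, 2.21, 1.99, 2.05, 2.01, 2.12, 2.14],
})

grouped = df.groupby(['Team', 'Position']).mean()

# reset_index() turns the two index levels back into regular columns
flat = grouped.reset_index()
print(flat)

# unstack() moves the 'Position' level into the columns, so each team
# becomes one row and the positions can be compared side by side
wide = df.groupby(['Team', 'Position'])['Age'].mean().unstack('Position')
print(wide)
```

In the wide table, reading across a row compares the positions within one team, while reading down a column compares the teams for one position.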

Summary

The groupby() method is exceedingly powerful. It allows us to group our data by one or more columns and compute all sorts of functions on these groups. This way, we can compare the groups easily and get a good overview of our data, all with a small amount of code.

If you want to extend your knowledge of this mighty Pandas tool, I recommend you read the official documentation. If you wish to learn more about Pandas, other Python libraries, basic Python, or other computer science topics, you’ll find more tutorials and interesting articles on the Finxter Blog page, for example “10 Minutes to Pandas”.

Happy Coding!