pd.DataFrame.groupby() – A Simple Illustrated Guide

5/5 - (1 vote)

In Python, the pandas.DataFrame.groupby() function splits a pandas.DataFrame into subgroups. It is a part of a full groupby operation, where we usually split the data, apply a function, and then combine the result.

Here is the argument table of pandas.DataFrame.groupby().

If it sounds great to you, please continue reading, and you will fully understand the pandas.DataFrame.groupby() function through Python code snippets and vivid visualization.

This tutorial is about pandas.DataFrame.groupby() function. 

  • First, I will introduce its syntax and arguments. 
  • Then, you will meet our example dataset and learn basic examples of this function.
  • I will address two top questions about pandas.DataFrame.groupby(), including multiple columns and aggregation.
  • I will present you a full group by operation example of pandas.DataFrame.groupby.

By the way, we will use 2015-2016 world happiness report dataset throughout our tutorial. The original dataset is download from an open-source Kaggle dataset here.

You can find all the codes and cleaned dataset in this tutorial here

Syntax and Arguments

Here is the syntax of pandas.DataFrame.groupby():

# Syntax
pandas.DataFrame.groupby(by=None[, axis=0[, level=None[, as_index=True[, sort=True[, group_keys=True[, squeeze=NoDefault.no_default[, observed=False[, dropna=True]]]]]]]])

Here is the argument table of pandas.DataFrame.groupby():

ArgumentAcceptDescription
bymapping, function, label, or list of labelsCriteria for subgroup split.
axis{0 or 'index', 1 or 'columns'}, default 0Split along rows (0) or columns (1).
levelint, level name, or sequence of such, default NoneIf the axis is a MultiIndex (hierarchical), group by a particular level or levels.
as_indexbool, default TrueFor aggregated output, return an object with group labels as the index.
sortbool, default TrueSort group keys. Get better performance by turning it off. This does not influence the orders within each group.
group_keysbool, default TrueWhen calling apply, add group keys to index to identify pieces.
squeezebool, default FalseReduce the dimensionality of the return type if possible, otherwise, return a consistent type.
observedbool, default FalseThis only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers.
dropnabool, default TrueIf True, and if group keys contain NA values, NA values together with row/column will be dropped

The output of pandas.DataFrame.groupby() function is a groupby object that contains information about the groups.

Dataset and Basic Examples

We will use 2015-2016 world happiness report dataset throughout our tutorial. 

You can find all the codes and cleaned dataset in this tutorial here

Let’s take a look at its first 5 rows and column names.

Here is the code:

import pandas as pd

df = pd.read_csv('world_happiness_report_2015_2016.csv')
print(df.head())
print(df.columns)

Output:

So, for our basic example, we can simply split the data into 2015 subgroup and 2016 subgroup based on the column “year”. And we can get a glimpse into grouped result by reading the first row in each subgroup using .first() method. 

import pandas as pd

# load the data
df = pd.read_csv('world_happiness_report_2015_2016.csv')

# split the data into 2015 subgroup and 2016 subgroup based on the column “year”
df_year = df.groupby('year')

# take a look at the first row in each subgroup
df_year_first = df_year.first()
print(df_year_first)

Output:

Hooray, we successfully split the data into subgroups based on one column, year. 

pd.DataFrame.groupby() Multiple Columns

Sometimes we may want to split the data into subgroups based on more than one column. We can achieve this by passing a list of column names to the groupby() function.

For example, it is quite meaningful to split our world happiness report dataset based on year and region. In this way, we may discover that some regions tend to score higher overall in the happiness score.

All we need to do is pass ["year", "Region"] to the groupby() function. And again, we can use the first() method to capture the first row in each region and year subgroup.

In our case, they are the top-scored country records in each region and year!

Here is the code:

import pandas as pd

# load the data
df = pd.read_csv('world_happiness_report_2015_2016.csv')

# split the data based on year, region
df_year_region = df.groupby(['year', 'Region'])

# get the top-scored country records in each region and year
df_year_region_first = df_year_region.first()[['Country', 'Happiness Rank']]
print(df_year_region_first)

Output:

Just to be clear, as a Chinese, I notice that the country column contains a quite wrong record here. Let’s comprehend the country column as a “region/country” column.

If you also notice something wrong from your cultural background, I am sincerely sorry. Please absorb the usage knowldege of the groupby fucntion and ignore the output information here.

By the way, if you want a SQL-like output, where the year and region are not treated as the index, you can turn the as_index parameter to be False. Now, It is finally pretty self-explanatory. 🙂

pd.DataFrame.groupby() Aggregation

In a full groupby operation, pandas.DataFrame.groupby finishes it mission by spliting the data into subgroups.

We still need to apply a function and combine the result to get some interesting output. To do so, we can either use the agg() function or any straight-forward aggregation function on top of the GroupBy object.

In our world happiness report dataset, we may wonder the average Happiness Rank in each region and year. Okay, so we already how to split the data into subgroups based on region and year. What’s next?

First, let’s use the agg() function to get the average Happiness Rank in each region and year.

Here is the code:

import pandas as pd

# load the data
df = pd.read_csv('world_happiness_report_2015_2016.csv')

# split the data based on year, region
df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']]

# use the agg() function to get the average Happiness Rank in each region and year
df_avg = df_year_region.agg('mean')
print(df_avg)

Output:

Second, let’s use the plain mean() function to get the average Happiness Rank in each region and year.

Here is the code:

import pandas as pd

# load the data
df = pd.read_csv('world_happiness_report_2015_2016.csv')

# split the data based on year, region
df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']]

# use the mean() function to get the average Happiness Rank in each region and year
df_avg = df_year_region.mean()
print(df_avg)

Output:

Hmm, it looks like the Australia and New Zealand people are very happy 🙂

In summary, you can either use the agg() function or any straight-forward aggregation function on top of the GroupBy object to finish a full groupby operation.

Personally speaking, I find the latter one much easier to use.

pd.DataFrame.groupby() Full Example

Other than the average Happiness Rank in each region and year, we will use more aggregation functions in this part. In this way, I hope you can get familiar with using the groupby function and aggregation functions as a workflow!

Concretely, we will use the count() function to get the number of top 10 countries in a region each year, then use the min() function to get the top ranked country in a region each year, and finally use the max() function to get least ranked country in a region each year.

Here is the code:

import pandas as pd

# load the data
df = pd.read_csv('world_happiness_report_2015_2016.csv')

# use the count() function to get the number of top 10 countries in a region each year
df_top_10 = df[df['Happiness Rank'] <= 10]
df_year_region = df_top_10.groupby(['year', 'Region'])['Country']
top_10_count = df_year_region.count()
print('The number of top 10 countries in a region each year:')
print(top_10_count)
print('-'*85)

# use the min() function to get the top ranked country in a region each year
df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']]
top_country = df_year_region.min()
print('The top ranked country in a region each year:')
print(top_country)
print('-'*85)

# use the max() function to get least ranked country in a region each year.
df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']]
least_country = df_year_region.max()
print('The least ranked country in a region each year:')
print(least_country)

Output:

Summary

That’s it for our pd.DataFrame.groupby() article. 

We learned about its syntax, arguments, and basic examples. 

We also worked on the top two questions about the pd.DataFrame.groupby() function, ranging from pandas.DataFrame.groupby multiple columns and pandas.DataFrame.groupby aggregation. 

Last, we went through a full group by operation example in a real dataset, 2015-2016 world happiness report.

Hope you enjoy all this and happy coding!