In Python, the pandas.DataFrame.groupby()
function splits a pandas.DataFrame
into subgroups. It is a part of a full groupby
operation, where we usually split the data, apply a function, and then combine the result.
Here is the argument table of pandas.DataFrame.groupby()
.
If it sounds great to you, please continue reading, and you will fully understand the pandas.DataFrame.groupby()
function through Python code snippets and vivid visualization.
This tutorial is about pandas.DataFrame.groupby()
function.
- First, I will introduce its syntax and arguments.
- Then, you will meet our example dataset and learn basic examples of this function.
- I will address two top questions about
pandas.DataFrame.groupby()
, including multiple columns and aggregation. - I will present you a full group by operation example of
pandas.DataFrame.groupby
.
By the way, we will use 2015-2016 world happiness report dataset throughout our tutorial. The original dataset is download from an open-source Kaggle dataset here.
You can find all the codes and cleaned dataset in this tutorial here.
Syntax and Arguments
Here is the syntax of pandas.DataFrame.groupby()
:
# Syntax pandas.DataFrame.groupby(by=None[, axis=0[, level=None[, as_index=True[, sort=True[, group_keys=True[, squeeze=NoDefault.no_default[, observed=False[, dropna=True]]]]]]]])
Here is the argument table of pandas.DataFrame.groupby()
:
Argument | Accept | Description |
by | mapping, function, label, or list of labels | Criteria for subgroup split. |
axis | {0 or 'index', 1 or 'columns'} , default 0 | Split along rows (0) or columns (1). |
level | int , level name, or sequence of such, default None | If the axis is a MultiIndex (hierarchical), group by a particular level or levels. |
as_index | bool , default True | For aggregated output, return an object with group labels as the index. |
sort | bool , default True | Sort group keys. Get better performance by turning it off. This does not influence the orders within each group. |
group_keys | bool , default True | When calling apply, add group keys to index to identify pieces. |
squeeze | bool , default False | Reduce the dimensionality of the return type if possible, otherwise, return a consistent type. |
observed | bool , default False | This only applies if any of the groupers are Categoricals. If True : only show observed values for categorical groupers. |
dropna | bool , default True | If True , and if group keys contain NA values, NA values together with row/column will be dropped |
The output of pandas.DataFrame.groupby()
function is a groupby
object that contains information about the groups.
Dataset and Basic Examples
We will use 2015-2016 world happiness report dataset throughout our tutorial.
You can find all the codes and cleaned dataset in this tutorial here.
Let’s take a look at its first 5 rows and column names.
Here is the code:
import pandas as pd df = pd.read_csv('world_happiness_report_2015_2016.csv') print(df.head()) print(df.columns)
Output:
So, for our basic example, we can simply split the data into 2015 subgroup and 2016 subgroup based on the column “year”. And we can get a glimpse into grouped result by reading the first row in each subgroup using .first()
method.
import pandas as pd # load the data df = pd.read_csv('world_happiness_report_2015_2016.csv') # split the data into 2015 subgroup and 2016 subgroup based on the column “year” df_year = df.groupby('year') # take a look at the first row in each subgroup df_year_first = df_year.first() print(df_year_first)
Output:
Hooray, we successfully split the data into subgroups based on one column, year.
pd.DataFrame.groupby() Multiple Columns
Sometimes we may want to split the data into subgroups based on more than one column. We can achieve this by passing a list of column names to the groupby()
function.
For example, it is quite meaningful to split our world happiness report dataset based on year and region. In this way, we may discover that some regions tend to score higher overall in the happiness score.
All we need to do is pass ["year", "Region"]
to the groupby()
function. And again, we can use the first()
method to capture the first row in each region and year subgroup.
In our case, they are the top-scored country records in each region and year!
Here is the code:
import pandas as pd # load the data df = pd.read_csv('world_happiness_report_2015_2016.csv') # split the data based on year, region df_year_region = df.groupby(['year', 'Region']) # get the top-scored country records in each region and year df_year_region_first = df_year_region.first()[['Country', 'Happiness Rank']] print(df_year_region_first)
Output:
Just to be clear, as a Chinese, I notice that the country column contains a quite wrong record here. Let’s comprehend the country column as a “region/country” column.
If you also notice something wrong from your cultural background, I am sincerely sorry. Please absorb the usage knowldege of the groupby
fucntion and ignore the output information here.
By the way, if you want a SQL-like output, where the year and region are not treated as the index, you can turn the as_index
parameter to be False
. Now, It is finally pretty self-explanatory. 🙂
pd.DataFrame.groupby() Aggregation
In a full groupby operation, pandas.DataFrame.groupby
finishes it mission by spliting the data into subgroups.
We still need to apply a function and combine the result to get some interesting output. To do so, we can either use the agg()
function or any straight-forward aggregation function on top of the GroupBy
object.
In our world happiness report dataset, we may wonder the average Happiness Rank in each region and year. Okay, so we already how to split the data into subgroups based on region and year. What’s next?
First, let’s use the agg()
function to get the average Happiness Rank in each region and year.
Here is the code:
import pandas as pd # load the data df = pd.read_csv('world_happiness_report_2015_2016.csv') # split the data based on year, region df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']] # use the agg() function to get the average Happiness Rank in each region and year df_avg = df_year_region.agg('mean') print(df_avg)
Output:
Second, let’s use the plain mean()
function to get the average Happiness Rank in each region and year.
Here is the code:
import pandas as pd # load the data df = pd.read_csv('world_happiness_report_2015_2016.csv') # split the data based on year, region df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']] # use the mean() function to get the average Happiness Rank in each region and year df_avg = df_year_region.mean() print(df_avg)
Output:
Hmm, it looks like the Australia and New Zealand people are very happy 🙂
In summary, you can either use the agg()
function or any straight-forward aggregation function on top of the GroupBy
object to finish a full groupby operation.
Personally speaking, I find the latter one much easier to use.
pd.DataFrame.groupby() Full Example
Other than the average Happiness Rank in each region and year, we will use more aggregation functions in this part. In this way, I hope you can get familiar with using the groupby
function and aggregation
functions as a workflow!
Concretely, we will use the count()
function to get the number of top 10 countries in a region each year, then use the min()
function to get the top ranked country in a region each year, and finally use the max()
function to get least ranked country in a region each year.
Here is the code:
import pandas as pd # load the data df = pd.read_csv('world_happiness_report_2015_2016.csv') # use the count() function to get the number of top 10 countries in a region each year df_top_10 = df[df['Happiness Rank'] <= 10] df_year_region = df_top_10.groupby(['year', 'Region'])['Country'] top_10_count = df_year_region.count() print('The number of top 10 countries in a region each year:') print(top_10_count) print('-'*85) # use the min() function to get the top ranked country in a region each year df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']] top_country = df_year_region.min() print('The top ranked country in a region each year:') print(top_country) print('-'*85) # use the max() function to get least ranked country in a region each year. df_year_region = df.groupby(['year', 'Region'])[['Country', 'Happiness Rank']] least_country = df_year_region.max() print('The least ranked country in a region each year:') print(least_country)
Output:
Summary
That’s it for our pd.DataFrame.groupby()
article.
We learned about its syntax, arguments, and basic examples.
We also worked on the top two questions about the pd.DataFrame.groupby()
function, ranging from pandas.DataFrame.groupby
multiple columns and pandas.DataFrame.groupby
aggregation.
Last, we went through a full group by operation example in a real dataset, 2015-2016 world happiness report.
Hope you enjoy all this and happy coding!