[toc]
The groupby()
function saves you a ton of time and headache when analyzing data. It is fast and eases handling massive data. However, you may fail to maximize its potential if you don’t know how to use it. That is why this tutorial explains DataFrame grouping using relatable challenges, code snippets, and solutions.
π‘ Grouping by a DataFrame and keeping columns involves four steps: get the data, split it, apply a function, and combine the result.
It would be best to learn data grouping in Pandas before seeing practical examples.
What Exactly Does Groupby A DataFrame in Pandas Mean?
Pandas is one of the crucial packages for analyzing data in Python. It is built on Numpy, and as a result, functions written in Pandas compile rapidly. That makes it relevant in handling big data efficiently.
Pandas have many functions, one of them being the groupby()
method that enables you to create sub-data out of the parent data. Simply put, the groupby()
function in Pandas filters portions of data and appends it to other columns. Let’s see that practically.
4 Simple Steps To GroupBy A DataFrame In Pandas
The Problem
Given the 2021 Stackoverflow survey data, find the percentage of PHP developers per country. The resulting DataFrame should contain total developers per country, PHP developers among the developers, and their percentage.
The Solution
1οΈβ£ Get The Data
Open a code editor and import the data as CSV.
import pandas as pd df = pd.read_csv('survey_results_public.csv') pd.set_option('display.max_columns', None) print(df)
I am importing the core library, pandas
, and shortening its name as pd
. I then read the data using pandas
‘ read_csv
function with the name of the target file: survey_results_public.csv
.
Since the columns get truncated when handling a lengthy file in pandas
, I prevent the default behavior using the set_option()
function parsing the controls: 'display.max_columns'
and None
.
Let’s check the rows and columns using the shape
attribute.
df.shape
We get a tuple.
(83439, 48)
That means the survey produced 83439 rows and 48 columns.
2οΈβ£ Split The Data
We have a 48-column data.
ResponseId | MainBranch | Employment | Country | US_State | UK_Country | EdLevel | Age1stCode | LearnCode | YearsCode | YearsCodePro | DevType | OrgSize | Currency | CompTotal | CompFreq | LanguageHaveWorkedWith | LanguageWantToWorkWith | DatabaseHaveWorkedWith | DatabaseWantToWorkWith | PlatformHaveWorkedWith | PlatformWantToWorkWith | WebframeHaveWorkedWith | WebframeWantToWorkWith | MiscTechHaveWorkedWith | MiscTechWantToWorkWith | ToolsTechHaveWorkedWith | ToolsTechWantToWorkWith | NEWCollabToolsHaveWorkedWith | NEWCollabToolsWantToWorkWith | OpSys | NEWStuck | NEWSOSites | SOVisitFreq | SOAccount | SOPartFreq | SOComm | NEWOtherComms | Age | Gender | Trans | Sexuality | Ethnicity | Accessibility | MentalHealth | SurveyLength | SurveyEase | ConvertedCompYearly |
---|
We can split the data along the Country
column.
country_group = df.groupby('Country') print(country_group)
We are splitting the data into groups of countries, each country becoming an index with the same columns as the parent data.
Printing the result, we get an object.
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FF062FAD30>
The above line states that splitting the initial data produces a bunch of DataFrames attached to the DataFrameGroupBy object and stored at memory location 0x000001FF062FAD30.
Like any other DataFrame, we can view each column of the child group as we would in a dictionary. For example, we can inspect the languages each developer has worked with as follows.
3οΈβ£ Apply A Function
Knowing the percentage of PHP developers entails getting the total number of developers from each country and followed by PHP developers. We then divide the number of PHP developers by the total number of developers.
The first function to apply to the DataFrame is value_counts()
which shows the numeric output total per column. Let’s use the function to check the languages coded per country.
country_group['LanguageHaveWorkedWith'].value_counts()
We get a series with multiple indices. Next, we use the apply()
function to search for PHP amongst the languages used.
php_developers = country_group['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('PHP').sum()) print(php_developers)
The apply()
function applies the lambda function to every subgroup of the data. The lambda function, in turn, runs through every row, checking for PHP amongst the languages, and gets the total from each using the sum()
function.
You should get a Series of countries and the number of PHP developers who participated in the survey. You can use the set_option()
method on rows to view all records per country.
pd.set_option('display.max_rows', None)
Next, let’s store the total number of developers from each country who participated in the survey.
all_developers = df['Country'].value_counts()
We get a long list numerically sorted in descending. Let’s proceed by creating a new DataFrame.
4οΈβ£ Combine The Data
The concat()
function can help us create a DataFrame from all developers and PHP developers.
php_df = pd.concat([all_developers, php_developers], axis='columns', sort=False)
We get the following data:
Country | LanguageHaveWorkedWith | |
---|---|---|
United States of America | 15288 | 2152 |
India | 10511 | 2044 |
Germany | 5625 | 1233 |
United Kingdom of Great Britain and Northern Ireland | 4475 | 815 |
Canada | 3012 | 483 |
β¦ | β¦ | β¦ |
Saint Kitts and Nevis | 1 | 0 |
Dominica | 1 | 0 |
Saint Vincent and the Grenadines | 1 | 0 |
Tuvalu | 1 | 1 |
Papua New Guinea | 1 | 0 |
Let’s rename the above columns: Country to Total Developers and LanguageHaveWorkedWith to PHP Developers.
php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'})
We can add the inplace
attribute to permanently rename the columns.
php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'}, inplace=True)
Lastly, let’s create a new column for the percentages.
php_df['% PHP Developers'] = ( php_df['PHP Developers']/php_df['Total Developers']) * 100
And print the resulting DataFrame.
print(php_df)
We get a new DataFrame.
Total Developers | PHP Developers | % PHP Developers | |
---|---|---|---|
United States of America | 15288 | 2152 | 14.076400 |
India | 10511 | 2044 | 19.446294 |
Germany | 5625 | 1233 | 21.920000 |
United Kingdom of Great Britain and Northern Ireland | 4475 | 815 | 18.212291 |
Canada | 3012 | 483 | 16.035857 |
β¦ | β¦ | β¦ | β¦ |
Saint Kitts and Nevis | 1 | 0 | 0.000000 |
Dominica | 1 | 0 | 0.000000 |
Saint Vincent and the Grenadines | 1 | 0 | 0.000000 |
Tuvalu | 1 | 1 | 100.000000 |
Papua New Guinea | 1 | 0 | 0.000000 |
GroupBy and Keep Columns
Problem
Approach 1: Using size() and reset_index(name=’count’)
Example:
import pandas as pd data = { 'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'], 'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'], 'ID': [1, 2, 3, 1, 2] } df = pd.DataFrame(data) df_2 = df.groupby(['Name','Type','ID']).size().reset_index(name='count') print(df_2)
Output:
Name Type ID count 0 Book1 ebook 1 2 1 Book2 paper 2 2 2 Book3 paper 3 1
Approach 2: Count Columns Using transform and then Use drop_duplicates
Example:
import pandas as pd data = { 'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'], 'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'], 'ID': [1, 2, 3, 1, 2] } df = pd.DataFrame(data) df['Count'] = df.groupby(['Name'])['ID'].transform('count') df = df.drop_duplicates() print(df)
Output:
Name Type ID Count 0 Book1 ebook 1 2 1 Book2 paper 2 2 2 Book3 paper 3 1
Conclusion
This tutorial guided you to handle DataFrames using the groupby function. You can now sharpen your skills by manipulating more data, as shown in the examples section.
Learn Pandas the Fun Way by Solving Code Puzzles
If you want to boost your Pandas skills, consider checking out my puzzle-based learning book Coffee Break Pandas (Amazon Link).
It contains 74 hand-crafted Pandas puzzles including explanations. By solving each puzzle, you’ll get a score representing your skill level in Pandas. Can you become a Pandas Grandmaster?
Coffee Break Pandas offers a fun-based approach to data science mastery—and a truly gamified learning experience.