How To GroupBy A Dataframe In Pandas And Keep Columns

[toc]

The groupby() function saves you a ton of time and headache when analyzing data. It is fast and eases handling massive data. However, you may fail to maximize its potential if you don’t know how to use it. That is why this tutorial explains DataFrame grouping using relatable challenges, code snippets, and solutions.

💡 Grouping by a DataFrame and keeping columns involves four steps: get the data, split it, apply a function, and combine the result.

It would be best to learn data grouping in Pandas before seeing practical examples.

What Exactly Does Groupby A DataFrame in Pandas Mean?

Pandas is one of the crucial packages for analyzing data in Python. It is built on Numpy, and as a result, functions written in Pandas compile rapidly. That makes it relevant in handling big data efficiently.

Pandas have many functions, one of them being the groupby() method that enables you to create sub-data out of the parent data. Simply put, the groupby() function in Pandas filters portions of data and appends it to other columns. Let’s see that practically.

4 Simple Steps To GroupBy A DataFrame In Pandas

The Problem

Given the 2021 Stackoverflow survey data, find the percentage of PHP developers per country. The resulting DataFrame should contain total developers per country, PHP developers among the developers, and their percentage.

The Solution

1️⃣ Get The Data

Download the data here

Open a code editor and import the data as CSV.

import pandas as pd

df = pd.read_csv('survey_results_public.csv')

pd.set_option('display.max_columns', None)
print(df)

I am importing the core library, pandas, and shortening its name as pd. I then read the data using pandas‘ read_csv function with the name of the target file: survey_results_public.csv.

Since the columns get truncated when handling a lengthy file in pandas, I prevent the default behavior using the set_option() function parsing the controls: 'display.max_columns' and None.

Let’s check the rows and columns using the shape attribute.

df.shape

We get a tuple.

(83439, 48)

That means the survey produced 83439 rows and 48 columns.

2️⃣ Split The Data

We have a 48-column data.

ResponseId	MainBranch	Employment	Country	US_State	UK_Country	EdLevel	Age1stCode	LearnCode	YearsCode	YearsCodePro	DevType	OrgSize	Currency	CompTotal	CompFreq	LanguageHaveWorkedWith	LanguageWantToWorkWith	DatabaseHaveWorkedWith	DatabaseWantToWorkWith	PlatformHaveWorkedWith	PlatformWantToWorkWith	WebframeHaveWorkedWith	WebframeWantToWorkWith	MiscTechHaveWorkedWith	MiscTechWantToWorkWith	ToolsTechHaveWorkedWith	ToolsTechWantToWorkWith	NEWCollabToolsHaveWorkedWith	NEWCollabToolsWantToWorkWith	OpSys	NEWStuck	NEWSOSites	SOVisitFreq	SOAccount	SOPartFreq	SOComm	NEWOtherComms	Age	Gender	Trans	Sexuality	Ethnicity	Accessibility	MentalHealth	SurveyLength	SurveyEase	ConvertedCompYearly

We can split the data along the Country column.

country_group = df.groupby('Country')
print(country_group)

We are splitting the data into groups of countries, each country becoming an index with the same columns as the parent data.

Printing the result, we get an object.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FF062FAD30>

The above line states that splitting the initial data produces a bunch of DataFrames attached to the DataFrameGroupBy object and stored at memory location 0x000001FF062FAD30.

Like any other DataFrame, we can view each column of the child group as we would in a dictionary. For example, we can inspect the languages each developer has worked with as follows.

3️⃣ Apply A Function

Knowing the percentage of PHP developers entails getting the total number of developers from each country and followed by PHP developers. We then divide the number of PHP developers by the total number of developers.

The first function to apply to the DataFrame is value_counts() which shows the numeric output total per column. Let’s use the function to check the languages coded per country.

country_group['LanguageHaveWorkedWith'].value_counts()

We get a series with multiple indices. Next, we use the apply() function to search for PHP amongst the languages used.

php_developers = country_group['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('PHP').sum())

print(php_developers)

The apply() function applies the lambda function to every subgroup of the data. The lambda function, in turn, runs through every row, checking for PHP amongst the languages, and gets the total from each using the sum() function.

You should get a Series of countries and the number of PHP developers who participated in the survey. You can use the set_option() method on rows to view all records per country.

pd.set_option('display.max_rows', None)

Next, let’s store the total number of developers from each country who participated in the survey.

all_developers = df['Country'].value_counts()

We get a long list numerically sorted in descending. Let’s proceed by creating a new DataFrame.

4️⃣ Combine The Data

The concat() function can help us create a DataFrame from all developers and PHP developers.

php_df = pd.concat([all_developers, php_developers], axis='columns', sort=False)

We get the following data:

	Country	LanguageHaveWorkedWith
United States of America	15288	2152
India	10511	2044
Germany	5625	1233
United Kingdom of Great Britain and Northern Ireland	4475	815
Canada	3012	483
…	…	…
Saint Kitts and Nevis	1	0
Dominica	1	0
Saint Vincent and the Grenadines	1	0
Tuvalu	1	1
Papua New Guinea	1	0

Let’s rename the above columns: Country to Total Developers and LanguageHaveWorkedWith to PHP Developers.

php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'})

We can add the inplace attribute to permanently rename the columns.

php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'}, inplace=True)

Lastly, let’s create a new column for the percentages.

php_df['% PHP Developers'] = ( php_df['PHP Developers']/php_df['Total Developers']) * 100

And print the resulting DataFrame.

print(php_df)

We get a new DataFrame.

	Total Developers	PHP Developers	% PHP Developers
United States of America	15288	2152	14.076400
India	10511	2044	19.446294
Germany	5625	1233	21.920000
United Kingdom of Great Britain and Northern Ireland	4475	815	18.212291
Canada	3012	483	16.035857
…	…	…	…
Saint Kitts and Nevis	1	0	0.000000
Dominica	1	0	0.000000
Saint Vincent and the Grenadines	1	0	0.000000
Tuvalu	1	1	100.000000
Papua New Guinea	1	0	0.000000

GroupBy and Keep Columns

Problem

Approach 1: Using size() and reset_index(name=’count’)

Example:

import pandas as pd

data = {
  'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
  'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
  'ID': [1, 2, 3, 1, 2]
}
df = pd.DataFrame(data)
df_2 = df.groupby(['Name','Type','ID']).size().reset_index(name='count')
print(df_2)

Output:

    Name   Type  ID  count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1

Approach 2: Count Columns Using transform and then Use drop_duplicates

Example:

import pandas as pd

data = {
  'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
  'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
  'ID': [1, 2, 3, 1, 2]
}
df = pd.DataFrame(data)
df['Count'] = df.groupby(['Name'])['ID'].transform('count')
df = df.drop_duplicates()
print(df)

Output:

    Name   Type  ID  Count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1

Conclusion

This tutorial guided you to handle DataFrames using the groupby function. You can now sharpen your skills by manipulating more data, as shown in the examples section.

Learn Pandas the Fun Way by Solving Code Puzzles

If you want to boost your Pandas skills, consider checking out my puzzle-based learning book Coffee Break Pandas (Amazon Link).

It contains 74 hand-crafted Pandas puzzles including explanations. By solving each puzzle, you’ll get a score representing your skill level in Pandas. Can you become a Pandas Grandmaster?

Coffee Break Pandas offers a fun-based approach to data science mastery—and a truly gamified learning experience.