How To GroupBy A Dataframe In Pandas And Keep Columns

[toc]

The groupby() function saves you a ton of time and headache when analyzing data. It is fast and eases handling massive data. However, you may fail to maximize its potential if you don’t know how to use it. That is why this tutorial explains DataFrame grouping using relatable challenges, code snippets, and solutions.

πŸ’‘ Grouping by a DataFrame and keeping columns involves four steps: get the data, split it, apply a function, and combine the result.

It would be best to learn data grouping in Pandas before seeing practical examples.

What Exactly Does Groupby A DataFrame in Pandas Mean?

Pandas is one of the crucial packages for analyzing data in Python. It is built on Numpy, and as a result, functions written in Pandas compile rapidly. That makes it relevant in handling big data efficiently.

Pandas have many functions, one of them being the groupby() method that enables you to create sub-data out of the parent data. Simply put, the groupby() function in Pandas filters portions of data and appends it to other columns. Let’s see that practically.

4 Simple Steps To GroupBy A DataFrame In Pandas

The Problem

Given the 2021 Stackoverflow survey data, find the percentage of PHP developers per country. The resulting DataFrame should contain total developers per country, PHP developers among the developers, and their percentage.

The Solution

1️⃣ Get The Data

Open a code editor and import the data as CSV.

import pandas as pd

df = pd.read_csv('survey_results_public.csv')

pd.set_option('display.max_columns', None)
print(df)

I am importing the core library, pandas, and shortening its name as pd. I then read the data using pandasread_csv function with the name of the target file: survey_results_public.csv.

Since the columns get truncated when handling a lengthy file in pandas, I prevent the default behavior using the set_option() function parsing the controls: 'display.max_columns' and None.

Let’s check the rows and columns using the shape attribute.

df.shape

We get a tuple.

(83439, 48)

That means the survey produced 83439 rows and 48 columns.

2️⃣ Split The Data

We have a 48-column data.

ResponseIdMainBranchEmploymentCountryUS_StateUK_CountryEdLevelAge1stCodeLearnCodeYearsCodeYearsCodeProDevTypeOrgSizeCurrencyCompTotalCompFreqLanguageHaveWorkedWithLanguageWantToWorkWithDatabaseHaveWorkedWithDatabaseWantToWorkWithPlatformHaveWorkedWithPlatformWantToWorkWithWebframeHaveWorkedWithWebframeWantToWorkWithMiscTechHaveWorkedWithMiscTechWantToWorkWithToolsTechHaveWorkedWithToolsTechWantToWorkWithNEWCollabToolsHaveWorkedWithNEWCollabToolsWantToWorkWithOpSysNEWStuckNEWSOSitesSOVisitFreqSOAccountSOPartFreqSOCommNEWOtherCommsAgeGenderTransSexualityEthnicityAccessibilityMentalHealthSurveyLengthSurveyEaseConvertedCompYearly

We can split the data along the Country column.

country_group = df.groupby('Country')
print(country_group)

We are splitting the data into groups of countries, each country becoming an index with the same columns as the parent data.

Printing the result, we get an object.

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001FF062FAD30>

The above line states that splitting the initial data produces a bunch of DataFrames attached to the DataFrameGroupBy object and stored at memory location 0x000001FF062FAD30.

Like any other DataFrame, we can view each column of the child group as we would in a dictionary. For example, we can inspect the languages each developer has worked with as follows.

3️⃣ Apply A Function

Knowing the percentage of PHP developers entails getting the total number of developers from each country and followed by PHP developers. We then divide the number of PHP developers by the total number of developers.

The first function to apply to the DataFrame is value_counts() which shows the numeric output total per column. Let’s use the function to check the languages coded per country.

country_group['LanguageHaveWorkedWith'].value_counts()

We get a series with multiple indices. Next, we use the apply() function to search for PHP amongst the languages used.

php_developers = country_group['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('PHP').sum())

print(php_developers)

The apply() function applies the lambda function to every subgroup of the data. The lambda function, in turn, runs through every row, checking for PHP amongst the languages, and gets the total from each using the sum() function.

You should get a Series of countries and the number of PHP developers who participated in the survey. You can use the set_option() method on rows to view all records per country.

pd.set_option('display.max_rows', None)

Next, let’s store the total number of developers from each country who participated in the survey.

all_developers = df['Country'].value_counts()

We get a long list numerically sorted in descending. Let’s proceed by creating a new DataFrame.

4️⃣ Combine The Data

The concat() function can help us create a DataFrame from all developers and PHP developers.

php_df = pd.concat([all_developers, php_developers], axis='columns', sort=False)

We get the following data:

CountryLanguageHaveWorkedWith
United States of America152882152
India105112044
Germany56251233
United Kingdom of Great Britain and Northern Ireland4475815
Canada3012483
………
Saint Kitts and Nevis10
Dominica10
Saint Vincent and the Grenadines10
Tuvalu11
Papua New Guinea10

Let’s rename the above columns: Country to Total Developers and LanguageHaveWorkedWith to PHP Developers.

php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'})

We can add the inplace attribute to permanently rename the columns.

php_df.rename(columns={'Country': 'Total Developers', 'LanguageHaveWorkedWith': 'PHP Developers'}, inplace=True) 

Lastly, let’s create a new column for the percentages.

php_df['% PHP Developers'] = ( php_df['PHP Developers']/php_df['Total Developers']) * 100

And print the resulting DataFrame.

print(php_df)

We get a new DataFrame.

Total DevelopersPHP Developers% PHP Developers
United States of America15288215214.076400
India10511204419.446294
Germany5625123321.920000
United Kingdom of Great Britain and Northern Ireland447581518.212291
Canada301248316.035857
…………
Saint Kitts and Nevis100.000000
Dominica100.000000
Saint Vincent and the Grenadines100.000000
Tuvalu11100.000000
Papua New Guinea100.000000

GroupBy and Keep Columns

Problem

GroupBy and Keep Columns in a dataframe

Approach 1: Using size() and reset_index(name=’count’)

Example:

import pandas as pd

data = {
  'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
  'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
  'ID': [1, 2, 3, 1, 2]
}
df = pd.DataFrame(data)
df_2 = df.groupby(['Name','Type','ID']).size().reset_index(name='count')
print(df_2)

Output:

    Name   Type  ID  count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1

Approach 2: Count Columns Using transform and then Use drop_duplicates

Example:

import pandas as pd

data = {
  'Name': ['Book1', 'Book2', 'Book3', 'Book1', 'Book2'],
  'Type': ['ebook', 'paper', 'paper', 'ebook', 'paper'],
  'ID': [1, 2, 3, 1, 2]
}
df = pd.DataFrame(data)
df['Count'] = df.groupby(['Name'])['ID'].transform('count')
df = df.drop_duplicates()
print(df)

Output:

    Name   Type  ID  Count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1

Conclusion

This tutorial guided you to handle DataFrames using the groupby function. You can now sharpen your skills by manipulating more data, as shown in the examples section.


Learn Pandas the Fun Way by Solving Code Puzzles

If you want to boost your Pandas skills, consider checking out my puzzle-based learning book Coffee Break Pandas (Amazon Link).

Coffee Break Pandas Book

It contains 74 hand-crafted Pandas puzzles including explanations. By solving each puzzle, you’ll get a score representing your skill level in Pandas. Can you become a Pandas Grandmaster?

Coffee Break Pandas offers a fun-based approach to data science mastery—and a truly gamified learning experience.