The name agg is short for aggregate. To aggregate is to summarize many observations into a single value that represents a certain aspect of the observed data.
The .agg()
function can process a dataframe, a series, or a grouped dataframe. It can execute many aggregation functions, e.g. βmean
β, βmax
β,β¦ in a single call along one of the axis. It can also execute lambda functions. Read on for examples.
We will use a dataset of FIFA players. Find the dataset here.
Basic Setup using Jupyter Notebook
Letβs start by importing pandas and loading our dataset.
import pandas as pd df_fifa_soccer_players = pd.read_csv('fifa_cleaned.csv') df_fifa_soccer_players.head()
To increase readability, we will work with a subset of the data. Let’s create the subset by selecting the columns we want to have in our subset and create a new dataframe.
df_fifa_soccer_players_subset = df_fifa_soccer_players[['nationality', 'age', 'height_cm', 'weight_kgs', 'overall_rating', 'value_euro', 'wage_euro']] df_fifa_soccer_players_subset.head()
Basic Aggregation
Pandas provides a variety of built-in aggregation functions. For example, pandas.DataFrame.describe
. When applied to a dataset, it returns a summary of statistical values.
df_fifa_soccer_players_subset.describe()
To understand aggregation and why it is helpful, let’s have a closer look at the data returned.
Example: Our dataset contains records for 17954 players. The youngest player is 17 years of age and the oldest player is 46 years old. The mean age is 25 years. We learn that the tallest player is 205 cm tall and the average player’s height is around 175 cm. With a single line of code, we can answer a variety of statistical questions about our data. The describe
function identifies numeric columns and performs the statistical aggregation for us. Describe also excluded the column nationality
that contains string values.
To aggregate is to summarize many observations into a single value that represents a certain aspect of the observed data.
Pandas provides us with a variety of pre-built aggregate functions.
Functions | Description |
mean() | returns the mean of a set of values |
sum() | returns the sum of a set of values |
count() | returns the count of a set of values |
std() | returns the standard deviation of a set of values |
min() | returns the smallest value of a set of values |
max() | returns the largest value of a set of values |
describe() | returns a collection of statistical values of a set of values |
size() | returns the size of a set of values |
first() | returns the first value of a set of values |
last() | returns the last value of a set of values |
nth() | returns the nth value of a set of values |
sem() | returns the standard error of the mean of a set of value |
var() | returns the variance of a set of values |
nunique() | returns the count of unique values of a set of values |
Let’s use another function from the list above. We can be more specific and request the βsum
β for the βvalue_euro
β series. This column contains the market value of a player. We select the column or series βvalue_euro
β and execute the pre-build sum()
function.
df_fifa_soccer_players_subset['value_euro'].sum() # 43880780000.0
Pandas returned us the requested value. Let’s get to know an even more powerful pandas method for aggregating data.
The βpandas.DataFrame.aggβ Method
Function Syntax
The .agg()
function can take in many input types. The output type is, to a large extent, determined by the input type. We can pass in many parameters to the .agg()
function.
The βfunc
β parameter:
- is by default set to
None
- contains one or many functions that aggregate the data
- supports pre-defined pandas aggregate functions
- supports lambda expressions
- supports the
dataframe.apply()
method for specific function calls
The βaxis
β parameter:
- is by default set to 0 and applies functions to each column
- if set to 1 applies functions to rows
- can hold values:
0
or βindex
β1
or βcolumns
β
What about *args
and **kwargs
:
- we use these placeholders, if we do not know in advance how many arguments we will need to pass into the function
- when arguments are of the same type, we use
*args
- When arguments are of different types, we use
**kwargs
.
Agg method on a Series
Letβs see the .agg()
function in action. We request some of the pre-build aggregation functions for the βwage_euro
β series. We use the function parameter and provide the aggregate functions βwe want to execute as a list. And letβs save the resulting series in a variable.
wage_stats = df_fifa_soccer_players_subset['wage_euro'].agg(['sum', 'min', 'mean', 'std', 'max']) print(wage_stats)
Pandas uses scientific notation for large and small floating-point numbers. To convert the output to a familiar format, we must move the floating point to the right as shown by the plus sign. The number behind the plus sign represents the amount of steps.
Let’s do this together for some values.
The sum of all wages is 175,347,000β¬ (1.753470e+08)
The mean of the wages is 9902.135β¬ (9.902135e+03)
We executed many functions on a series input source. Thus our variable βwage_stats
β is of the type Series
because.
type(wage_stats) # pandas.core.series.Series
See below how to extract, for example, the βmin
β value from the variable and the data type returned.
wage_stats_min = wage_stats['min'] print(wage_stats_min) # 1000.0 print(type(wage_stats_min)) # numpy.float64
The data type is now a scalar.
If we execute a single function on the same data source (series), the type returned is a scalar.
wage_stats_max = df_fifa_soccer_players_subset['wage_euro'].agg('max') print(wage_stats_max) # 565000.0 print(type(wage_stats_max)) # numpy.float64
Letβs use one more example to understand the relation between the input type and the output type.
We will use the function βnunique
β which will give us the count of unique nationalities. Letβs apply the function in two code examples. We will reference the series βnationality
β both times. The only difference will be the way we pass the function βnunique
β into our agg()
function.
nationality_unique_series = df_fifa_soccer_players_subset['nationality'].agg({'nationality':'nunique'}) print(nationality_unique_series) # nationality 160 # Name: nationality, dtype: int64 print(type(nationality_unique_series)) # pandas.core.series.Series
When we use a dictionary to pass in the βnunique
β function, the output type is a series.
nationality_unique_int = df_fifa_soccer_players_subset['nationality'].agg('nunique') print(nationality_unique_int) # 160 print(type(nationality_unique_int)) # int
When we pass the βnunique
β function directly into agg()
the output type is an integer.
Agg method on a DataFrame
Passing the aggregation functions as a Python list
One column represents a series. We will now select two columns as our input and so work with a dataframe.
Letβs select the columns βheight_cm
β and βweight_kgs
β.
We will execute the functions min()
, mean()
and max()
. To select a two-dimensional data (dataframe), we need to use double brackets. We will round the results to two decimal points.
Letβs store the result in a variable.
height_weight = df_fifa_soccer_players_subset[['height_cm', 'weight_kgs']].agg(['min', 'mean', 'max']).round(2) print(height_weight)
We get a data frame containing rows and columns. Let’s confirm this observation by checking the type of the βheight_weight
β variable.
print(type(height_weight)) # pandas.core.frame.DataFrame
We will now use our newly created dataframe named βheight_weight
β to use the βaxis
β parameter. The entire dataframe contains numeric values.
We define the functions and pass in the axis
parameter. I used the count()
and sum()
functions to show the effect of the axis
parameter. The resulting values make little sense. This is also the reason why I do not rename the headings to restore the lost column names.
height_weight.agg(['count', 'sum'], axis=1)
We aggregated along the rows. Returning the count of items and the sum of item values in each row.
Passing the aggregation functions as a python dictionary
Now let’s apply different functions to the individual sets in our dataframe. We select the sets βoverall_rating
β and βvalue_euro
β. We will apply the functions std()
, sem()
and mean()
to the βoverall_rating
β series, and the functions min()
and max()
to the βvalue_euro
β series.
rating_value_euro_dict = df_fifa_soccer_players_subset[['overall_rating', 'value_euro']].agg({'overall_rating':['std', 'sem', 'mean'], 'value_euro':['min', 'max']}) print(rating_value_euro_dict)
The dataframe contains calculated and empty (NaN) values. Letβs quickly confirm the type of our output.
print(type(rating_value_euro_dict)) # pandas.core.frame.DataFrame
Passing the aggregation functions as a Python tuple
We will now repeat the previous example.
We will use tuples instead of a dictionary to pass in the aggregation functions. Tuple have limitations. We can only pass one aggregation function within a tuple. We also have to name each tuple.
rating_value_euro_tuple = df_fifa_soccer_players_subset[['overall_rating', 'value_euro']].agg(overall_r_std=('overall_rating', 'std'),overall_r_sem=('overall_rating', 'sem'),overall_r_mean=('overall_rating', 'mean'),value_e_min=('value_euro', 'min'),value_e_max=('value_euro', 'max')) print(rating_value_euro_tuple)
Agg method on a grouped DataFrame
Grouping by a single column
The βgroupby
β method creates a grouped dataframe. We will now select the columns βage
β and βwage_euro
β and group our dataframe using the column βage
β. On our grouped dataframe we will apply the agg()
function using the functions count()
, min()
, max()
and mean()
.
age_group_wage_euro = df_fifa_soccer_players_subset[['age', 'wage_euro']].groupby('age').aggage(['count', 'min', 'max', 'mean']) print(age_group_wage_euro)
Every row represents an age group. The count value shows how many players fall into the age group. The min, max and mean values aggregate the data of the age-group members.
Multiindex
One additional aspect of a grouped dataframe is the resulting hierarchical index. We also call it multiindex.
We can see that the individual columns of our grouped dataframe are at different levels. Another way to view the hierarchy is to request the columns for the particular dataset.
print(age_group_wage_euro.columns)
Working with a multiindex is a topic for another blog post. To use the tools that we have discussed, let’s flatten the multiindex and reset the index. We need the following functions:
droplevel()
reset_index()
age_group_wage_euro_flat = age_group_wage_euro.droplevel(axis=1, level=0).reset_index() print(age_group_wage_euro_flat.head())
The resulting dataframe columns are now flat. We lost some information during the flattening process. Letβs rename the columns and return some of the lost context.
age_group_wage_euro_flat.columns = ['age', 'athlete_count', 'min_wage_euro', 'max_wage_euro', 'mean_wage_euro'] print(age_group_wage_euro_flat.head())
Grouping by multiple columns
Grouping by multiple columns creates even more granular subsections.
Letβs use βage
β as the first grouping parameter and βnationality
β as the second. We will aggregate the resulting group data using the columns βoverall_rating
β and βheight_cm
β. We are by now familiar with the aggregation functions used in this example.
df_fifa_soccer_players_subset.groupby(['age', 'nationality']).agg({'overall_rating':['count', 'min', 'max', 'mean'], 'height_cm':['min', 'max', 'mean']})
Every age group contains nationality groups. The aggregated athletes data is within the nationality groups.
Custom aggregation functions
We can write and execute custom aggregation functions to answer very specific questions.
Letβs have a look at the inline lambda functions.
π‘ Lambda functions are so-called anonymous functions. They are called this way because they do not have a name. Within a lambda function, we can execute multiple expressions. We will go through several examples to see lambda functions in action.
In pandas lambda functions live inside the βDataFrame.apply()
β and the βSeries.appy()
β methods. We will use the DataFrame.appy()
method to execute functions along both axes. Letβs have a look at the basics first.
Function Syntax
The DataFrame.apply()
function will execute a function along defined axes of a DataFrame. The functions that we will execute in our examples will work with Series objects passed into our custom functions by the apply()
method. Depending on the axes that we will select, the Series will comprise out of a row or a column or our data frame.
The βfunc
β parameter:
- contains a function applied to a column or a row of the data frame
The βaxis
β parameter:
- is by default set to 0 and will pass a series of column data
- if set to 1 will pass a series of the row data
- can hold values:
- 0 or β
index
β - 1 or β
columns
β
- 0 or β
The βraw
β parameter:
- is a boolean value
- is by default set to
False
- can hold values:
False
-> a Series object is passed to the functionTrue
-> andarray
object is passed to the function
The βresult_type
β parameter:
- can only apply when the axis is 1 or β
columns
β - can hold values:
- β
expand
β βreduceβ
- β
broadcast
β
- β
The βargs()
β parameter:
- additional parameters for the function as tuple
The **kwargs
parameter:
- additional parameters for the function as key-value pairs
Filters
Letβs have a look at filters. They will be very handy as we explore our data.
In this code example, we create a filter named filt_rating
. We select our dataframe and the column overall_rating
. The condition >= 90
returns True
if the value in the overall_rating
column is 90 or above.
Otherwise, the filter returns False
.
filt_rating = df_fifa_soccer_players_subset['overall_rating'] >= 90 print(filt_rating)
The result is a Series object containing the index, and the correlated value of True
or False
.
Letβs apply the filter to our dataframe. We call the .loc
method and pass in the filter’s name as a list item. The filter works like a mask. It covers all rows that have the value False
. The remaining rows match our filter criteria of overall_rating >= 90
.
df_fifa_soccer_players_subset.loc[filt_rating]
Lambda functions
Letβs recreate the same filter using a lambda function. We will call our filter filt_rating_lambda
.
Letβs go over the code. We specify the name of our filter and call our dataframe. Pay attention to the double square brackets. We use them to pass a dataframe and not a Series object to the .appy()
method.
Inside .apply()
we use the keyword βlambda
β to show that we are about to define our anonymous function. The βx
β represents the Series passed into the lambda function.
The series contains the data from the overall_rating
column. After the semicolumn, we use the placeholder x
again. Now we apply a method called ge()
. It represents the same condition we used in our first filter example β>=
β (greater or equal).
We define the integer value 90 and close the brackets on our apply function. The result is a dataframe that contains an index and only one column of boolean values. To convert this dataframe to a Series we use the squeeze()
method.
filt_rating_lambda = df_fifa_soccer_players_subset[['overall_rating']].apply(lambda x:x.ge(90)).squeeze() print(filt_rating_lambda)
Letβs use our filter. Great, we get the same result as in our first filter example.
df_fifa_soccer_players_subset.loc[filt_rating_lambda]
We now want to know how many players our filter returned. Letβs first do it without a lambda function and then use a lambda function to see the same result. We are counting the lines or records.
df_fifa_soccer_players_subset.loc[filt_rating_lambda].count()
df_fifa_soccer_players_subset.apply(lambda x:x.loc[filt_rating_lambda]).count()
Great. Now letβs put us in a place where we actually need to use the apply()
method and a lambda function. We want to use our filter on a grouped data-frame.
Letβs group by nationality to see the distribution of these amazing players. The output will contain all columns. This makes the code easier to read.
df_fifa_soccer_players_subset.groupby('nationality').loc[filt_rating_lambda]
Pandas tells us in this error message that we can not use the βloc
β method on a grouped dataframe object.
Letβs now see how we can solve this problem by using a lambda function. Instead of using the βloc
β function on the grouped dataframe we use the apply()
function. Inside the apply()
function we define our lambda function. Now we use the βloc
β method on the variable βx
β and pass our filter.
df_fifa_soccer_players_subset.groupby('nationality').apply(lambda x:x.loc[filt_rating_lambda])
Axis parameter of the apply() function
Now letβs use the axis
parameter to calculate the Body-Mass-Index (BMI) for these players. Until now we have used the lambda functions on the columns of our data.
The βx
β variable was a representation of the individual column. We set the axis parameter to β1
β. The βx
β variable in our lambda function will now represent the individual rows of our data.
Before we calculate the BMI letβs create a new dataframe and define some columns. We will call our new dataframe βdf_bmi
β.
df_bmi = df_fifa_soccer_players_subset.groupby('nationality')[['age', 'height_cm', 'weight_kgs']].apply(lambda x:x.loc[filt_rating_lambda]) print(df_bmi)
Now letβs reset the index.
df_bmi = df_bmi.reset_index() print(df_bmi)
We calculate the BMI as follows. We divide the weight in kilogram by the square of the height in meters.
Let’s have a closer look at the lambda function. We define the βaxis
β to be β1
β. The βx
β variable now represents a row. We need to use specific values in each row. To define these values, we use the variable βx
β and specify a column name. At the beginning of our code example, we define a new column named βbmi
β. And at the very end, we round the results.
df_bmi['bmi'] = df_bmi.apply(lambda x:x['weight_kgs']/((x['height_cm']/100)**2), axis=1).round() print(df_bmi)
Great! Our custom function worked. The new BMI column contains calculated values.
Conclusion
Congratulations on finishing the tutorial. I wish you many great and small insights for your future data projects. I include the Jupyter-Notebook file, so you can experiment and tweak the code.