In this video and blog tutorial, we will learn how to apply a function to a Pandas data frame or series using the apply() function.
With this tool, we can run almost any function over the rows, columns, or individual values of our data and transform it with very little code.
Here’s the syntax from the official documentation:
Syntax
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
func: function
Function to apply to each column or row.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
- 0 or 'index': apply function to each column.
- 1 or 'columns': apply function to each row.
raw: bool, default False
Determines if row or column is passed as a Series or ndarray object:
- False: passes each row or column as a Series to the function.
- True: the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
result_type: {‘expand’, ‘reduce’, ‘broadcast’, None}, default None
These only act when axis=1 (columns):
- 'expand': list-like results will be turned into columns.
- 'reduce': returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.
- 'broadcast': results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behavior (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
args: tuple
Positional arguments to pass to func in addition to the array/series.
**kwargs
Additional keyword arguments to pass as keywords arguments to func.
Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.
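To make the less common parameters more concrete, here is a minimal sketch. The helper names scale_and_shift and min_max, as well as the sample data frame, are made up for illustration and are not part of the Pandas API.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

# Hypothetical helper with one extra positional and one extra keyword argument.
def scale_and_shift(series, factor, offset=0):
    return series * factor + offset

# args supplies the extra positional arguments, **kwargs the keyword ones.
scaled = df.apply(scale_and_shift, args=(10,), offset=1)

# Hypothetical helper that returns a list-like result for each row.
def min_max(row):
    return [row.min(), row.max()]

# Default (result_type=None): each list becomes one element of a Series.
as_series = df.apply(min_max, axis=1)

# result_type="expand": each list is spread over the columns of a DataFrame.
as_frame = df.apply(min_max, axis=1, result_type="expand")

print(scaled)
print(as_series)
print(as_frame)
```

With "expand", the new columns are simply labeled 0 and 1, so in practice you would probably rename them afterwards.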
An Introductory Example
To get started, let’s have a look at an introductory example:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])
This is the resulting DataFrame:
|       | Col 1 | Col 2 | Col 3 |
| Row 1 | 1 | 2 | 3 |
| Row 2 | 4 | 5 | 6 |
| Row 3 | 7 | 8 | 9 |
In the first step, we import Pandas. Then, we create a Pandas data frame filled with values from 1 to 9. The output shows a typical Pandas data frame.
Now, we use the apply() function:
df.apply(sum)
Output:
Col 1    12
Col 2    15
Col 3    18
dtype: int64
The apply() function expects a function to run on our dataset. By passing sum, we state that Python's built-in sum() function should be applied to each column, which apply() hands over as a Series. The output shows the sum for each column and the data type of the output values. As we can see, the data type is int64 because the sums are integer values.
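To see what apply() actually passes to the function, the following sketch records the type and label of each object it receives. The wrapper inspect_and_sum and the list seen are hypothetical names used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

# Collect the type name and the label of every object apply() hands over.
seen = []

def inspect_and_sum(column):
    seen.append((type(column).__name__, column.name))
    return sum(column)  # same built-in sum() as in the example above

result = df.apply(inspect_and_sum)

print(seen)
print(result)
```

Each recorded entry is a Series named after one of the columns, which is why the built-in sum() can simply iterate over the values of that column.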
Defining the Axis Along Which to Apply the Function
In the example above, we did not state along which axis to apply the function, and the output showed the sum for each column. That’s because the optional axis parameter is set to 0 by default, which applies the function to each column.
However, we can set this parameter to 1:
df.apply(sum, axis=1)
Output:
Row 1     6
Row 2    15
Row 3    24
dtype: int64
Here, we do the same as before, but this time, we set the axis parameter to 1. This way, we apply the sum() function to each row instead of each column.
If you don’t like using 1 and 0, you can pass the string "columns" instead of 1 and the string "index" instead of 0:
df.apply(sum, axis="columns")
Output:
Row 1     6
Row 2    15
Row 3    24
dtype: int64
The output is the same as before, when we set the axis parameter to 1. It may be a bit confusing that axis="columns" produces one value per row and not one per column. The reason is that the axis names the direction along which the function runs: with "columns", the function moves across the columns and therefore receives one row at a time. For example, the result for “Row 1” is 6, which was computed by summing the values of columns 1, 2, and 3 in row 1.
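One way to make the axis behavior visible is to return the index labels of whatever apply() passes in. This is only a sketch; the helper labels_seen is a made-up name.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

def labels_seen(series):
    # Join the index labels of the Series that apply() passes in.
    return ", ".join(series.index)

# axis="index" (or 0): the function receives each column, labeled by the rows.
per_column = df.apply(labels_seen, axis="index")

# axis="columns" (or 1): the function receives each row, labeled by the columns.
per_row = df.apply(labels_seen, axis="columns")

print(per_column)
print(per_row)
```

With axis="columns", every row is handed over as a Series whose labels are the column names, which matches the row sums shown above.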
Applying a Built-In Function
As mentioned earlier, the sum() function that we applied to our dataset in the example above is a built-in Python function. There are dozens of other built-in Python functions that we can combine with the apply() function in Pandas, for example, the max() and min() functions:
df.apply(max)
Output:
Col 1    7
Col 2    8
Col 3    9
dtype: int64
df.apply(min)
Result:
Col 1    1
Col 2    2
Col 3    3
dtype: int64
As the names suggest, these functions compute the maximum and minimum values respectively.
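Built-in functions can also be combined inside a small helper before handing it to apply(). The sketch below uses a hypothetical helper value_range together with the built-in len(); neither name comes from the Pandas API.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

def value_range(column):
    # Distance between the largest and smallest value of one column.
    return max(column) - min(column)

ranges = df.apply(value_range)  # one range per column
lengths = df.apply(len)         # number of values in each column

print(ranges)
print(lengths)
```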
In addition to that, we can also use built-in Pandas functions:
df.apply(pd.notnull)
Result:
|       | Col 1 | Col 2 | Col 3 |
| Row 1 | True | True | True |
| Row 2 | True | True | True |
| Row 3 | True | True | True |
Here, we use the notnull() function inside the apply() method. This function detects whether a value exists. In other words, it checks that a value is not a “null” or “NA” value. If the value exists, the result is True. Since our dataset contains integer values exclusively, the output data frame is filled with True values.
We state that we use a built-in Pandas function by adding “pd.” before the function name.
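Because the example data has no gaps, notnull() never returns False there. The following sketch uses a separate, made-up data frame df_missing with one missing value to show the other case.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with one missing value (NaN) in "Col 2".
df_missing = pd.DataFrame([[1, np.nan], [3, 4]],
                          columns=["Col 1", "Col 2"],
                          index=["Row 1", "Row 2"])

# True where a value exists, False where it is missing.
mask = df_missing.apply(pd.notnull)

# Summing the boolean mask counts the existing values per column.
counts = mask.sum()

print(mask)
print(counts)
```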
Similarly, we can use functions from other libraries as well, for example, the NumPy library:
import numpy as np

df.apply(np.sqrt)
Here’s the resulting DataFrame:
|       | Col 1 | Col 2 | Col 3 |
| Row 1 | 1.000000 | 1.414214 | 1.732051 |
| Row 2 | 2.000000 | 2.236068 | 2.449490 |
| Row 3 | 2.645751 | 2.828427 | 3.000000 |
First, we have to import the NumPy library to be able to use the NumPy functions. In this case, we pass the NumPy function sqrt() to the apply() function, which computes the square root of each value in the data frame.
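As a side note, NumPy functions such as sqrt() also accept a data frame directly, and the raw parameter from the syntax section can be combined with a NumPy reduction. The sketch below compares these variants on the example data; it makes no claim about how large any speed difference is.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

# Square root through apply(), column by column.
via_apply = df.apply(np.sqrt)

# NumPy universal functions can also be called on the data frame itself.
direct = np.sqrt(df)

# raw=True passes each column as a plain ndarray instead of a Series.
col_sums = df.apply(np.sum, raw=True)

print(via_apply)
print(direct)
print(col_sums)
```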
🌍 Recommended Tutorial: How to Apply a Function to Each Cell in a Pandas DataFrame?
Applying Custom Functions
Built-in functions are easy to use, and for many use cases there is one that fits our problem perfectly. However, sometimes we want to make very specific calculations for which no built-in function exists. For these cases, we define our own functions to apply.
Let’s say we want to categorize our dataset. If a value is 3 or smaller, it is small; if a value is between 4 and 6, it is normal; and if a value is bigger than 6, it is big.
There is no built-in function to achieve that, so we set up our own:
def categorize(x):
    if x <= 3:
        return "small"
    elif x <= 6:  # reached only when x > 3
        return "normal"
    else:
        return "big"
This function does exactly what we just described. Now, we pass this function to the apply() method. Let’s say we want to categorize column 1:
df["Col 1"].apply(categorize)
Result:
Row 1     small
Row 2    normal
Row 3       big
Name: Col 1, dtype: object
We pass the categorize() function to the apply() method of the column, which is a Series, so the function is called once for each value. The output shows the category for each value in column 1. If we compare that to the initial data frame, we see that the categorization worked as intended.
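The example above categorizes a single column. One possible way to categorize every value of the data frame is to apply a per-column helper that maps the function over each value, as sketched below. The helper categorize_column is a hypothetical name, and other approaches exist.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

def categorize(x):
    if x <= 3:
        return "small"
    elif x <= 6:
        return "normal"
    else:
        return "big"

def categorize_column(column):
    # Series.map calls categorize() once for every value of the column.
    return column.map(categorize)

# apply() hands over each column; the helper categorizes its values.
categories = df.apply(categorize_column)

print(categories)
```

Calling df.apply(categorize) directly would not work here, because categorize() would receive a whole column instead of a single number.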
Using Lambda Functions
Sometimes, a custom function is very short and consists of only one expression. In this case, there is a more concise and convenient way to define the function we apply: a lambda function. A lambda function is a small anonymous function that fits into one line.
Let’s say, we want to add “2” to each value in our data frame. We could apply a regular Python function like this:
def plus2(x):
    return x + 2
And then, pass this function to our apply() function:
df.apply(plus2)
Result:
|       | Col 1 | Col 2 | Col 3 |
| Row 1 | 3 | 4 | 5 |
| Row 2 | 6 | 7 | 8 |
| Row 3 | 9 | 10 | 11 |
This approach works, but it is unnecessarily long. We achieve the same result with just one line using a lambda function:
df.apply(lambda x: x+2)
The result:
|       | Col 1 | Col 2 | Col 3 |
| Row 1 | 3 | 4 | 5 |
| Row 2 | 6 | 7 | 8 |
| Row 3 | 9 | 10 | 11 |
A lambda function is structured like this:
lambda arguments: expression
It starts with the keyword lambda, followed by the arguments and then the expression. Note that the body is exactly one expression: statements such as if/elif/else blocks, loops, or assignments are not allowed inside a lambda. That’s why the categorize() function we created before cannot be copied into a lambda as written, because it is built from if/elif/else statements. It could be squeezed into nested conditional expressions, but a named function is easier to read.
However, in this use case where we only add 2 to each value, a lambda function is all we need and is the more elegant choice.
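For completeness, here is a sketch of a few lambda variants on the example data. It also shows that a branching rule like the small/normal/big one can still be written as a single conditional expression, although readability suffers compared to a named function.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["Col 1", "Col 2", "Col 3"],
                  index=["Row 1", "Row 2", "Row 3"])

# Add 2 to every value; x is a whole column here, so the addition is vectorized.
plus_two = df.apply(lambda x: x + 2)

# A conditional expression is still a single expression, so it fits in a lambda.
labels = df["Col 1"].apply(
    lambda x: "small" if x <= 3 else ("normal" if x <= 6 else "big")
)

# Lambdas work along rows as well: spread between the largest and smallest value.
row_spread = df.apply(lambda row: row.max() - row.min(), axis=1)

print(plus_two)
print(labels)
print(row_spread)
```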
Summary
All in all, the apply() function is an essential tool when working with Pandas. It allows us to run almost any function on our data frames, whether it is a built-in function from Python, Pandas, NumPy, or another Python library, or a custom function. Additionally, we can apply these functions along either axis, which gives us even more ways to perform calculations and analyze our data.
Happy Coding!