💡 Problem Formulation: When working with datasets in Python, it’s common to encounter missing values in your DataFrame columns. This can lead to inaccuracies in your analysis or errors in your code. The goal is to replace missing values with the median of the column as it’s less sensitive to outliers than the mean. For example, given a DataFrame with some NaN values, we want to fill those NaNs with the median value of their respective columns.
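To make the outlier argument concrete, here is a minimal sketch on a made-up Series (the values are purely illustrative, not taken from any real dataset). It compares the mean and the median when a single extreme value is present.

```python
import pandas as pd

# Hypothetical sample: four typical readings and one extreme outlier
values = pd.Series([1, 2, 3, 4, 100])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # the middle value, unaffected by the outlier

print(mean_value)    # 22.0
print(median_value)  # 3.0
```

Filling gaps with 22.0 would put a value into the column that resembles none of the typical readings, while 3.0 sits in the middle of them.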
Method 1: Using DataFrame.fillna() with DataFrame.median()
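This method combines the pandas DataFrame.fillna() function, which fills NA/NaN values with a given value or with a per-column mapping, and DataFrame.median(), which computes the median of each column while skipping NaN values by default. Passing the resulting Series of medians to fillna() fills every column with its own median. Note that in pandas 2.0 and later, median() no longer drops non-numeric columns silently, so a DataFrame with text columns needs median(numeric_only=True). This method is particularly useful when you want a quick and direct way to deal with missing values column by column.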
Here’s an example:
import pandas as pd

# Creating a DataFrame with NaN values
df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [5, None, None, 4, 5],
    'C': [None, 2, 3, 4, 5]
})

# Filling the NaN values with median of the columns
df.fillna(df.median(), inplace=True)
print(df)
Output:
     A    B    C
0  1.0  5.0  3.5
1  2.0  5.0  2.0
2  3.0  5.0  3.0
3  2.5  4.0  4.0
4  5.0  5.0  5.0
This code snippet first creates a DataFrame with some NaN values. Then, it calculates the median for each column and replaces NaN values with these medians. The inplace=True argument updates the original DataFrame without the need to assign it to a new variable.
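Real datasets often mix numeric and text columns. The following is a minimal sketch of the same technique for that case, assuming a hypothetical frame with columns named name, score and age (none of these appear in the original example). It relies on the numeric_only parameter of median() and on fillna() aligning a Series with the columns by label.

```python
import pandas as pd

# Hypothetical frame with one text column alongside two numeric ones
df = pd.DataFrame({
    'name': ['a', 'b', None, 'd'],
    'score': [10.0, None, 30.0, 40.0],
    'age': [20.0, 30.0, None, 50.0],
})

# numeric_only=True restricts the median to the numeric columns
medians = df.median(numeric_only=True)

# fillna() matches the medians to columns by label, so 'name'
# (which has no entry in medians) is left untouched
filled = df.fillna(medians)
print(filled)
```

The missing entry in name stays missing here. A text column would need its own fill strategy, such as the most frequent value.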
Method 2: Using DataFrame.apply() for Selective Column Filling
Using DataFrame.apply() allows you to apply a function along an axis of the DataFrame. When combined with a lambda function that calls fillna() with the column’s own median, it fills the NaN values of each column with that column’s median. Calling apply() on a subset of columns restricts the fill to those columns, so this method gives you more control when you have specific columns to target.
Here’s an example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None, 5],
    'B': [5, None, None, 4, 5],
    'C': [None, None, 3, 4, 5]
})

# Apply a lambda function to fill NaN values with the median
df = df.apply(lambda col: col.fillna(col.median()))
print(df)
Output:
     A    B    C
0  1.0  5.0  4.0
1  3.0  5.0  4.0
2  3.0  5.0  3.0
3  3.0  4.0  4.0
4  5.0  5.0  5.0
In this example, apply() is used to run a lambda function on each column. The lambda function fills NaN values for a column with the median of that column. Since apply() works on a column-by-column basis, it’s convenient for applying column-specific operations.
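The example above fills every column. To fill only some of them, one option is to call apply() on a column subset and assign the result back. This is a minimal sketch, and the choice of columns A and C as targets is an assumption made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, None, 5],
    'B': [5, None, None, 4, 5],
    'C': [None, None, 3, 4, 5],
})

# Hypothetical choice: fill only columns A and C, leave B as it is
target_cols = ['A', 'C']
df[target_cols] = df[target_cols].apply(lambda col: col.fillna(col.median()))
print(df)
```

Column B keeps its two NaN values, while A and C are filled with 3.0 and 4.0 respectively.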
Method 3: Filling with Median using DataFrame.transform()
The DataFrame.transform() method applies a function to a DataFrame or Series and returns an object with the same shape as the input. Combined with a function that fills NaN values with the median, it replaces missing values while leaving room for further same-shape transformations in the same step.
Here’s an example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, 4, 5],
    'B': [5, None, None, 4, 5],
    'C': [None, 2, 3, 4, 5]
})

# Using transform to fill NaN values with the median.
df = df.transform(lambda x: x.fillna(x.median()))
print(df)
Output:
     A    B    C
0  1.0  5.0  3.5
1  3.5  5.0  2.0
2  3.0  5.0  3.0
3  4.0  4.0  4.0
4  5.0  5.0  5.0
This snippet applies the transform() method with a lambda function that fills NaNs with the median value for that series. It’s a versatile approach that not only fills the missing values but can also be extended to include additional transformations seamlessly.
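As a sketch of what such an extension might look like, the function below fills the NaN values and then centers each column on its median in a single transform() call. The helper name fill_then_center and the centering step are illustrative assumptions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, None, 3, 4, 5],
    'B': [5, None, None, 4, 5],
})

def fill_then_center(col):
    """Fill NaNs with the column median, then subtract that median."""
    med = col.median()
    return col.fillna(med) - med

# The result has the same shape as df, as transform() requires
centered = df.transform(fill_then_center)
print(centered)
```

Every filled cell ends up at 0.0 after centering, because it was set to the median that is then subtracted.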
Method 4: Using numpy.where() with DataFrame.isnull()
This method involves using NumPy’s where() function in tandem with pandas’ isnull() to identify and fill NaN values with the median. This approach is useful when specific condition-based replacements are desired, and allows for customized replacement logic.
Here’s an example:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [5, None, 1, 4, 5],
    'C': [None, 2, 3, 4, 5]
})

# Using numpy.where() to fill NaN values with the median
for column in df.columns:
    df[column] = np.where(df[column].isnull(), df[column].median(), df[column])
print(df)
Output:
     A    B    C
0  1.0  5.0  3.5
1  2.0  4.5  2.0
2  3.0  1.0  3.0
3  2.5  4.0  4.0
4  5.0  5.0  5.0
The code iterates over all columns, using np.where() to check for NaN values with df[column].isnull(). When a NaN is found, it is replaced by the median of that column; otherwise, the original value is retained. This method is powerful but can be more verbose compared to others.
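The condition passed to np.where() does not have to be limited to isnull(). The sketch below assumes a hypothetical dataset in which -1 marks a reading that was not recorded, and replaces both NaN and that sentinel with the median of the valid readings. The sentinel convention is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -1 is a sentinel for "not recorded"
s = pd.Series([4.0, -1.0, 6.0, np.nan, 8.0])

# Compute the median over valid (non-negative, non-NaN) readings only
median_valid = s[s >= 0].median()

# Replace NaNs and sentinel values; keep everything else
cleaned = pd.Series(
    np.where(s.isnull() | (s < 0), median_valid, s),
    index=s.index,
)
print(cleaned.tolist())  # [4.0, 6.0, 6.0, 6.0, 8.0]
```

Computing the median before the replacement matters here. Including the -1 entries would pull the median toward the sentinel.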
Bonus One-Liner Method 5: Using DataFrame.where() with notnull()
Another concise technique involves using pandas’ DataFrame.where() in combination with notnull(). This one-liner is an efficient way to maintain the original non-NaN values while replacing the NaNs with column medians directly.
Here’s an example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [5, None, 1, 4, 5],
    'C': [None, 2, 3, 4, 5]
})

# One-liner to fill NaN values with median using DataFrame.where()
df = df.where(df.notnull(), df.median(), axis='columns')
print(df)
Output:
     A    B    C
0  1.0  5.0  3.5
1  2.0  4.5  2.0
2  3.0  1.0  3.0
3  2.5  4.0  4.0
4  5.0  5.0  5.0
This example showcases how DataFrame.where() can be used to apply the median to NaN elements while keeping the original non-NaN values unchanged. The one-liner is elegant and efficient, but it may not be as readable for those unfamiliar with this style of pandas’ operations.
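Readers who find the "keep where True" logic of where() hard to follow may prefer its counterpart, DataFrame.mask(), which replaces values where the condition is True. The following minimal sketch, using the same data as above, checks that both spellings give the same result.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [5, None, 1, 4, 5],
    'C': [None, 2, 3, 4, 5],
})

# where() keeps values where the condition is True and replaces the rest
via_where = df.where(df.notnull(), df.median(), axis='columns')

# mask() is the inverse: it replaces values where the condition is True
via_mask = df.mask(df.isnull(), df.median(), axis='columns')

print(via_where.equals(via_mask))
```

In both calls, axis='columns' tells pandas to match the Series of medians against the column labels rather than the row index.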
Summary/Discussion
- Method 1: DataFrame.fillna() with DataFrame.median(). Quick and straightforward. Limited customization.
- Method 2: Lambda function with DataFrame.apply(). More control over specific column operations. Slightly more complex.
- Method 3: DataFrame.transform() with median. Versatile for additional transformations. Could be overkill for simple replacements.
- Method 4: numpy.where() combined with DataFrame.isnull(). High level of customization potential. More verbose than other methods.
- Method 5: One-liner using DataFrame.where() with notnull(). Efficient and concise. May lack clarity for some users.
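Since all five methods aim at the same result, a quick sanity check is to run them on one DataFrame and compare the outputs. The sketch below does this on the data from Methods 4 and 5. The results dictionary and its keys are just an illustrative way to organize the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [5, None, 1, 4, 5],
    'C': [None, 2, 3, 4, 5],
})
medians = df.median()

results = {
    'fillna': df.fillna(medians),
    'apply': df.apply(lambda col: col.fillna(col.median())),
    'transform': df.transform(lambda col: col.fillna(col.median())),
    'np.where': pd.DataFrame(
        {c: np.where(df[c].isnull(), medians[c], df[c]) for c in df.columns},
        index=df.index,
    ),
    'where': df.where(df.notnull(), medians, axis='columns'),
}

# Every approach should produce the same filled frame
all_equal = all(r.equals(results['fillna']) for r in results.values())
print(all_equal)
```

With the outputs identical, the choice between the methods comes down to readability and how much custom logic the fill needs.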