5 Best Ways to Fill NaN Values with Mean in Pandas

💡 Problem Formulation: Missing values are a common challenge when working with data in Python using the pandas library. The task here is to replace missing values, represented as NaN, with the mean of the non-missing values in the same column. For instance, given a pandas DataFrame with some NaN values, the goal is to fill each NaN with the mean computed from the remaining values in its column.

Method 1: Using fillna() with mean()

In pandas, the fillna() function is used to fill missing values, and the mean() function calculates the mean of a series while skipping NaN. This method involves calculating the mean of each column and then calling fillna() with these means.

Here's an example:

import pandas as pd

# Sample DataFrame with NaN values
df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3]
})

# Calculate the mean of each column and fill NaNs
df.fillna(df.mean(), inplace=True)
print(df)

Output:

          A    B
0  1.000000  2.0
1  2.000000  2.5
2  2.333333  2.5
3  4.000000  3.0

This code creates a pandas DataFrame with some missing values. Using df.mean(), we compute the mean while ignoring NaN values. The fillna() method is then called on the DataFrame, passing the mean values and updating the DataFrame in place, which replaces all NaNs with the calculated mean of their respective columns.
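If mutating the DataFrame in place is not wanted, the same fill can be written as a plain assignment. The sketch below is a minimal variation under stated assumptions: the `label` column and the names `column_means` and `filled` are hypothetical and not part of the example above. It passes `numeric_only=True` so that `mean()` only considers numeric columns when text columns are present.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3],
    'label': ['w', 'x', 'y', 'z'],  # hypothetical text column
})

# Per-column means of the numeric columns only; NaN values are skipped
column_means = df.mean(numeric_only=True)

# fillna() matches the Series index against column names and
# returns a new DataFrame, leaving df itself unchanged
filled = df.fillna(column_means)

print(column_means)
print(filled)
```

Because `fillna()` returns a new object here, the original `df` still contains its NaN values and can be compared against `filled`.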

Method 2: Apply lambda function

This approach utilizes the apply() function and a lambda expression to replace NaN values with the mean of their respective column. This can provide additional flexibility or allow for chaining of operations.

Here's an example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3]
})

df = df.apply(lambda x: x.fillna(x.mean()))
print(df)

Output:

          A    B
0  1.000000  2.0
1  2.000000  2.5
2  2.333333  2.5
3  4.000000  3.0

The code makes use of the apply() function which applies a function along an axis of the DataFrame. A lambda function is passed as an argument that uses fillna() on each column, filling NaNs with the mean of that specific column.
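As one illustration of the flexibility mentioned above, the lambda can decide per column whether to fill at all. The following sketch assumes a DataFrame that also holds a text column `C` (hypothetical here) and uses `is_numeric_dtype` from `pandas.api.types` to leave non-numeric columns untouched.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3],
    'C': ['foo', None, 'baz', 'qux'],  # hypothetical text column with a gap
})

# Fill numeric columns with their own mean; return other columns as-is
filled = df.apply(
    lambda col: col.fillna(col.mean()) if is_numeric_dtype(col) else col
)

print(filled)
```

The missing entry in `C` stays missing, since a mean is not defined for text data.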

Method 3: Filling NaN with Mean for Selected Columns

It's not always desirable to fill NaN values across all columns. This method focuses on selecting specific columns for filling NaN values with their mean.

Here's an example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3],
    'C': ['foo', 'bar', 'baz', 'qux']
})

# Selected columns
columns_to_fill = ['A', 'B']

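# Loop over chosen columns and fill NaN with mean.
# The result is assigned back because fillna(inplace=True) on df[column]
# is chained assignment and may not update df in recent pandas versions.
for column in columns_to_fill:
    df[column] = df[column].fillna(df[column].mean())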

print(df)

Output:

          A    B    C
0  1.000000  2.0  foo
1  2.000000  2.5  bar
2  2.333333  2.5  baz
3  4.000000  3.0  qux

In this snippet, specific columns are first identified. A for-loop is then used to iterate through these columns, replacing NaN values with the mean for each column individually without affecting other non-specified columns.
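The per-column loop can also be expressed as a single assignment over the selected columns. This is a minimal sketch of that variant, reusing the same sample data and the `columns_to_fill` list from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3],
    'C': ['foo', 'bar', 'baz', 'qux']
})

columns_to_fill = ['A', 'B']

# Compute the means of the selected columns once, fill the subset,
# and assign the filled subset back into the original DataFrame
df[columns_to_fill] = df[columns_to_fill].fillna(df[columns_to_fill].mean())

print(df)
```

Column `C` is never touched because it is not part of the selected subset.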

Method 4: Using SimpleImputer from Scikit-learn

For those who prefer utilizing machine learning libraries, Scikit-learn's SimpleImputer can be used to fill in NaN values. This method is particularly useful as it integrates well with Scikit-learn's pipeline and model training processes.

Here's an example:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3]
})

imputer = SimpleImputer(strategy='mean')
df.iloc[:, :] = imputer.fit_transform(df)

print(df)

Output:

          A    B
0  1.000000  2.0
1  2.000000  2.5
2  2.333333  2.5
3  4.000000  3.0

After importing SimpleImputer from Scikit-learn, an imputer object is created with a strategy to replace missing values using the mean. The fit_transform() method applies this strategy to the DataFrame; assigning the result back through df.iloc[:, :] updates the DataFrame in place, substituting …
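Since fit_transform() returns a NumPy array by default, another option is to build a new DataFrame from that array and pass the original column labels and index explicitly. The sketch below assumes that approach; the name `imputed` is illustrative only.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [2, None, None, 3]
})

imputer = SimpleImputer(strategy='mean')

# Wrap the returned array in a DataFrame, carrying over the
# column names and row index from the original data
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
```

This leaves `df` unchanged, which can be convenient when the raw data is still needed later.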