5 Best Ways to Fill NaN with Linear Interpolation in Python’s Pandas

💡 Problem Formulation: When working with datasets in Pandas, missing values can appear as NaN (Not a Number) and may hinder statistical analysis or visualizations. An effective way to address this is by filling these NaN values using linear interpolation, where the gaps are filled with values that form a straight line between the available data points. The input is a Pandas DataFrame with NaN values, and the desired output is the same DataFrame with NaN values filled by linearly interpolated data.

Method 1: Basic Interpolation Using interpolate()

In Pandas, the interpolate() method provides a quick way to perform linear interpolation. It works on a Series or DataFrame object and fills NaN values according to the method specified. Called without arguments, it uses method='linear', which treats the values as equally spaced and ignores the index.

Here’s an example:

import pandas as pd

# Creating a DataFrame with NaN values
df = pd.DataFrame({'A': [1, None, None, 4]})
# Applying linear interpolation
df['A'] = df['A'].interpolate()

print(df)

Output:

     A
0  1.0
1  2.0
2  3.0
3  4.0

This code snippet creates a DataFrame with a column ‘A’ containing NaN values. By calling df['A'].interpolate(), it fills the NaN positions with linearly spaced values between the existing numbers, resulting in a continuous sequence.
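One detail worth knowing is that the default method='linear' works by position and does not look at the index values. The sketch below uses an illustrative Series named s (not part of the example above) to contrast it with method='index', which uses the index as the x-coordinate.

```python
import numpy as np
import pandas as pd

# Illustrative data: the index jumps from 1 to 4, so the rows are unevenly spaced
s = pd.Series([0.0, np.nan, 40.0], index=[0, 1, 4])

by_position = s.interpolate()             # default 'linear': ignores the index
by_index = s.interpolate(method='index')  # uses the index values as x-coordinates

print(by_position[1])  # 20.0, halfway between the two neighbours
print(by_index[1])     # 10.0, one quarter of the way from index 0 to index 4
```

If the index carries real spacing information, such as a depth or a distance, method='index' is usually the closer match to a straight line through the data.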

Method 2: Interpolation with a Limit

The interpolate() method can also be constrained with a limit, which sets the maximum number of consecutive NaN values that will be filled in each gap. This is useful when short gaps can safely be interpolated but long runs of missing data should stay visible as NaN.

Here’s an example:

import pandas as pd

# DataFrame with many NaN values
df = pd.DataFrame({'A': [1, None, None, 4, None, None, 7, 8]})
# Applying linear interpolation with a limit
df['A'] = df['A'].interpolate(limit=1)

print(df)

Output:

     A
0  1.0
1  2.0
2  NaN
3  4.0
4  5.0
5  NaN
6  7.0
7  8.0

This example demonstrates linear interpolation with a limit. The limit=1 argument makes interpolate() fill at most one consecutive NaN in each gap, working forward from the last valid entry, so the second NaN in each gap is left untouched.
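The limit works together with the limit_direction parameter, which controls the side from which each gap is filled. The following is a minimal sketch on an illustrative Series with a gap of three NaN values; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a gap of three NaN values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Default direction is 'forward': only the NaN right after 1.0 is filled
forward = s.interpolate(limit=1)
# 'both' fills one NaN from each end of the gap, leaving the middle one
both = s.interpolate(limit=1, limit_direction='both')

print(forward.tolist())  # [1.0, 2.0, nan, nan, 5.0]
print(both.tolist())     # [1.0, 2.0, nan, 4.0, 5.0]
```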

Method 3: Specifying the Interpolation Axis

By default, interpolation operates down each column (axis=0). If the related values sit side by side in a row, for example one column per measurement time, you can interpolate horizontally across the columns by passing axis=1.

Here’s an example:

import pandas as pd

# DataFrame with NaN values across multiple columns
df = pd.DataFrame({'A': [1, 2], 'B': [None, 4]})
# Applying linear interpolation across columns
df = df.interpolate(axis=1)

print(df)

Output:

     A    B
0  1.0  1.0
1  2.0  4.0

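This code snippet shows how to apply interpolation across columns (horizontally) by setting axis=1. Note that in row 0 the NaN in column ‘B’ has no valid value to its right, so there is no line to draw: with the default settings, Pandas fills such a trailing NaN by carrying the last valid value (1.0) forward. A value is placed on a straight line only when the NaN lies between two valid entries in the same row.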
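To see an actual straight-line fill along a row, the missing column needs a valid neighbour on each side. The sketch below assumes an illustrative three-column frame that is not part of the example above.

```python
import numpy as np
import pandas as pd

# Illustrative wide table: column 'B' is missing and sits between 'A' and 'C'
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [np.nan, np.nan], 'C': [3.0, 6.0]})

# Interpolate along each row, using the column order as the spacing
filled = df.interpolate(axis=1)

print(filled['B'].tolist())  # [2.0, 4.0], the midpoints of A and C in each row
```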

Method 4: Interpolation with Different Methods

While linear is the default, Pandas’ interpolate() method also supports various other interpolation methods. For instance, you might choose ‘quadratic’, ‘cubic’, or other polynomial or spline interpolation, depending on your dataset’s nature. These methods are delegated to SciPy, so SciPy must be installed to use them.

Here’s an example:

import pandas as pd

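# DataFrame with one NaN value; a cubic fit needs at least four known points
df = pd.DataFrame({'A': [1, 4, None, 16, 25]})
# Applying cubic interpolation (requires SciPy)
df['A'] = df['A'].interpolate(method='cubic')

print(df)

Output:

      A
0   1.0
1   4.0
2   9.0
3  16.0
4  25.0

This code snippet uses cubic interpolation, which is more suitable for data that changes at a non-linear rate. The method='cubic' argument tells Pandas to fit a cubic curve through the known points instead of joining them with straight lines. The four known values here are perfect squares, so the curve passes through 9.0 at index 2, whereas linear interpolation would have given 10.0. With fewer than four known points a cubic fit is not defined and the call raises an error.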
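Not every alternative method needs SciPy or a curved fit. For data indexed by timestamps, method='time' stays linear but spaces the fill by the actual time between observations. The sketch below uses made-up dates and readings purely for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative readings taken on 1, 2 and 5 January (unevenly spaced in time)
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
s = pd.Series([0.0, np.nan, 40.0], index=dates)

# Linear in elapsed time rather than in row position
filled = s.interpolate(method='time')

# 2 January is one day into a four-day span, so the fill is about 10.0
print(filled.iloc[1])
```

The default method='linear' would have returned 20.0 here, because it only sees three equally spaced rows.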

Bonus One-Liner Method 5: Using apply() for Row-Wise Interpolation

If you need to interpolate based on values in each row independently, you can use the apply() function across the DataFrame along with the interpolate() method.

Here’s an example:

import pandas as pd

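# DataFrame with NaN values in different rows
df = pd.DataFrame({'A': [1, None, 3], 'B': [1, 2, None]})
# Applying row-wise linear interpolation; limit_direction='both' is needed
# so that a NaN at the start of a row (row 1, column 'A') is filled too
df = df.apply(lambda x: x.interpolate(limit_direction='both'), axis=1)

print(df)

Output:

     A    B
0  1.0  1.0
1  2.0  2.0
2  3.0  3.0

This one-liner demonstrates row-wise filling, where each row’s NaN is filled using only that row’s data points. It is achieved by applying a lambda function that calls interpolate() for each row. Without limit_direction='both', the default forward direction would leave the NaN in row 1 unfilled, because it comes before the first valid value. With only two columns, each NaN sits at the edge of its row, so it simply receives the nearest valid value; a straight-line fill requires a valid value on both sides.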
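Row-wise filling does not strictly require apply(). As a sketch, the comparison below runs both approaches on an illustrative three-column frame (the data and variable names are assumptions for this example) so the results can be checked against each other.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps at the start, middle and end of different rows
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, 2.0, 4.0],
    'C': [3.0, 6.0, np.nan],
})

# Row by row through a Python lambda
via_apply = df.apply(lambda row: row.interpolate(limit_direction='both'), axis=1)
# Same operation expressed as a single DataFrame call
vectorized = df.interpolate(axis=1, limit_direction='both')

print(vectorized)
print(np.allclose(via_apply.to_numpy(), vectorized.to_numpy()))
```

Row 0 gets a true midpoint (2.0 between 1.0 and 3.0), while the edge gaps in rows 1 and 2 take the nearest valid value in their row.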

Summary/Discussion

  • Method 1: Basic Interpolation. Simple to use, good default behavior. May not be suitable for complex datasets with special interpolation requirements.
  • Method 2: Interpolation with a limit. Allows for partial gap filling. May leave out necessary interpolations if the limit is too low.
  • Method 3: Specifying Axis. Gives control over row/column interpolation. Requires understanding of data structure and axis parameter.
  • Method 4: Different Methods. Provides flexibility with multiple interpolation types. Choosing the wrong method may lead to inaccurate filling.
  • Method 5: Row-Wise with apply(). Useful for certain data arrangements. Calls a Python function once per row, so it can be noticeably slower than df.interpolate(axis=1) on large datasets.