💡 Problem Formulation: Data scientists often deal with missing values within datasets. In Python’s pandas library, these are represented as NaN values. To make a dataset complete for analysis, one common technique is to interpolate these missing values based on surrounding data. This article demonstrates five methods to perform interpolation of NaN values using the pandas library, starting from a DataFrame with missing values as the input and aiming for a DataFrame with the NaN values filled as the output.
Method 1: Linear Interpolation
Linear interpolation is the default method used by pandas for interpolating missing values. It computes the new value using a linear function, which essentially draws a straight line between the two points directly before and after the missing value. The syntax is DataFrame.interpolate(method='linear').
Here’s an example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3]})
print(df.interpolate())
Output:
     A
0  1.0
1  2.0
2  3.0
This code initializes a pandas DataFrame with an intentional NaN value and then uses the interpolate() function to estimate the missing value. In this case, the function calculates the midpoint (2.0) between the surrounding numeric values (1.0 and 3.0).
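One detail is easier to see in code: the 'linear' method ignores the index and treats the rows as equally spaced. The sketch below uses a made-up Series with an uneven numeric index (an assumption for illustration only) and compares 'linear' with method='index', which uses the index values as x-coordinates.

```python
import numpy as np
import pandas as pd

# Illustrative data: the index labels are unevenly spaced (0, 1, 10)
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and treats the rows as equally spaced
linear_filled = s.interpolate(method='linear')

# 'index' uses the numeric index values as the x-coordinates
index_filled = s.interpolate(method='index')

print(linear_filled.loc[1])  # 5.0, halfway between the neighboring rows
print(index_filled.loc[1])   # 1.0, since label 1 is one tenth of the way from 0 to 10
```

If the index carries real spacing information, method='index' (or method='time' for dates) is usually the closer match to the data.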
Method 2: Polynomial Interpolation
Polynomial interpolation fits a polynomial of a specified degree to the data, which can capture curved trends better than linear interpolation. The method is invoked via DataFrame.interpolate(method='polynomial', order=n), where n is the polynomial degree. This method requires SciPy to be installed.
Here’s an example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 4]})
print(df.interpolate(method='polynomial', order=2))
Output:
     A
0  1.0
1  2.0
2  3.0
3  4.0
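By setting the order argument to 2, we tell pandas to use a quadratic polynomial to estimate the missing value. Because the known values in this example (1, 3, and 4 at positions 0, 2, and 3) lie on a straight line, the quadratic fit returns 2.0, the same result as linear interpolation. The two methods only diverge when the data are curved.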
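To see where a quadratic fit differs from a straight line, the sketch below uses made-up values that lie on y = (x + 1)² and cross-checks the estimate by hand with numpy.polyfit. It is an independent check under that assumption, not a description of how pandas computes the result internally.

```python
import numpy as np
import pandas as pd

# Illustrative data: known values lie on y = (x + 1)**2, position 1 is missing
df = pd.DataFrame({'A': [1.0, np.nan, 9.0, 16.0]})

known = df['A'].dropna()
# Fit a degree-2 polynomial through the three known points
coeffs = np.polyfit(known.index.to_numpy(dtype=float), known.to_numpy(), deg=2)
quadratic_estimate = np.polyval(coeffs, 1.0)

# Default linear interpolation for comparison
linear_estimate = df['A'].interpolate().loc[1]

print(round(float(quadratic_estimate), 6))  # 4.0, on the curve
print(linear_estimate)                      # 5.0, on the straight line from 1 to 9
```

With SciPy installed, df.interpolate(method='polynomial', order=2) should give approximately the same value at position 1.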
Method 3: Time Series Interpolation
When working with time-series data, the ‘time’ method interpolates missing values according to the actual time spacing of the index. It requires the DataFrame index to be a DatetimeIndex. Use DataFrame.interpolate(method='time').
Here’s an example:
import pandas as pd
import numpy as np
time_index = pd.date_range('20200101', periods=3, freq='D')
df = pd.DataFrame({'A': [1, np.nan, 3]}, index=time_index)
print(df.interpolate(method='time'))
Output:
              A
2020-01-01  1.0
2020-01-02  2.0
2020-01-03  3.0
This example shows interpolation of a time series where the missing value is estimated based on the index’s time intervals. It fills the NaN such that the value corresponds proportionately to the time gap between non-NaN entries.
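With evenly spaced dates, ‘time’ and ‘linear’ give the same answer, so the difference only shows up when the gaps are uneven. The sketch below uses made-up dates (one day before the missing value, three days after it) to illustrate this.

```python
import numpy as np
import pandas as pd

# Illustrative data: one day before the gap, three days after it
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05'])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# 'time' weights the estimate by the elapsed time between timestamps
time_filled = s.interpolate(method='time')

# 'linear' ignores the timestamps and takes the midpoint of the neighbors
linear_filled = s.interpolate(method='linear')

print(round(float(time_filled.iloc[1]), 6))    # 1.0, one quarter of the way through the 4-day span
print(round(float(linear_filled.iloc[1]), 6))  # 2.0, the plain midpoint
```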
Method 4: Spline Interpolation
Spline interpolation uses piecewise polynomials (splines) to fill missing values and can be better for smoothing the dataset. It can be used by calling DataFrame.interpolate(method='spline', order=n), where n is the degree of the spline. This method requires SciPy to be installed.
Here’s an example:
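import pandas as pd
# A cubic spline (order=3) needs at least four known values
df = pd.DataFrame({'A': [0, 1, None, 27, 64]})
print(df.interpolate(method='spline', order=3))
Output:
      A
0   0.0
1   1.0
2   8.0
3  27.0
4  64.0
Here, the spline interpolates with a cubic function (since the order is set to 3). The spline fit needs more known values than the spline order, so with only three non-NaN entries the call with order=3 raises an error instead of returning a result. The four known values in this example lie on y = x³, so the fitted curve estimates the missing value at position 2 as 8.0 (up to floating-point rounding).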
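One practical constraint is worth checking before calling the spline method: SciPy's UnivariateSpline, which pandas wraps for method='spline', accepts orders 1 through 5 and needs more known values than the order. The helper below, feasible_spline_order, is a hypothetical name introduced for this sketch; it caps a requested order based on how many non-NaN values a Series has.

```python
import numpy as np
import pandas as pd

def feasible_spline_order(series, requested=3):
    """Cap the requested spline order so it stays below the number of known values.

    Hypothetical helper for this sketch; returns None if no spline can be fitted.
    """
    n_known = int(series.notna().sum())
    order = min(requested, n_known - 1, 5)  # SciPy accepts orders 1 through 5
    return order if order >= 1 else None

three_known = pd.Series([1.0, np.nan, 3.0, 10.0])
four_known = pd.Series([0.0, 1.0, np.nan, 27.0, 64.0])

print(feasible_spline_order(three_known))  # 2, a cubic would need four known values
print(feasible_spline_order(four_known))   # 3
```

The returned value can then be passed as the order argument of interpolate(method='spline', order=...).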
Bonus One-Liner Method 5: Nearest-Neighbor Interpolation
Nearest-neighbor interpolation is a quick method that fills missing values with the closest non-null value. It is used via DataFrame.interpolate(method='nearest'). This method requires SciPy to be installed.
Here’s an example:
import pandas as pd
df = pd.DataFrame({'A': [3, None, None, 2]})
print(df.interpolate(method='nearest'))
Output:
     A
0  3.0
1  3.0
2  2.0
3  2.0
The interpolate() function with ‘nearest’ fills each NaN with the value from the nearest non-NaN entry. In this DataFrame, NaNs are filled with 3 and 2, respectively, mirroring the closest existing values.
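The idea behind nearest-neighbor filling can be sketched by hand with NumPy, without SciPy. The function fill_nearest below is a hypothetical helper written for this illustration, not the pandas implementation; in particular, its tie-breaking (the earlier neighbor wins) and its filling of leading or trailing NaNs are choices of this sketch and may differ from what pandas does.

```python
import numpy as np
import pandas as pd

def fill_nearest(series):
    """Fill each NaN with the value at the closest known position.

    Hypothetical helper for illustration; ties pick the earlier neighbor.
    """
    values = series.to_numpy(dtype=float)
    positions = np.arange(len(values))
    known = positions[~np.isnan(values)]
    if known.size == 0:
        return series.copy()  # nothing to copy from
    filled = values.copy()
    for pos in positions[np.isnan(values)]:
        # argmin returns the first minimum, so a tie picks the earlier neighbor
        nearest = known[np.argmin(np.abs(known - pos))]
        filled[pos] = values[nearest]
    return pd.Series(filled, index=series.index)

s = pd.Series([3, None, None, 2])
print(fill_nearest(s).tolist())  # [3.0, 3.0, 2.0, 2.0]
```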
Summary/Discussion
- Method 1: Linear interpolation. It’s straightforward and fast. It ignores the index and treats values as equally spaced, so it suits evenly spaced data. However, it may oversimplify the data, especially if the underlying trend is not linear.
- Method 2: Polynomial interpolation. Good for capturing nonlinear trends. The degree of the polynomial can be adjusted for different curves. However, higher-order polynomials can overfit and oscillate erratically between known data points.
- Method 3: Time series interpolation. Ideal for time-series data because it accounts for uneven temporal gaps. Most effective when the trend is time-dependent. It requires a DatetimeIndex, so it is not applicable to data not indexed by time.
- Method 4: Spline interpolation. Offers a smooth curve that fits the data well. Suitable for functions with fluctuating trends. But it can be complex and computationally intensive for large datasets or high spline orders.
- Bonus Method 5: Nearest-neighbor interpolation. Simple and quick. Best used when neighboring values are similar. However, it copies existing values rather than estimating intermediate ones, so the result is step-like and might not reflect the true underlying trend.