**💡 Problem Formulation:** When analyzing data in Python with pandas, you may encounter missing values or NaNs within your dataset. The goal is to fill these NaNs by predicting their values based on the existing, non-missing data. Polynomial interpolation provides a way to estimate these missing values by fitting a polynomial to the known data points and using it to compute the NaN values. For example, given a pandas Series with NaNs, we want to generate a complete series where NaNs have been replaced with values estimated from a polynomial fitted to the non-NaN data.
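To make the setup concrete, here is a minimal sketch of the split that every approach relies on: separating the known points from the positions that need an estimate. The Series values and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: a Series with two gaps
s = pd.Series([1.0, np.nan, 9.0, 16.0, np.nan, 36.0])

# Boolean mask marking the positions that need an estimate
missing = s.isna()

# Known points (x from the index, y from the values) used for fitting
x_known = s.index[~missing].to_numpy()
y_known = s[~missing].to_numpy()

# Positions where the fitted polynomial will be evaluated
x_missing = s.index[missing].to_numpy()

print("known x:  ", x_known)
print("known y:  ", y_known)
print("missing x:", x_missing)
```

The polynomial is fitted to `(x_known, y_known)` and evaluated at `x_missing`; the known values themselves stay untouched.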

## Method 1: Using `pandas.Series.interpolate` with Polynomial Order

pandas can fill NaNs with polynomial interpolation directly through its `Series.interpolate` method. Pass `method='polynomial'` and specify the order of the polynomial with the `order` parameter. This method relies on SciPy under the hood and uses the index values as x-coordinates, so it works best when your data is in a Series with a numeric, properly ordered index.

Here’s an example:

```python
import numpy as np
import pandas as pd

# Create a pandas Series with NaN values
s = pd.Series([1, 3, np.nan, np.nan, 7, 10])

# Fill NaNs using 2nd-order polynomial interpolation (requires SciPy)
s_interpolated = s.interpolate(method='polynomial', order=2)
print(s_interpolated)
```

The output will be a Series with the NaNs replaced with the values estimated by a 2nd-order polynomial interpolation based on the surrounding non-NaN values.

This method takes advantage of pandas’ built-in functionality to efficiently perform the interpolation in just a few lines of code. It’s suitable for simple cases where an entire column requires NaN filling and is easy to use for beginners.
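One detail that is easier to see in code: `method='polynomial'` takes its x-coordinates from the index values, not from the row positions. The sketch below assumes SciPy is installed and uses a made-up Series sampled from y = x² on an unevenly spaced index; the names `s_uneven` and `filled_uneven` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = x**2 sampled at x = 0, 1, 4, with a gap at x = 2
s_uneven = pd.Series([0.0, 1.0, np.nan, 16.0], index=[0, 1, 2, 4])

# The index values (0, 1, 2, 4) serve as x-coordinates for the fit
filled_uneven = s_uneven.interpolate(method='polynomial', order=2)

# For comparison: 'linear' ignores the index and treats rows as equally spaced
filled_linear = s_uneven.interpolate(method='linear')

print(filled_uneven)
print(filled_linear)
```

Because the three known points lie on a parabola, the polynomial estimate at index 2 should land close to 4, while the linear fill simply takes the midpoint of its two neighbours.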

## Method 2: Utilizing `scipy.interpolate` for More Complex Interpolations

The `scipy.interpolate` module provides a variety of interpolation methods, including polynomial interpolation. This method gives you more flexibility and provides options for more complex interpolations beyond those natively supported by pandas.

Here’s an example:

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Create a pandas DataFrame with NaN values in 'y'
df = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'y': [1, 3, np.nan, np.nan, 10]})

# Build a quadratic interpolant from the rows where 'y' is known
non_nan = df.dropna()
f = interp1d(non_nan['x'], non_nan['y'], kind='quadratic')

# Evaluate the interpolant only where 'y' is missing
df['y_interpolated'] = df.apply(
    lambda row: float(f(row['x'])) if pd.isnull(row['y']) else row['y'],
    axis=1,
)
```

The output of this code will be a DataFrame with an additional column, ‘y_interpolated’, where NaNs have been filled with the evaluated polynomial function values.

This snippet demonstrates using SciPy to perform a more customizable interpolation than pandas’ built-in method. It is beneficial when you need more control over how the interpolation is calculated or need to apply more complex logic.
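Note that `interp1d` with `kind='quadratic'` builds a quadratic spline, which becomes piecewise once there are more than three known points. If a single polynomial through all known points is what you want, one option in the same module is `BarycentricInterpolator`. The following is a minimal sketch under that assumption, reusing the sample data; `df_bary` and `y_filled` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import BarycentricInterpolator

# Same sample data as the example in this section
df_bary = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'y': [1, 3, np.nan, np.nan, 10]})

# Build one polynomial that passes through every known (x, y) point
known = df_bary.dropna()
poly = BarycentricInterpolator(
    known['x'].to_numpy(dtype=float),
    known['y'].to_numpy(dtype=float),
)

# Evaluate at every x, then use the estimates only where 'y' is missing
estimates = pd.Series(poly(df_bary['x'].to_numpy(dtype=float)), index=df_bary.index)
df_bary['y_filled'] = df_bary['y'].fillna(estimates)
print(df_bary)
```

With many known points, a single high-degree polynomial can oscillate strongly between them, so this variant is best kept to small sets of points.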

## Method 3: Leveraging the `numpy.polyfit` Function

For those who prefer working with NumPy, the `numpy.polyfit` function performs a least-squares polynomial fit that can be used for interpolation purposes. Once you have the polynomial's coefficients, you can evaluate it at the desired points to estimate the missing values.

Here’s an example:

```python
import numpy as np
import pandas as pd

# Create a pandas DataFrame with NaN values
df = pd.DataFrame({'x': np.arange(5), 'y': [2, np.nan, np.nan, 8, 10]})

# Keep only the rows where 'y' is known and fit a polynomial of degree 2
known = df['y'].notna()
x_non_nan = df.loc[known, 'x']
y_non_nan = df.loc[known, 'y']
coefficients = np.polyfit(x_non_nan, y_non_nan, 2)

# Define the polynomial function
poly = np.poly1d(coefficients)

# Fill NaNs with the polynomial's values; known values are kept
df['y'] = df['y'].combine_first(pd.Series(poly(df['x']), index=df.index))
```

The output will be the original DataFrame but with the ‘y’ values filled in where NaNs were present, using the polynomial defined by `poly`.

This code uses NumPy to create a polynomial function that fits the non-missing data points, which is then used to fill in the missing values. This method gives users the ability to handle the interpolation manually and fine-tune the polynomial’s behaviour.
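Because `np.polyfit` is a least-squares fit, the polynomial only passes exactly through the known points when the degree is high enough for them. A small sketch of how you might check this, using hypothetical points sampled from y = x² (`x_known`, `y_known`, and `max_residual` are illustrative names):

```python
import numpy as np

# Hypothetical known points sampled from y = x**2
x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([0.0, 1.0, 4.0, 9.0])

# Largest absolute residual at the known points, per polynomial degree
max_residual = {}
for degree in (1, 2):
    coeffs = np.polyfit(x_known, y_known, degree)
    residuals = y_known - np.polyval(coeffs, x_known)
    max_residual[degree] = float(np.abs(residuals).max())

print(max_residual)
```

A clearly non-zero residual (as for degree 1 here) means the fitted curve smooths over the known data rather than interpolating it, which affects how much you should trust the filled values.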

## Method 4: Custom Interpolation Function with `pandas.apply`

Sometimes the missing data isn’t evenly spaced, or the dataset is too complex for generic interpolation functions. In such cases, creating a custom interpolation function and applying it with pandas’ `apply` may provide a better data-specific solution.

Here’s an example:

```python
import numpy as np
import pandas as pd

# Define your custom interpolation function
def custom_interpolation(val, x_points, y_points, degree=2):
    # One possible implementation: fit a polynomial to the known points
    # (the degree is capped so the fit stays well-defined) and evaluate it.
    degree = min(degree, len(x_points) - 1)
    return np.polyval(np.polyfit(x_points, y_points, degree), val)

# Sample data with NaN values
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [np.nan, 2, np.nan, 4, np.nan]})

# Points without NaNs
known = df['y'].notna()
x_points = df.loc[known, 'x']
y_points = df.loc[known, 'y']

# Apply the custom interpolation only to rows where 'y' is NaN
df['y'] = df.apply(
    lambda row: custom_interpolation(row['x'], x_points, y_points)
    if pd.isna(row['y']) else row['y'],
    axis=1,
)
```

The output will be the DataFrame but with the ‘y’ column’s NaN values filled in using the custom interpolation logic defined in `custom_interpolation`.

This method may involve a more complex implementation but offers the highest degree of customization. Depending on the implementation of the `custom_interpolation` function, it can be highly effective at accurately estimating missing data values in difficult datasets.
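As an illustration of the kind of data-specific logic a custom function can carry, the sketch below fits a separate low-degree polynomial around each gap, using only the nearest known points. This is one possible design, not a standard pandas feature; the function name `fill_with_local_polynomial` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_local_polynomial(df, x_col, y_col, degree=2):
    """Fill NaNs in y_col using a polynomial through the nearest known points."""
    known = df.dropna(subset=[y_col])
    kx = known[x_col].to_numpy(dtype=float)
    ky = known[y_col].to_numpy(dtype=float)
    # Use degree + 1 neighbours, or fewer if not enough known points exist
    n_points = min(degree + 1, len(kx))

    def estimate(x0):
        # Pick the known points closest to x0 and fit exactly through them
        nearest = np.argsort(np.abs(kx - x0))[:n_points]
        coeffs = np.polyfit(kx[nearest], ky[nearest], n_points - 1)
        return np.polyval(coeffs, x0)

    filled = df[y_col].copy()
    missing = filled.isna()
    filled[missing] = df.loc[missing, x_col].apply(estimate)
    return filled

# Hypothetical unevenly spaced data sampled from y = x**2
df_local = pd.DataFrame({
    'x': [0.0, 1.0, 2.5, 4.0, 7.0, 8.0],
    'y': [0.0, 1.0, np.nan, 16.0, np.nan, 64.0],
})
df_local['y_filled'] = fill_with_local_polynomial(df_local, 'x', 'y')
print(df_local)
```

Keeping the fit local avoids letting distant points influence an estimate, at the cost of a small polynomial fit per missing value.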

## Bonus One-Liner Method 5: Direct Polynomial Fit and Fill

For a quick-and-dirty one-liner solution, you can fit a polynomial and fill NaNs directly, provided you’re comfortable with compact, less readable code. It’s not recommended for complex datasets but works for quick analysis or prototyping.

Here’s an example:

```python
import numpy as np
import pandas as pd

# Sample data with NaN values
df = pd.DataFrame({'y': [1, np.nan, 3, np.nan, 5]})

# Fit a 2nd-degree polynomial to the known points and fill NaNs in one line
df['y'] = df['y'].fillna(pd.Series(np.poly1d(np.polyfit(df['y'].dropna().index, df['y'].dropna(), 2))(df.index), index=df.index))
```

The output will be the DataFrame with the ‘y’ column’s NaN values filled based on the one-liner polynomial interpolation.

This chains `np.polyfit`, `np.poly1d`, and `fillna` in a single expression: the polynomial is fitted to the non-NaN values, evaluated at every index position, and used only where values are missing. The `method` parameter of `interpolate` accepts method names as strings rather than callables, so the fit itself is done with NumPy. It’s a neat trick for those who prefer concise code and are working with straightforward datasets.

## Summary/Discussion

- **Method 1:** `pandas.Series.interpolate`. Strengths: Integrated into pandas, easy to use. Weaknesses: Needs a numeric index and SciPy, and is limited to basic use cases.
- **Method 2:** `scipy.interpolate`. Strengths: More options and flexibility. Weaknesses: Requires additional understanding of SciPy’s interpolation functions.
- **Method 3:** `numpy.polyfit`. Strengths: Control via direct access to polynomial coefficients. Weaknesses: More manual steps needed to implement.
- **Method 4:** Custom Function with `pandas.apply`. Strengths: Highly customizable for complex data. Weaknesses: Can be more time-consuming to implement and test.
- **Bonus Method 5:** Direct Polynomial Fit and Fill. Strengths: Quick and compact code. Weaknesses: Least readable and not suitable for complex scenarios.