Pandas DataFrame Missing Data Handling – isna(), isnull(), notna(), notnull(), pad() and replace() – Part 12

Rate this post

The Pandas DataFrame/Series has several methods to handle Missing Data. When applied to a DataFrame/Series, these methods evaluate and modify the missing elements.

This is Part 12 of the DataFrame methods series:

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
  • Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
  • Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
  • Part 13 focuses on the DataFrame methods drop_level(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
  • Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swap_level(), stack(), unstack() and swap_axes()
  • Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), t() and transpose()
  • Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
  • Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
  • Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
  • Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
  • Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
  • Part 21 focuses on converting one data type to/from another data type.
  • Part 22 focuses on converting one data type to/from another data type.
  • Part 23 focuses on converting one data type to/from another data type.
  • Part 24 focuses on converting one data type to/from another data type.
  • Part 25 focuses on converting one data type to/from another data type.

Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np 

DataFrame isna() & Dataframe isnull()

The DataFrame isna() and isnull() methods return Boolean (True/False) values in the same shape as the DataFrame/Series passed. If any empty values are of the following type, they will resolve to True.

  • None
  • NaN
  • NaT
  • NA

All other values (valid data) will resolve to False.

πŸ’‘ Note: Any empty strings or numpy.inf are not considered empty unless use_inf_as_na is set to True.

The syntax for these methods is as follows:

DataFrame.isna()
DataFrame.isnull()

Parameters – These methods contain no parameters.

For this example, three (3) temperatures over three (3) days for Anchorage, Alaska, save to a DataFrame. Unfortunately, some temperatures did not accurately record.

The code below returns a new DataFrame containing True values in the same position as the missing temperatures and False in the remainder.

df_temps = pd.DataFrame({'Day-1':  [np.nan, 11, 12], 
                         'Day-2':  [13, 14, pd.NaT],
                         'Day-3':  [None, 15, 16]},
                         index=['Morning', 'Noon', 'Evening'])
print(df_temps)

result = df_temps.isna()
print(result)
  • Line [1] creates a dictionary of lists and saves it to df_temps.
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] uses isna() to set the empty values (np.nan, pd.NaT, None) to True and the remainder (valid values) to False. This output saves to the result variable.
  • Line [4] outputs the result to the terminal.

Output

original df_temps

 Day-1Day-2 Day-3
Morning   NaN   13NaN   
Noon     11.0   1415.0
Evening  12.0  NaT  16.0

result

 Day-1Day-2 Day-3
Morning   TrueFalseTrue
Noon     FalseFalseFalse
Evening  FalseTrueFalse
df_temps = pd.DataFrame({'Day-1':  [np.nan, 11, 12], 
                   'Day-2':  [13, 14, pd.NaT],
                   'Day-3':  [None, 15, 16]},
                   index=['Morning', 'Noon', 'Evening'])
print(df_temps)

result = df_temps.isnull()
print(result)
  • Line [1] creates a dictionary of lists and saves it to df_temps.
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] uses isnull() to set the empty values (np.nan, pd.NaT, None) to True and the remainder (valid values) to False. This output saves to the result variable.
  • Line [4] outputs the result to the terminal.

Output

original df_temps

 Day-1Day-2 Day-3
Morning   NaN   13NaN   
Noon     11.0   1415.0
Evening  12.0  NaT  16.0

result

 Day-1Day-2 Day-3
Morning   TrueFalseTrue
Noon     FalseFalseFalse
Evening  FalseTrueFalse

πŸ’‘ Note: The isnull() method is an alias of the isna() method. The output from both examples is identical.


DataFrame notna() & notnull()

The DataFrame notna() and notnull() methods return Boolean (True/False) values. These values returned are in the same shape as the DataFrame/Series passed. If any empty values are of the following type, they will resolve to False.

  • None
  • NaN
  • NaT
  • NA

All other values that are not of the above type (valid data) will resolve to True.

The syntax for these methods is as follows:

DataFrame.notna()
DataFrame.notnull()

Parameters: These methods contain no parameters.

For this example, three (3) temperatures over three (3) days for Anchorage, Alaska, save to a DataFrame. Unfortunately, some temperatures did not accurately record.

The code below returns a new DataFrame containing True values in the same position as the missing temperatures and False in the remainder.

df_temps = pd.DataFrame({'Day-1':  [np.nan, 11, 12], 
                   'Day-2':  [13, 14, pd.NaT],
                   'Day-3':  [None, 15, 16]},
                   index=['Morning', 'Noon', 'Evening'])
print(df_temps)

result = df_temps.notna()
print(result)
  • Line [1] creates a dictionary of lists and saves it to df_temps.
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] uses notna() to set the empty values (np.nan, pd.NaT, None) to False and the remainder (valid values) to True. This output saves to the result variable.
  • Line [4] outputs the result to the terminal.

Output

original df_temps

 Day-1Day-2 Day-3
Morning   NaN   13NaN   
Noon     11.0   1415.0
Evening  12.0  NaT  16.0

result

 Day-1Day-2 Day-3
Morning   FalseTrueFalse
Noon     TrueTrueTrue
Evening  TrueFalseTrue
df_temps = pd.DataFrame({'Day-1':  [np.nan, 11, 12], 
                   'Day-2':  [13, 14, pd.NaT],
                   'Day-3':  [None, 15, 16]},
                   index=['Morning', 'Noon', 'Evening'])
print(df_temps)

result = df_temps.notnull()
print(result)
  • Line [1] creates a dictionary of lists and saves it to df_temps.
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] uses notnull() to set the empty values (np.nan, pd.NaT, None) to False and the remainder (valid values) to True. This output saves to the result variable.
  • Line [4] outputs the result to the terminal.

Output

original df_temps

 Day-1Day-2 Day-3
Morning   NaN   13NaN   
Noon     11.0   1415.0
Evening  12.0  NaT  16.0

result

 Day-1Day-2 Day-3
Morning   FalseTrueFalse
Noon     TrueTrueTrue
Evening  TrueFalseTrue

πŸ’‘ Note: The notnull() method is an alias of the notna() method. The output from both examples is identical.


DataFrame pad()

The pad() method is an alias for DataFrame/Series fillna() with the parameter method set to 'ffill'. Click here for details.


DataFrame replace()

The replace() method substitutes values in a DataFrame/Series with a different value assigned. This operation is performed dynamically on the object passed.

πŸ’‘ Note: The .loc/.iloc methods are slightly different from replace() as they require a specific location to change the said value(s).

The syntax for this method is as follows:

DataFrame.replace(to_replace=None, value=None, 
                  inplace=False, limit=None, 
                  regex=False, method='pad')
ParameterDescription
to_replaceDetermines how to locate values to replace. The following parameters are:
– Numeric, String, or Regex.
– List of Strings, Regex, or Numeric.
– Dictionary: a Dictionary, DataFrame Dictionary, or Nested Dictionary
Each one must exactly match the to_replace parameter to cause any change.
valueThe value to replace any values that match.
inplaceIf set to True, the changes apply to the original DataFrame/Series. If False, the changes apply to a new DataFrame/Series. By default, False.
limitThe maximum number of elements to backward/forward fill.
regexA regex expression to match. Matches resolve to the value parameter.
methodThe available options for this method are pad, ffill, bfill, or None. Specify the replacement method to use.

Possible Errors Raised

ErrorWhen Does It Occur?
AssertionErrorIf regex is not a Boolean (True/False), or the to_replace parameter is None.
TypeErrorIf to_replace is not in a valid format, such as:
– Not scalar, an array, a dictionary, or is None.
– If to_replace is a dictionary and the value parameter is not a list.
– If multiple Booleans or date objects and to_replace fails to match the value parameter.
ValueErrorAny error returns if a list/ndarray and value are not the same length.

The examples below show how versatile the replace() method is. We recommend you spend some time reviewing the code and output.

In this example, we have five (5) grades for a student. Notice that one (1) grade is a failing grade. To rectify this, run the following code:

Code – Example 1

grades = pd.Series([55, 64, 52, 76, 49])
print(grades)

result = grades.replace(49, 51)
print(result)
  • Line [1] creates a Series of Lists and saves it to grades.
  • Line [2] modifies the failing grade of 49 to a passing grade of 51. The output saves to result.
  • Line [3] outputs the result to the terminal.

Output

O55
164
252
376
451
dtype: int64

This example shows a DataFrame of three (3) product lines for Rivers Clothing. They want the price of 11.35 changed to 12.95. Run the code below to change the pricing.

Code – Example 2

df = pd.DataFrame({'Tops':     [10.12, 12.23, 11.35],
                   'Tanks':    [11.35, 13.45, 14.98],
                   'Sweats':  [11.35, 21.85, 35.75]})

result = df.replace(11.35, 12.95)
print(result)
  • Line [1] creates a dictionary of lists and saves it to df.
  • Line [2] replaces the value 11.35 to 12.95 for each occurrence. The output saves to result.
  • Line [3] outputs the result to the terminal.

Output

 TopsTanksSweats
010.12 12.95 12.95
112.23 13.45  21.85
212.95 14.98  35.75

Code – Example 3

This example shows a DataFrame with two (2) teams. Each team contains three (3) members. This code removes one (1) member from each team and replaces it with quit.

df = pd.DataFrame({'Team-1': ['Barb', 'Todd', 'Taylor'],
                   'Team-2': ['Arch', 'Bart', 'Alex']})

result = df.replace(to_replace=r'^Bar.$', value='quit', regex=True)
print(result)
  • Line [1] creates a Dictionary of Lists and saves it to df.
  • Line [2] replaces any values that start with Bar and contain one (1) additional character (.). This match changed to the word quit. The output saves to result.
  • Line [3] outputs the result to the terminal.