Pandas – How to Find DataFrame Row Indices with NaN or Null Values

Problem Formulation and Solution Overview

This article will show you how to find DataFrame row indices in Python with NaN or Null (empty) values.

To make it more interesting, we have the following scenario:

Rivers Clothing has given you a CSV file that requires a clean-up to make it usable for Payroll and Data Analysis. This CSV file contains numerous rows with NaN or Null (empty) values.

Snippet of rivers_emps.csv file

idf_namel_namehiredwages
1001Donald21-Jun-07NaN
1002DouglasGrant13-Jan-08121500.0
1003JenniferBursey17-Sep-03NaN
1004MichaelHartstein17-Feb-0424896.0
1005NaNFay17-Aug-05NaN
1006SusanMavris7-Jun-02NaN
1007HermannBaer7-Jun-0298653.0
1008ShelleyHiggins7-Jun-02NaN
1009WilliamGietz7-Jun-0284300.0

πŸ’¬ Question: How would we write code to find DataFrame row indices with NaN or Null (empty) values to start the clean-up process?

We can accomplish this task by one of the following options:


Preparation

Before moving forward, please ensure the Pandas library is installed. Click here if you require instructions.

Then, add the following code to the top of each script. This snippet will allow the code in this article to run error-free.

import pandas as pd

After importing the Pandas library, this library is referenced by calling the shortcode (pd).


Method 1: Use List Comprehension

This example uses List Comprehension in conjunction with iterrows(), isnull(), and any() to iterate and search a DataFrame for NaN or Null (empty) values. A list of invalid row indices returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_rows = [index for index, row in emps.iterrows() if row.isnull().any()]
print(invalid_rows)

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). The results save to the DataFrame emps.

Next, List Comprehension is used to:

  • Read in one row per iteration (emps.iterrows()).
  • Iterate through the row columns (index for index).
  • Search column contents for NaN or Null (empty) values.
  • If True, the row index is appended to invalid_rows.

The results of invalid_rows are output to the terminal.

[1001, 1003, 1005, 1006, 1008]

If we compare the results with the original CSV file, we see the results are as expected.

idf_namel_namehiredwages
1001Donald21-Jun-07NaN
1003JenniferBursey17-Sep-03NaN
1005NaNFay17-Aug-05NaN
1006SusanMavris7-Jun-02NaN
1008ShelleyHiggins7-Jun-02NaN

Method 2: Use Pandas loc() and isna()

This example uses the Pandas loc() and isna functions to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps.loc[pd.isna(emps['wages']), :].index

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Then, the loc() function is called and passed the emps DataFrame wages column. In this case, all rows of the wages column are searched for NaN or Null (empty) values.

The index (id) of the row(s) containing NaN or Null (empty) values is appended to invalid_wages, and a Class Object is returned. To confirm this, type() is called, passing one (1) argument, invalid_wages and output to the terminal.

print(type(invalid_wages))
<class 'pandas.core.indexes.numeric.Int64Index'>

Next, the contents of invalid_wages are output to the terminal.

print(invalid_wages)
Int64Index([1001, 1003, 1006, 1008], dtype='int64', name='id')

If we compare the results with the original CSV file, we see the results are as expected.

idf_namel_namehiredwages
1001Donald21-Jun-07NaN
1003JenniferBursey17-Sep-03NaN
1005NaNFay17-Aug-05NaN
1006SusanMavris7-Jun-02NaN
1008ShelleyHiggins7-Jun-02NaN

Method 3: Use isnull()

This example uses the isnull() function to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_lname = emps[emps['l_name'].isnull()]

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Then, emps is called and passed one (1) argument, the l_name column. Next, the isnull() function is applied to each row value and searches for NaN or Null (empty) values.

The results save to invalid_lname and are output to the terminal as Booleans. This output indicates one (1) match was found, Employee 1001.

id
1001True
1002False
1003False
1004False
1005False
1006False
1007False
1008False
1009False
Name: l_name, dtype: bool

These results can be filtered to display only row indices containing a True value.

invalid_lname_true = emps[emps['l_name'].isnull() == True]
print(invalid_lname_true)

Above, additional code is added (== True) after the isnull() function. This ensures only True values are appended to invalid_name_true. The results are output to the terminal.

id
1001 Donald NaN 21-Jun-07 NaN

Method 4: Use For loop isnull() and iloc()

This example uses a for loop, isnull() and iloc() functions to iterate through a DataFrame column searching for NaN or Null (empty) values.

emps = pd.read_csv('rivers_emps.csv', index_col='id')

for i in range(emps.shape[0]):
    if emps['wages'].isnull().iloc[i]:
        print(i, end=' ')

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Next, a For loop is instantiated and uses emps.shape[0] to determine the size of the DataFrame (in this case, 9 rows). This loop iterates over each row of the wages column. If the contents of the wages row contains a NaN or Null (empty) value, the index location (iloc[i]) is output to the terminal.

0 2 4 5 7

πŸ’‘Note: To display the output on one line, replace the default newline character in the print() statement with a space (‘ ‘) character.

To confirm that the indices returned are correct, view the contents of emps (including the row numbers 0-8).

idf_namel_namehiredwages
01001Donald21-Jun-07NaN
11002DouglasGrant13-Jan-08121500.0
21003JenniferBursey17-Sep-03NaN
31004MichaelHartstein17-Feb-0424896.0
41005NaNFay17-Aug-05NaN
51006SusanMavris7-Jun-02NaN
61007HermannBaer7-Jun-0298653.0
71008ShelleyHiggins7-Jun-02NaN
81009WilliamGietz7-Jun-0284300.0

Method 5: Use NumPy isnan()

This method imports the NumPy library and uses the isnan method to iterate through a DataFrame column searching for NaN or Null (empty) values.

Before moving forward, please ensure the NumPy library is installed. Click here if you require instructions.

This library is referenced by calling the shortcode (np).

import numpy as np

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps['wages'].index[emps['wages'].apply(np.isnan)]
print(invalid_wages)

Above imports the NumPy library and references it as np.

Then, the code reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Next, the wages column is called in and accessed via the index (id) value. Then, the apply() function is appended to the end of the code line and passed np.isnan as an argument.

If the contents of any wages row contain a NaN or Null (empty) value, the index (id) is saved as the Class Object invalid_wages and output to the terminal.

Int64Index([1001, 1003, 1005, 1006, 1008], dtype='int64', name='id')

Bonus: Putting it Together!

This article used several ways to find row indices with NaN or Null (empty) values. However, let’s put this together to clean up the wages data!

import pandas as pd 

emps_wages = pd.read_csv('rivers_emps.csv', usecols=['id', 'wages'], index_col='id')
wages_update = {1001: 129546.77, 1003: 65498.11, 1004: 58310.22, 1005: 119615.76, 1007: 88125.45}

for i, j in emps_wages.iterrows():
    for k, v in wages_update.items():
        if (i == k): 
            v == j
            print(i, k, v)

Above, imports the Pandas library.

The following line reads in two (2) columns from the CSV file, id and wages, and sets the index to id. This saves to emps_wages.

Then, a Python dictionary is declared. This dictionary contains the missing wage data using key:value pairs (id:wages format). The dictionary key contains the id column to search for, and the value is the new wage. This saves to wages_update.

Next, two (2) for loops are instantiated. The first loop iterates through the DataFrame emps_wages. The second iterates through the wages_updates dictionary.

The ids are compared, and if a match is found, the missing data in the DataFrame is replaced with the value data in the corresponding Dictionary wages_update.

🧩A Finxter Challenge!
Combine the knowledge you learned here to finish cleaning up the CSV file!
Save the updated data to a new CSV, and update the missing Last Name.


Summary

These six (6) methods of how to find DataFrame row indices in Python with NaN or Null (empty) values should give you enough information to select the best one for your coding requirements.

Good Luck & Happy Coding!


Programming Humor – Python

“I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I’m leaving you.”xkcd