Pandas - How to Find DataFrame Row Indices with NaN or Null Values

Problem Formulation and Solution Overview

This article will show you how to find DataFrame row indices in Python with NaN or Null (empty) values.

To make it more interesting, we have the following scenario:

Rivers Clothing has given you a CSV file that requires a clean-up to make it usable for Payroll and Data Analysis. This CSV file contains numerous rows with NaN or Null (empty) values.

Snippet of rivers_emps.csv file

id	f_name	l_name	hired	wages
1001	Donald		21-Jun-07	NaN
1002	Douglas	Grant	13-Jan-08	121500.0
1003	Jennifer	Bursey	17-Sep-03	NaN
1004	Michael	Hartstein	17-Feb-04	24896.0
1005	NaN	Fay	17-Aug-05	NaN
1006	Susan	Mavris	7-Jun-02	NaN
1007	Hermann	Baer	7-Jun-02	98653.0
1008	Shelley	Higgins	7-Jun-02	NaN
1009	William	Gietz	7-Jun-02	84300.0

💬 Question: How would we write code to find DataFrame row indices with NaN or Null (empty) values to start the clean-up process?

We can accomplish this task by one of the following options:

Method 1: Use List Comprehension
Method 2: Use loc() and isna()
Method 3: Use isnull()
Method 4: Use for loop, isnull() and iloc()
Method 5: Use NumPy Library and isnan()

Preparation

Before moving forward, please ensure the Pandas library is installed. Click here if you require instructions.

Then, add the following code to the top of each script. This snippet will allow the code in this article to run error-free.

import pandas as pd

After importing the Pandas library, this library is referenced by calling the shortcode (pd).

Method 1: Use List Comprehension

This example uses List Comprehension in conjunction with iterrows(), isnull(), and any() to iterate and search a DataFrame for NaN or Null (empty) values. A list of invalid row indices returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_rows = [index for index, row in emps.iterrows() if row.isnull().any()]
print(invalid_rows)

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). The results save to the DataFrame emps.

Next, List Comprehension is used to:

Read in one row per iteration (emps.iterrows()).
Iterate through the row columns (index for index).
Search column contents for NaN or Null (empty) values.
If True, the row index is appended to invalid_rows.

The results of invalid_rows are output to the terminal.

[1001, 1003, 1005, 1006, 1008]

If we compare the results with the original CSV file, we see the results are as expected.

id	f_name	l_name	hired	wages
1001	Donald		21-Jun-07	NaN
1003	Jennifer	Bursey	17-Sep-03	NaN
1005	NaN	Fay	17-Aug-05	NaN
1006	Susan	Mavris	7-Jun-02	NaN
1008	Shelley	Higgins	7-Jun-02	NaN

Method 2: Use Pandas loc() and isna()

This example uses the Pandas loc() and isna functions to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps.loc[pd.isna(emps['wages']), :].index

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Then, the loc() function is called and passed the emps DataFrame wages column. In this case, all rows of the wages column are searched for NaN or Null (empty) values.

The index (id) of the row(s) containing NaN or Null (empty) values is appended to invalid_wages, and a Class Object is returned. To confirm this, type() is called, passing one (1) argument, invalid_wages and output to the terminal.

print(type(invalid_wages))

<class 'pandas.core.indexes.numeric.Int64Index'>

Next, the contents of invalid_wages are output to the terminal.

print(invalid_wages)

Int64Index([1001, 1003, 1006, 1008], dtype='int64', name='id')

If we compare the results with the original CSV file, we see the results are as expected.

id	f_name	l_name	hired	wages
1001	Donald		21-Jun-07	NaN
1003	Jennifer	Bursey	17-Sep-03	NaN
1005	NaN	Fay	17-Aug-05	NaN
1006	Susan	Mavris	7-Jun-02	NaN
1008	Shelley	Higgins	7-Jun-02	NaN

Method 3: Use isnull()

This example uses the isnull() function to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_lname = emps[emps['l_name'].isnull()]

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Then, emps is called and passed one (1) argument, the l_name column. Next, the isnull() function is applied to each row value and searches for NaN or Null (empty) values.

The results save to invalid_lname and are output to the terminal as Booleans. This output indicates one (1) match was found, Employee 1001.

id
1001	True
1002	False
1003	False
1004	False
1005	False
1006	False
1007	False
1008	False
1009	False
Name: l_name, dtype: bool

These results can be filtered to display only row indices containing a True value.

invalid_lname_true = emps[emps['l_name'].isnull() == True]
print(invalid_lname_true)

Above, additional code is added (== True) after the isnull() function. This ensures only True values are appended to invalid_name_true. The results are output to the terminal.

1001 Donald NaN 21-Jun-07 NaN

Method 4: Use For loop isnull() and iloc()

This example uses a for loop, isnull() and iloc() functions to iterate through a DataFrame column searching for NaN or Null (empty) values.

emps = pd.read_csv('rivers_emps.csv', index_col='id')

for i in range(emps.shape[0]):
    if emps['wages'].isnull().iloc[i]:
        print(i, end=' ')

The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Next, a For loop is instantiated and uses emps.shape[0] to determine the size of the DataFrame (in this case, 9 rows). This loop iterates over each row of the wages column. If the contents of the wages row contains a NaN or Null (empty) value, the index location (iloc[i]) is output to the terminal.

0 2 4 5 7

💡Note: To display the output on one line, replace the default newline character in the print() statement with a space (‘ ‘) character.

To confirm that the indices returned are correct, view the contents of emps (including the row numbers 0-8).

	id	f_name	l_name	hired	wages
0	1001	Donald		21-Jun-07	NaN
1	1002	Douglas	Grant	13-Jan-08	121500.0
2	1003	Jennifer	Bursey	17-Sep-03	NaN
3	1004	Michael	Hartstein	17-Feb-04	24896.0
4	1005	NaN	Fay	17-Aug-05	NaN
5	1006	Susan	Mavris	7-Jun-02	NaN
6	1007	Hermann	Baer	7-Jun-02	98653.0
7	1008	Shelley	Higgins	7-Jun-02	NaN
8	1009	William	Gietz	7-Jun-02	84300.0

Method 5: Use NumPy isnan()

This method imports the NumPy library and uses the isnan method to iterate through a DataFrame column searching for NaN or Null (empty) values.

Before moving forward, please ensure the NumPy library is installed. Click here if you require instructions.

This library is referenced by calling the shortcode (np).

import numpy as np

emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps['wages'].index[emps['wages'].apply(np.isnan)]
print(invalid_wages)

Above imports the NumPy library and references it as np.

Then, the code reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.

Next, the wages column is called in and accessed via the index (id) value. Then, the apply() function is appended to the end of the code line and passed np.isnan as an argument.

If the contents of any wages row contain a NaN or Null (empty) value, the index (id) is saved as the Class Object invalid_wages and output to the terminal.

Int64Index([1001, 1003, 1005, 1006, 1008], dtype='int64', name='id')

Bonus: Putting it Together!

This article used several ways to find row indices with NaN or Null (empty) values. However, let’s put this together to clean up the wages data!

import pandas as pd 

emps_wages = pd.read_csv('rivers_emps.csv', usecols=['id', 'wages'], index_col='id')
wages_update = {1001: 129546.77, 1003: 65498.11, 1004: 58310.22, 1005: 119615.76, 1007: 88125.45}

for i, j in emps_wages.iterrows():
    for k, v in wages_update.items():
        if (i == k): 
            v == j
            print(i, k, v)

Above, imports the Pandas library.

The following line reads in two (2) columns from the CSV file, id and wages, and sets the index to id. This saves to emps_wages.

Then, a Python dictionary is declared. This dictionary contains the missing wage data using key:value pairs (id:wages format). The dictionary key contains the id column to search for, and the value is the new wage. This saves to wages_update.

Next, two (2) for loops are instantiated. The first loop iterates through the DataFrame emps_wages. The second iterates through the wages_updates dictionary.

The ids are compared, and if a match is found, the missing data in the DataFrame is replaced with the value data in the corresponding Dictionary wages_update.

🧩A Finxter Challenge!
Combine the knowledge you learned here to finish cleaning up the CSV file!
Save the updated data to a new CSV, and update the missing Last Name.

Summary

These six (6) methods of how to find DataFrame row indices in Python with NaN or Null (empty) values should give you enough information to select the best one for your coding requirements.

Good Luck & Happy Coding!

Pandas – How to Find DataFrame Row Indices with NaN or Null Values

Problem Formulation and Solution Overview

Preparation

Method 1: Use List Comprehension

Method 2: Use Pandas loc() and isna()

Method 3: Use isnull()

Method 4: Use For loop isnull() and iloc()

Method 5: Use NumPy isnan()

Bonus: Putting it Together!

Summary

Programming Humor – Python