Problem Formulation and Solution Overview
To make it more interesting, we have the following scenario:
Snippet of rivers_emps.csv file
id | f_name | l_name | hired | wages |
1001 | Donald | 21-Jun-07 | NaN | |
1002 | Douglas | Grant | 13-Jan-08 | 121500.0 |
1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
1004 | Michael | Hartstein | 17-Feb-04 | 24896.0 |
1005 | NaN | Fay | 17-Aug-05 | NaN |
1006 | Susan | Mavris | 7-Jun-02 | NaN |
1007 | Hermann | Baer | 7-Jun-02 | 98653.0 |
1008 | Shelley | Higgins | 7-Jun-02 | NaN |
1009 | William | Gietz | 7-Jun-02 | 84300.0 |
Preparation
import pandas as pd
Method 1: Use List Comprehension
This example uses List Comprehension in conjunction with iterrows()
, isnull()
, and any()
to iterate and search a DataFrame for NaN or Null (empty) values. A list of invalid row indices returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id') invalid_rows = [index for index, row in emps.iterrows() if row.isnull().any()] print(invalid_rows)
The above reads in the rivers_emps.csv
file and sets the DataFrame index
to the id
column (index_col='id'
). The results save to the DataFrame emps
.
Next, List Comprehension is used to:
- Read in one row per iteration (
emps.iterrows()
). - Iterate through the row columns (
index for index
). - Search column contents for NaN or Null (empty) values.
- If
True
, the row index is appended toinvalid_rows
.
The results of invalid_rows
are output to the terminal.
[1001, 1003, 1005, 1006, 1008] |
If we compare the results with the original CSV file, we see the results are as expected.
id | f_name | l_name | hired | wages |
1001 | Donald | 21-Jun-07 | NaN | |
1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
1005 | NaN | Fay | 17-Aug-05 | NaN |
1006 | Susan | Mavris | 7-Jun-02 | NaN |
1008 | Shelley | Higgins | 7-Jun-02 | NaN |
Method 2: Use Pandas loc() and isna()
This example uses the Pandas loc()
and isna
functions to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id') invalid_wages = emps.loc[pd.isna(emps['wages']), :].index
The above reads in the rivers_emps.csv
file and sets the DataFrame index
to the id
column (index_col='id'
). This saves to the DataFrame emps
.
Then, the loc()
function is called and passed the emps
DataFrame wages
column. In this case, all rows of the wages
column are searched for NaN or Null (empty) values.
The index
(id
) of the row(s) containing NaN or Null (empty) values is appended to invalid_wages
, and a Class Object is returned. To confirm this, type()
is called, passing one (1) argument, invalid_wages
and output to the terminal.
print(type(invalid_wages))
<class 'pandas.core.indexes.numeric.Int64Index'> |
Next, the contents of invalid_wages
are output to the terminal.
print(invalid_wages)
Int64Index([1001, 1003, 1006, 1008], dtype='int64', name='id') |
If we compare the results with the original CSV file, we see the results are as expected.
id | f_name | l_name | hired | wages |
1001 | Donald | 21-Jun-07 | NaN | |
1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
1005 | NaN | Fay | 17-Aug-05 | NaN |
1006 | Susan | Mavris | 7-Jun-02 | NaN |
1008 | Shelley | Higgins | 7-Jun-02 | NaN |
Method 3: Use isnull()
This example uses the isnull()
function to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id') invalid_lname = emps[emps['l_name'].isnull()]
The above reads in the rivers_emps.csv
file and sets the DataFrame index
to the id
column (index_col='id'
). This saves to the DataFrame emps
.
Then, emps
is called and passed one (1) argument, the l_name
column. Next, the isnull()
function is applied to each row value and searches for NaN or Null (empty) values.
The results save to invalid_lname
and are output to the terminal as Booleans. This output indicates one (1) match was found, Employee 1001.
id | |
1001 | True |
1002 | False |
1003 | False |
1004 | False |
1005 | False |
1006 | False |
1007 | False |
1008 | False |
1009 | False |
Name: l_name, dtype: bool |
These results can be filtered to display only row indices containing a True
value.
invalid_lname_true = emps[emps['l_name'].isnull() == True] print(invalid_lname_true)
Above, additional code is added (== True
) after the isnull()
function. This ensures only True
values are appended to invalid_name_true
. The results are output to the terminal.
id |
1001 Donald NaN 21-Jun-07 NaN |
Method 4: Use For loop isnull() and iloc()
This example uses a for
loop, isnull()
and iloc()
functions to iterate through a DataFrame column searching for NaN or Null (empty) values.
emps = pd.read_csv('rivers_emps.csv', index_col='id') for i in range(emps.shape[0]): if emps['wages'].isnull().iloc[i]: print(i, end=' ')
The above reads in the rivers_emps.csv
file and sets the DataFrame index
to the id
column (index_col='id'
). This saves to the DataFrame emps
.
Next, a For
loop is instantiated and uses emps.shape[0]
to determine the size of the DataFrame (in this case, 9 rows). This loop iterates over each row of the wages
column. If the contents of the wages
row contains a NaN or Null (empty) value, the index location (iloc[i]
) is output to the terminal.
0 2 4 5 7 |
π‘Note: To display the output on one line, replace the default newline character in the print()
statement with a space (‘ ‘) character.
To confirm that the indices returned are correct, view the contents of emps
(including the row numbers 0-8).
id | f_name | l_name | hired | wages | |
0 | 1001 | Donald | 21-Jun-07 | NaN | |
1 | 1002 | Douglas | Grant | 13-Jan-08 | 121500.0 |
2 | 1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
3 | 1004 | Michael | Hartstein | 17-Feb-04 | 24896.0 |
4 | 1005 | NaN | Fay | 17-Aug-05 | NaN |
5 | 1006 | Susan | Mavris | 7-Jun-02 | NaN |
6 | 1007 | Hermann | Baer | 7-Jun-02 | 98653.0 |
7 | 1008 | Shelley | Higgins | 7-Jun-02 | NaN |
8 | 1009 | William | Gietz | 7-Jun-02 | 84300.0 |
Method 5: Use NumPy isnan()
This method imports the NumPy library and uses the isnan
method to iterate through a DataFrame column searching for NaN or Null (empty) values.
Before moving forward, please ensure the NumPy library is installed. Click here if you require instructions.
This library is referenced by calling the shortcode (np
).
import numpy as np emps = pd.read_csv('rivers_emps.csv', index_col='id') invalid_wages = emps['wages'].index[emps['wages'].apply(np.isnan)] print(invalid_wages)
Above imports the NumPy library and references it as np
.
Then, the code reads in the rivers_emps.csv
file and sets the DataFrame index
to the id
column (index_col='id'
). This saves to the DataFrame emps
.
Next, the wages
column is called in and accessed via the index (id
) value. Then, the apply()
function is appended to the end of the code line and passed np.isnan
as an argument.
If the contents of any wages
row contain a NaN or Null (empty) value, the index (id
) is saved as the Class Object invalid_wages
and output to the terminal.
Int64Index([1001, 1003, 1005, 1006, 1008], dtype='int64', name='id') |
Bonus: Putting it Together!
This article used several ways to find row indices with NaN or Null (empty) values. However, let’s put this together to clean up the wages
data!
import pandas as pd emps_wages = pd.read_csv('rivers_emps.csv', usecols=['id', 'wages'], index_col='id') wages_update = {1001: 129546.77, 1003: 65498.11, 1004: 58310.22, 1005: 119615.76, 1007: 88125.45} for i, j in emps_wages.iterrows(): for k, v in wages_update.items(): if (i == k): v == j print(i, k, v)
Above, imports the Pandas library.
The following line reads in two (2) columns from the CSV file, id
and wages
, and sets the index to id
. This saves to emps_wages
.
Then, a Python dictionary is declared. This dictionary contains the missing wage data using key:value
pairs (id:wages
format). The dictionary key
contains the id
column to search for, and the value
is the new wage. This saves to wages_update
.
Next, two (2) for
loops are instantiated. The first loop iterates through the DataFrame emps_wages
. The second iterates through the wages_updates
dictionary
.
The ids
are compared, and if a match is found, the missing data in the DataFrame is replaced with the value data in the corresponding Dictionary wages_update
.
π§©A Finxter Challenge!
Combine the knowledge you learned here to finish cleaning up the CSV file!
Save the updated data to a new CSV, and update the missing Last Name.
Summary
Good Luck & Happy Coding!