Problem Formulation and Solution Overview
To make it more interesting, we have the following scenario:
β₯οΈ Info: Are you AI curious but you still have to create real impactful projects? Join our official AI builder club on Skool (only $5): SHIP! - One Project Per Month
Snippet of rivers_emps.csv file
| id | f_name | l_name | hired | wages |
| 1001 | Donald | 21-Jun-07 | NaN | |
| 1002 | Douglas | Grant | 13-Jan-08 | 121500.0 |
| 1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
| 1004 | Michael | Hartstein | 17-Feb-04 | 24896.0 |
| 1005 | NaN | Fay | 17-Aug-05 | NaN |
| 1006 | Susan | Mavris | 7-Jun-02 | NaN |
| 1007 | Hermann | Baer | 7-Jun-02 | 98653.0 |
| 1008 | Shelley | Higgins | 7-Jun-02 | NaN |
| 1009 | William | Gietz | 7-Jun-02 | 84300.0 |
Preparation
import pandas as pd
Method 1: Use List Comprehension
This example uses List Comprehension in conjunction with iterrows(), isnull(), and any() to iterate and search a DataFrame for NaN or Null (empty) values. A list of invalid row indices returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_rows = [index for index, row in emps.iterrows() if row.isnull().any()]
print(invalid_rows)The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). The results save to the DataFrame emps.
Next, List Comprehension is used to:
- Read in one row per iteration (
emps.iterrows()). - Iterate through the row columns (
index for index). - Search column contents for NaN or Null (empty) values.
- If
True, the row index is appended toinvalid_rows.
The results of invalid_rows are output to the terminal.
| [1001, 1003, 1005, 1006, 1008] |
If we compare the results with the original CSV file, we see the results are as expected.
| id | f_name | l_name | hired | wages |
| 1001 | Donald | 21-Jun-07 | NaN | |
| 1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
| 1005 | NaN | Fay | 17-Aug-05 | NaN |
| 1006 | Susan | Mavris | 7-Jun-02 | NaN |
| 1008 | Shelley | Higgins | 7-Jun-02 | NaN |
Method 2: Use Pandas loc() and isna()
This example uses the Pandas loc() and isna functions to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps.loc[pd.isna(emps['wages']), :].indexThe above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.
Then, the loc() function is called and passed the emps DataFrame wages column. In this case, all rows of the wages column are searched for NaN or Null (empty) values.
The index (id) of the row(s) containing NaN or Null (empty) values is appended to invalid_wages, and a Class Object is returned. To confirm this, type() is called, passing one (1) argument, invalid_wages and output to the terminal.
print(type(invalid_wages))
<class 'pandas.core.indexes.numeric.Int64Index'> |
Next, the contents of invalid_wages are output to the terminal.
print(invalid_wages)
Int64Index([1001, 1003, 1006, 1008], dtype='int64', name='id') |
If we compare the results with the original CSV file, we see the results are as expected.
| id | f_name | l_name | hired | wages |
| 1001 | Donald | 21-Jun-07 | NaN | |
| 1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
| 1005 | NaN | Fay | 17-Aug-05 | NaN |
| 1006 | Susan | Mavris | 7-Jun-02 | NaN |
| 1008 | Shelley | Higgins | 7-Jun-02 | NaN |
Method 3: Use isnull()
This example uses the isnull() function to iterate through a DataFrame column searching for NaN or Null (empty) values. A Class Object returns.
emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_lname = emps[emps['l_name'].isnull()]The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.
Then, emps is called and passed one (1) argument, the l_name column. Next, the isnull() function is applied to each row value and searches for NaN or Null (empty) values.
The results save to invalid_lname and are output to the terminal as Booleans. This output indicates one (1) match was found, Employee 1001.
| id | |
| 1001 | True |
| 1002 | False |
| 1003 | False |
| 1004 | False |
| 1005 | False |
| 1006 | False |
| 1007 | False |
| 1008 | False |
| 1009 | False |
| Name: l_name, dtype: bool |
These results can be filtered to display only row indices containing a True value.
invalid_lname_true = emps[emps['l_name'].isnull() == True] print(invalid_lname_true)
Above, additional code is added (== True) after the isnull() function. This ensures only True values are appended to invalid_name_true. The results are output to the terminal.
| id |
| 1001 Donald NaN 21-Jun-07 NaN |
Method 4: Use For loop isnull() and iloc()
This example uses a for loop, isnull() and iloc() functions to iterate through a DataFrame column searching for NaN or Null (empty) values.
emps = pd.read_csv('rivers_emps.csv', index_col='id')
for i in range(emps.shape[0]):
if emps['wages'].isnull().iloc[i]:
print(i, end=' ')The above reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.
Next, a For loop is instantiated and uses emps.shape[0] to determine the size of the DataFrame (in this case, 9 rows). This loop iterates over each row of the wages column. If the contents of the wages row contains a NaN or Null (empty) value, the index location (iloc[i]) is output to the terminal.
| 0 2 4 5 7 |
π‘Note: To display the output on one line, replace the default newline character in the print() statement with a space (‘ ‘) character.
To confirm that the indices returned are correct, view the contents of emps (including the row numbers 0-8).
| id | f_name | l_name | hired | wages | |
| 0 | 1001 | Donald | 21-Jun-07 | NaN | |
| 1 | 1002 | Douglas | Grant | 13-Jan-08 | 121500.0 |
| 2 | 1003 | Jennifer | Bursey | 17-Sep-03 | NaN |
| 3 | 1004 | Michael | Hartstein | 17-Feb-04 | 24896.0 |
| 4 | 1005 | NaN | Fay | 17-Aug-05 | NaN |
| 5 | 1006 | Susan | Mavris | 7-Jun-02 | NaN |
| 6 | 1007 | Hermann | Baer | 7-Jun-02 | 98653.0 |
| 7 | 1008 | Shelley | Higgins | 7-Jun-02 | NaN |
| 8 | 1009 | William | Gietz | 7-Jun-02 | 84300.0 |
Method 5: Use NumPy isnan()
This method imports the NumPy library and uses the isnan method to iterate through a DataFrame column searching for NaN or Null (empty) values.
Before moving forward, please ensure the NumPy library is installed. Click here if you require instructions.
This library is referenced by calling the shortcode (np).
import numpy as np
emps = pd.read_csv('rivers_emps.csv', index_col='id')
invalid_wages = emps['wages'].index[emps['wages'].apply(np.isnan)]
print(invalid_wages)Above imports the NumPy library and references it as np.
Then, the code reads in the rivers_emps.csv file and sets the DataFrame index to the id column (index_col='id'). This saves to the DataFrame emps.
Next, the wages column is called in and accessed via the index (id) value. Then, the apply() function is appended to the end of the code line and passed np.isnan as an argument.
If the contents of any wages row contain a NaN or Null (empty) value, the index (id) is saved as the Class Object invalid_wages and output to the terminal.
Int64Index([1001, 1003, 1005, 1006, 1008], dtype='int64', name='id') |
Bonus: Putting it Together!
This article used several ways to find row indices with NaN or Null (empty) values. However, let’s put this together to clean up the wages data!
import pandas as pd
emps_wages = pd.read_csv('rivers_emps.csv', usecols=['id', 'wages'], index_col='id')
wages_update = {1001: 129546.77, 1003: 65498.11, 1004: 58310.22, 1005: 119615.76, 1007: 88125.45}
for i, j in emps_wages.iterrows():
for k, v in wages_update.items():
if (i == k):
v == j
print(i, k, v)Above, imports the Pandas library.
The following line reads in two (2) columns from the CSV file, id and wages, and sets the index to id. This saves to emps_wages.
Then, a Python dictionary is declared. This dictionary contains the missing wage data using key:value pairs (id:wages format). The dictionary key contains the id column to search for, and the value is the new wage. This saves to wages_update.
Next, two (2) for loops are instantiated. The first loop iterates through the DataFrame emps_wages. The second iterates through the wages_updates dictionary.
The ids are compared, and if a match is found, the missing data in the DataFrame is replaced with the value data in the corresponding Dictionary wages_update.
π§©A Finxter Challenge!
Combine the knowledge you learned here to finish cleaning up the CSV file!
Save the updated data to a new CSV, and update the missing Last Name.
Summary
Good Luck & Happy Coding!
Programming Humor – Python
