5 Best Ways to Count NaN Values in a Column in a Python Pandas DataFrame

πŸ’‘ Problem Formulation: When working with datasets in Pandas, it’s common to encounter missing data, often represented as NaN (Not a Number) values. Accurately counting these NaNs within individual DataFrame columns is essential for data cleaning and analysis. The input is a Pandas DataFrame with a mixture of numeric and NaN values, while the desired output is the count of NaN values in a specified column.
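
As a concrete illustration of the input/output pairing, here is a minimal sketch (the same sample data is reused throughout the article):

import pandas as pd
import numpy as np

# Input: a column that mixes numbers and missing values
column = pd.Series([1, 2, np.nan, 4], name='A')

# Desired output: the number of NaN entries in that column -- here, 1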

Method 1: Using isna() and sum() Methods

This method combines the Pandas isna() function, which returns a boolean mask marking where the NaNs are, with the sum() method, which counts the True values; since each True sums as 1, the total equals the number of NaNs.

Here’s an example:

import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 2, np.nan, 4]})

# Count NaNs in column 'A'
nan_count_a = df['A'].isna().sum()

# Count NaNs in column 'B'
nan_count_b = df['B'].isna().sum()

print('nan_count_a:', nan_count_a)
print('nan_count_b:', nan_count_b)

Output:

nan_count_a: 1
nan_count_b: 2

This snippet creates a DataFrame with some NaN values, then uses isna() to generate a Boolean mask, and finally sums the mask to get the total count of NaNs in each column.
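
As a quick extension, the same isna().sum() pattern works on the whole DataFrame at once; the short sketch below (reusing the df from above) returns a Series of per-column NaN counts.

# Count NaNs in every column at once; the result is a Series indexed by column name
nan_counts = df.isna().sum()
print(nan_counts)
# A    1
# B    2
# dtype: int64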

Method 2: Using isnull() Function

The isnull() function in Pandas is an alias of isna(): it returns the same boolean mask of NaN values, which can be summed to count the NaNs. Which name you use is purely a matter of preference.

Here’s an example:

nan_count_a = df['A'].isnull().sum()
nan_count_b = df['B'].isnull().sum()

Output remains the same as in Method 1.

Here isnull() is used interchangeably with isna() to achieve the same result, offering flexibility in the function naming preference for users.
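
If you want to verify that the two spellings really behave identically, a quick check along these lines (reusing the same df) confirms that the boolean masks match.

# isnull() and isna() produce identical boolean masks
assert df['A'].isnull().equals(df['A'].isna())
assert df['B'].isnull().equals(df['B'].isna())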

Method 3: Using the info() Method

The info() method provides a concise summary of the DataFrame, including the number of non-null entries in each column. Subtracting the non-null count from the total length of the DataFrame gives the count of NaN values.

Here’s an example:

df.info()

# Manual calculation
total_entries = len(df)
nan_count_a = total_entries - df['A'].count()
nan_count_b = total_entries - df['B'].count()

Output from info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      float64
 1   B       2 non-null      float64
dtypes: float64(2)
memory usage: 192.0 bytes

info() prints an overview of the DataFrame, including per-column non-null counts; subtracting a column’s non-null count (as returned by count()) from the DataFrame’s length gives the number of NaNs in that column.
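
The same subtraction can also be done for every column at once, without reading info()’s text output; here is a minimal sketch with the sample df:

# Vectorized form of the subtraction: DataFrame length minus per-column non-null counts
nan_counts = len(df) - df.count()
print(nan_counts)
# A    1
# B    2
# dtype: int64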

Method 4: Using value_counts() with the dropna Parameter

Though it is not designed specifically for counting NaN values, value_counts() called with dropna=False includes NaN among the tallied values. Summing its output and subtracting the number of non-NaN entries (from count()) yields the NaN count.

Here’s an example:

nan_count_a = df['A'].value_counts(dropna=False).sum() - df['A'].count()
nan_count_b = df['B'].value_counts(dropna=False).sum() - df['B'].count()

The output will be the same NaN counts as in the previous methods.

With dropna=False, value_counts() tallies every unique value, NaN included; the NaN count is then recovered with a small subtraction against count().
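
Alternatively, because the NaN tally already appears in the value_counts() output when dropna=False is passed, it can be read off directly instead of computing a difference; a small sketch using the same df:

# NaN shows up as an index label when dropna=False is used
vc = df['B'].value_counts(dropna=False)
nan_count_b = vc[vc.index.isna()].sum()
print(nan_count_b)  # 2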

Bonus One-Liner Method 5: Using len() and a List Comprehension

This one-liner uses a list comprehension with len() to count NaNs, testing each value with pd.isna(). (A direct comparison with np.nan would not work, because NaN never compares equal to itself.)

Here’s an example:

nan_count_a = len([x for x in df['A'] if pd.isna(x)])
nan_count_b = len([x for x in df['B'] if pd.isna(x)])

The output will be the same: 1 for column ‘A’ and 2 for column ‘B’.

Although less efficient for large datasets, this method directly counts NaNs by iterating over the column and checking each element.
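
As a side note on why pd.isna() is used instead of an equality check, the snippet below (reusing the same df) shows that NaN never compares equal to itself, so an expression like x == np.nan silently misses every missing value.

import numpy as np

print(np.nan == np.nan)                          # False
print(len([x for x in df['A'] if x == np.nan]))  # 0 -- equality check misses the NaN
print(len([x for x in df['A'] if pd.isna(x)]))   # 1 -- correct count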

Summary/Discussion

  • Method 1: isna() with sum(). Direct and efficient. Best for quick counts; widely used and recommended.
  • Method 2: isnull(). An alias to isna(). Offers coding style flexibility while maintaining efficiency.
  • Method 3: info(). More informative, but requires an additional calculation. Useful for gaining an overall understanding of DataFrame structure.
  • Method 4: value_counts(). Less direct, but useful if also interested in counts of all other values. Requires additional arithmetic to find NaN counts.
  • Bonus Method 5: List comprehension with len(). Pythonic, but less performant for large data. Useful for simple scenarios or when operating outside of Pandas methods.