Pandas NaN — Working With Missing Data

Rate this post

Pandas is Excel on steroids—the powerful Python library allows you to analyze structured and tabular data with surprising efficiency and ease. Pandas is one of the reasons why master coders reach 100x the efficiency of average coders. In today’s article, you’ll learn how to work with missing data—in particular, how to handle NaN values in Pandas DataFrames.

PANDAS ULTIMATE GUIDE TO NAN (NOT A NUMBER)

You’ll learn about all the different reasons why NaNs appear in your DataFrames—and how to handle them. Let’s get started!

Checking Series for NaN Values

Problem: How to check a series for NaN values?

Have a look at the following code:

import pandas as pd
import numpy as np

data = pd.Series([0, np.NaN, 2])
result = data.hasnans

print(result)
# True

Series can contain NaN-values—an abbreviation for Not-A-Number—that describe undefined values.

To check if a Series contains one or more NaN value, use the attribute hasnans. The attribute returns True if there is at least one NaN value and False otherwise.

There’s a NaN value in the Series, so the output is True.

Filtering Series Generates NaN

Problem: When filtering a Series with where() and no element passes the filtering condition, what’s the result?

import pandas as pd

xs = pd.Series([5, 1, 4, 2, 3])
xs.where(xs > 2, inplace=True)
result = xs.hasnans

print(result)
# True

The method where() filters a Series by a condition. Only the elements that satisfy the condition remain in the resulting Series. And what happens if a value doesn’t satisfy the condition? Per default, all rows not satisfying the condition are filled with NaN-values.

This is why our Series contains NaN-values after filtering it with the method where().

Working with Multiple Series of Different Lengths

Problem: If you element-wise add two Series objects with a different number of elements—what happens with the remaining elements?

import pandas as pd

s = pd.Series(range(0, 10))
t = pd.Series(range(0, 20))
result = (s + t)[1]

print(result)
# 2

To add two Series element-wise, use the default addition operator +. The Series do not need to have the same size because once the first Series ends, the subsequent element-wise results are NaN values.

At index 1 in the resulting Series, you get the result of 1 + 1 = 2.

Create a DataFrame From a List of Dictionaries with Unequal Keys

Problem: How to create a DataFrame from a list of dictionaries if the dictionaries have unequal keys? A DataFrame expects the same columns to be available for each row!

import pandas as pd

data = [{'Car':'Mercedes', 'Driver':'Hamilton, Lewis'},
        {'Car':'Ferrari', 'Driver':'Schumacher, Michael'},
        {'Car':'Lamborghini'}]

df = pd.DataFrame(data, index=['Rank 2', 'Rank 1', 'Rank 3'])
df.sort_index(inplace=True)
result = df['Car'].iloc[0]

print(result)
# Ferrari

You can create a DataFrame from a list of dictionaries. The dictionaries’ keys define the column labels, and the values define the columns’ entries. Not all dictionaries must contain the same keys. If a dictionary doesn’t contain a particular key, this will be interpreted as a NaN-value.

This code snippet uses string labels as index values to sort the DataFrame. After sorting the DataFrame, the row with index label Rank 1 is at location 0 in the DataFrame and the value in the column Car is Ferrari.

Sorting a DataFrame by Column with NaN Values

Problem: What happens if you sort a DataFrame by column if the column contains a NaN value?

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

selection = df.sort_values(by="engine-size")
result = selection.index.to_list()[0]
print(result)
# 1

In this code snippet, you sort the rows of the DataFrame by the values of the column engine-size.

The main point is that NaN values are always moved to the end in Pandas sorting. Thus, the first value is 1.8, which belongs to the row with index value 1.

Count Non-NaN Values

Problem: How to count the number of elements in a dataframe column that are not Nan?

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df.count()[5]
print(result)
# 4

The method count() returns the number of non-NaN values for each column. The DataFrame df has five rows. The fifth column
contains one NaN value. Therefore, the count of the fifth column is 4.

Drop NaN-Values

Problem: How to drop all rows that contain a NaN value in any of its columns—and how to restrict this to certain columns?

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

selection1 = df.dropna(subset=["price"])
selection2 = df.dropna()
print(len(selection1), len(selection2))
# 5 4

The DataFrame’s dropna() method drops all rows that contain a NaN value in any of its columns. But how to restrict the columns to be scanned for NaN values?

By passing a list of column labels to the optional parameter subset, you can define which columns you want to consider.

The call of dropna() without restriction, drops line 2 because of the NaN value in the column engine-size. When you restrict the columns only to price, no rows will be dropped, because no NaN value is present.

Drop Nan and Reset Index

Problem: What happens to indices after dropping certain rows?

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df.drop([0, 1, 2], inplace=True)
df.reset_index(inplace=True)
result = df.index.to_list()
print(result)
# [0, 1]

The method drop() on a DataFrame deletes rows or columns by index. You can either pass a single value or a list of values.

By default the inplace parameter is set to False, so that modifications won’t affect the initial DataFrame object. Instead, the method returns a modified copy of the DataFrame. In the puzzle, you set inplace to True, so the deletions are performed directly on the DataFrame.

After deleting the first three rows, the first two index labels are 3 and 4. You can reset the default indexing by calling the method reset_index() on the DataFrame, so that the index starts at 0 again. As there are only two rows left in the DataFrame, the result is [0, 1].

Concatenation of Dissimilar DataFrames Filled With NaN

Problem: How to concatenate two DataFrames if they have different columns?

import pandas as pd

df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Additional Dataframe "df2"
# ----------
#      make   origin
# 0   skoda  Czechia
# 1  toyota    Japan
# 2    ford      USA
# ----------

try:
    result = pd.concat([df, df2], axis=0, ignore_index=True)
    print("Y")
except Exception:
    print ("N")

# Y

Even if DataFrames have different columns, you can concatenate them.

If DataFrame 1 has columns A and B and DataFrame 2 has columns C and D, the result of concatenating DataFrames 1 and 2 is a DataFrame with columns A, B, C, and D. Missing values in the rows are filled with NaN.

Outer Merge

Problem: When merging (=joining) two DataFrames—what happens if there are missing values?

import pandas as pd

df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Additional dataframe "df2"
# ----------
#     make   origin
# 0  skoda  Czechia
# 1  mazda    Japan
# 2   ford      USA
# ----------

result = pd.merge(df, df2, how="outer", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 7
# 5

With Panda’s function merge() and the parameter how set to outer, you can perform an outer join.

The resulting DataFrame of an outer join contains all values from both input DataFrames; missing values are filled with NaN.

In addition, this puzzle shows how NaN values are counted by the len() function whereas the method count() does not include NaN values.

Replacing NaN

Problem: How to Replace all NaN values in a DataFrame with a given value?

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df.fillna(2.0, inplace=True)
result = df["engine-size"].sum()
print(result)
# 13.8

The method fillna() replaces NaN values with a new value. Thus, the sum of all values in the column engine-size is 13.8.

Length vs. Count Difference — It’s NaN!

Problem: What’s the difference between the len() and the count() functions?

import pandas as pd

df = pd.read_csv("Cars.csv")
df2 = pd.read_csv("Cars2.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Additional dataframe "df2"
# ----------
#     make   origin
# 0  skoda  Czechia
# 1  mazda    Japan
# 2   ford      USA
# ----------

result = pd.merge(df2, df, how="left", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 3
# 1

In a left join, the left DataFrame is the master, and all its values are included in the resulting DataFrame.

Therefore, the result DataFrame contains three rows, yet, since skoda and ford don’t appear in DataFrame df, only one the row for mazda contains value.

Again, we see the difference between using the function len() which also includes NaN values and the method count() which does not count NaN values.

Equals() vs. == When Comparing NaN

Problem:

import pandas as pd

df = pd.read_csv("Cars.csv")

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df["engine-size_copy"] = df["engine-size"]
check1 = (df["engine-size_copy"] == df["engine-size"]).all()
check2 = df["engine-size_copy"].equals(df["engine-size"])
print(check1 == check2)
# False

This code snippet shows how to compare columns or entire DataFrames regarding the shape and the elements.

The comparison using the operator == returns False for our DataFrame because the comparing NaN-values with == always yields False.

On the other hand, df.equals() allows comparing two Series or DataFrames. In this case, NaN-values in the same location are considered to be equal.

The column headers do not need to have the same type, but the elements within the columns must be of the same dtype.

Since the result of check1 is False and the result of check2 yields True, the final output is False.

Where to Go From Here?

Enough theory. Let’s get some practice!

Coders get paid six figures and more because they can solve problems more effectively using machine intelligence and automation.

To become more successful in coding, solve more real problems for real people. That’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?

You build high-value coding skills by working on practical coding projects!

Do you want to stop learning with toy projects and focus on practical code projects that earn you money and solve real problems for people?

🚀 If your answer is YES!, consider becoming a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.

If you just want to learn about the freelancing opportunity, feel free to watch my free webinar “How to Build Your High-Income Skill Python” and learn how I grew my coding business online and how you can, too—from the comfort of your own home.

Join the free webinar now!