# Pandas NaN — Working With Missing Data

Pandas is Excel on steroids—the powerful Python library allows you to analyze structured and tabular data with surprising efficiency and ease. Pandas is one of the reasons why master coders reach 100x the efficiency of average coders. In today’s article, you’ll learn how to work with missing data—in particular, how to handle NaN values in Pandas DataFrames.

You’ll learn about all the different reasons why NaNs appear in your DataFrames—and how to handle them. Let’s get started!

## Checking Series for NaN Values

Problem: How to check a series for NaN values?

Have a look at the following code:

```python
import pandas as pd
import numpy as np

# np.nan is the supported spelling; the alias np.NaN was removed in NumPy 2.0
data = pd.Series([0, np.nan, 2])
result = data.hasnans

print(result)
# True
```

Series can contain `NaN`-values—an abbreviation for Not-A-Number—that describe undefined values.

To check if a Series contains one or more `NaN` values, use the attribute `hasnans`. The attribute returns `True` if there is at least one `NaN` value and `False` otherwise.

There’s a `NaN` value in the Series, so the output is `True`.
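The attribute `hasnans` only tells you whether a `NaN` exists. If you also want to know how many there are and where they sit, a small sketch with `isna()` might look like this (the variable names `mask`, `n_missing`, and `positions` are chosen just for illustration):

```python
import numpy as np
import pandas as pd

data = pd.Series([0, np.nan, 2])

mask = data.isna()                     # element-wise Boolean mask
n_missing = int(mask.sum())            # True counts as 1
positions = data.index[mask].tolist()  # index labels of the NaN values

print(mask.tolist())  # [False, True, False]
print(n_missing)      # 1
print(positions)      # [1]
```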

## Filtering Series Generates NaN

Problem: When you filter a Series with `where()`, what happens to the elements that don’t pass the filtering condition?

```python
import pandas as pd

xs = pd.Series([5, 1, 4, 2, 3])
xs.where(xs > 2, inplace=True)
result = xs.hasnans

print(result)
# True
```

The method `where()` masks a Series by a condition. The resulting Series has the same length as the original one: elements that satisfy the condition keep their value. And what happens if a value doesn’t satisfy the condition? By default, all rows not satisfying the condition are filled with `NaN` values.

This is why our Series contains `NaN`-values after filtering it with the method `where()`.
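If you don't want `NaN` values after filtering, there are two common alternatives. The following sketch compares them with the default behavior; the names `masked`, `replaced`, and `filtered` are only illustrative:

```python
import pandas as pd

xs = pd.Series([5, 1, 4, 2, 3])

masked = xs.where(xs > 2)       # same length, NaN where the condition fails
replaced = xs.where(xs > 2, 0)  # second argument replaces the failing values
filtered = xs[xs > 2]           # Boolean indexing removes the failing rows

print(masked.tolist())    # [5.0, nan, 4.0, nan, 3.0]
print(replaced.tolist())  # [5, 0, 4, 0, 3]
print(filtered.tolist())  # [5, 4, 3]
```

Note that `masked` holds floats: an integer Series is converted to `float64` as soon as it has to store `NaN`.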

## Working with Multiple Series of Different Lengths

Problem: If you element-wise add two Series objects with a different number of elements—what happens with the remaining elements?

```python
import pandas as pd

s = pd.Series(range(0, 10))
t = pd.Series(range(0, 20))
result = (s + t)[1]

print(result)
# 2.0
```

To add two Series element-wise, use the default addition operator `+`. Pandas aligns the two Series by index label, so they do not need to have the same size: every label that exists in only one of the Series yields a `NaN` value in the result.

At index `1` in the resulting Series, you get the result of `1 + 1 = 2`. Because the result also holds `NaN` values, it is stored as a float and printed as `2.0`.
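If you'd rather treat the missing elements of the shorter Series as zero, the method `add()` with the parameter `fill_value` is one option. Here's a minimal sketch that compares it with the plain `+` operator (the names `plain` and `filled` are illustrative):

```python
import pandas as pd

s = pd.Series(range(0, 10))
t = pd.Series(range(0, 20))

plain = s + t                    # labels 10..19 exist only in t -> NaN
filled = s.add(t, fill_value=0)  # a missing operand is treated as 0

print(len(plain))               # 20
print(int(plain.isna().sum()))  # 10
print(int(filled[15]))          # 15
```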

## Create a DataFrame From a List of Dictionaries with Unequal Keys

Problem: How to create a DataFrame from a list of dictionaries if the dictionaries have unequal keys? A DataFrame expects the same columns to be available for each row!

```python
import pandas as pd

data = [{'Car': 'Mercedes', 'Driver': 'Hamilton, Lewis'},
        {'Car': 'Ferrari', 'Driver': 'Schumacher, Michael'},
        {'Car': 'Lamborghini'}]

df = pd.DataFrame(data, index=['Rank 2', 'Rank 1', 'Rank 3'])
df.sort_index(inplace=True)
result = df['Car'].iloc[0]  # first row after sorting by index label

print(result)
# Ferrari
```

You can create a DataFrame from a list of dictionaries. The dictionaries’ keys define the column labels, and the values define the columns’ entries. Not all dictionaries must contain the same keys. If a dictionary doesn’t contain a particular key, this will be interpreted as a `NaN`-value.

This code snippet uses string labels as index values to sort the DataFrame. After sorting the DataFrame, the row with index label `Rank 1` is at location `0` in the DataFrame and the value in the column `Car` is `Ferrari`.
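To see the `NaN` value that the missing key produces, you can inspect the column `Driver` directly. This short sketch reuses the same list of dictionaries, without the custom index:

```python
import pandas as pd

data = [{'Car': 'Mercedes', 'Driver': 'Hamilton, Lewis'},
        {'Car': 'Ferrari', 'Driver': 'Schumacher, Michael'},
        {'Car': 'Lamborghini'}]  # no 'Driver' key

df = pd.DataFrame(data)

print(df.columns.to_list())           # ['Car', 'Driver']
print(df['Driver'].isna().to_list())  # [False, False, True]
```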

## Sorting a DataFrame by Column with NaN Values

Problem: What happens if you sort a DataFrame by column if the column contains a `NaN` value?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

selection = df.sort_values(by="engine-size")
result = selection.index.to_list()[0]  # index label of the first sorted row
print(result)
# 1
```

In this code snippet, you sort the rows of the DataFrame by the values of the column `engine-size`.

The main point is that `NaN` values are moved to the end by default in Pandas sorting. Thus, the first value is `1.8`, which belongs to the row with index value `1`.
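The position of the `NaN` rows can be changed with the parameter `na_position` of `sort_values()`. The following sketch uses a reduced version of the DataFrame with only two columns, which is an assumption made to keep the example short:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "make": ["audi", "dodge", "mazda", "porsche", "volvo"],
    "engine-size": [2.0, 1.8, np.nan, 6.0, 2.0],
})

last = df.sort_values(by="engine-size").index.to_list()
first = df.sort_values(by="engine-size", na_position="first").index.to_list()
desc = df.sort_values(by="engine-size", ascending=False).index.to_list()

print(last[-1])  # 2 -> the NaN row comes last by default
print(first[0])  # 2 -> the NaN row comes first
print(desc[-1])  # 2 -> still last when sorting in descending order
```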

## Count Non-NaN Values

Problem: How to count the number of elements in a DataFrame column that are not `NaN`?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

result = df.count()["engine-size"]  # count() returns one value per column
print(result)
# 4
```

The method `count()` returns the number of non-`NaN` values for each column. The DataFrame `df` has five rows. The column `engine-size` contains one `NaN` value. Therefore, the count of this column is `4`.
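The complement of `count()` is the number of missing values per column, which you can get by combining `isna()` and `sum()`. A minimal sketch, assuming a reduced DataFrame with only the columns `price` and `engine-size`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [30000, 17000, 17000, 120000, 25000],
    "engine-size": [2.0, 1.8, np.nan, 6.0, 2.0],
})

counts = df.count()        # non-NaN values per column
missing = df.isna().sum()  # NaN values per column

print(counts["engine-size"])   # 4
print(missing["engine-size"])  # 1
print(counts["price"])         # 5
```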

## Drop NaN-Values

Problem: How to drop all rows that contain a `NaN` value in any of its columns—and how to restrict this to certain columns?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

selection1 = df.dropna(subset=["price"])
selection2 = df.dropna()
print(len(selection1), len(selection2))
# 5 4
```

The DataFrame’s `dropna()` method drops all rows that contain a `NaN` value in any of its columns. But how to restrict the columns to be scanned for `NaN` values?

By passing a list of column labels to the optional parameter `subset`, you can define which columns you want to consider.

Calling `dropna()` without restriction drops the row with index `2` because of the `NaN` value in the column `engine-size`. When you restrict the columns only to `price`, no rows will be dropped, because no `NaN` value is present.
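Besides `subset`, the method `dropna()` also accepts the parameters `how` and `thresh` to control how many missing values a row may contain. The sketch below uses a small hypothetical DataFrame (not the one from the puzzle) that includes a row with only missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one NaN, row 2 is missing everywhere
df = pd.DataFrame({
    "make": ["audi", "mazda", None],
    "price": [30000.0, 17000.0, np.nan],
    "engine-size": [2.0, np.nan, np.nan],
})

any_nan = df.dropna()               # default how="any": drop rows with any NaN
all_nan = df.dropna(how="all")      # drop only rows where every value is missing
at_least_two = df.dropna(thresh=2)  # keep rows with at least 2 non-missing values

print(len(any_nan), len(all_nan), len(at_least_two))  # 1 2 2
```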

## Drop NaN and Reset Index

Problem: What happens to indices after dropping certain rows?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df.drop([0, 1, 2], inplace=True)
df.reset_index(inplace=True)
result = df.index.to_list()
print(result)
# [0, 1]
```

The method `drop()` on a DataFrame deletes rows or columns by index. You can either pass a single value or a list of values.

By default the `inplace` parameter is set to `False`, so that modifications won’t affect the initial DataFrame object. Instead, the method returns a modified copy of the DataFrame. In the puzzle, you set `inplace` to `True`, so the deletions are performed directly on the DataFrame.

After deleting the first three rows, the first two index labels are 3 and 4. You can reset the default indexing by calling the method `reset_index()` on the DataFrame, so that the index starts at 0 again. As there are only two rows left in the DataFrame, the result is `[0, 1]`.
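One detail worth knowing: by default, `reset_index()` keeps the old index labels as a new column named `index`. If you don't need them, you can pass `drop=True`. A small sketch with a one-column version of the DataFrame (an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "dodge", "mazda", "porsche", "volvo"]})

rest = df.drop([0, 1, 2])               # returns a copy, df stays unchanged
kept = rest.reset_index()               # old labels move into a column "index"
dropped = rest.reset_index(drop=True)   # old labels are discarded

print(rest.index.to_list())       # [3, 4]
print(kept.columns.to_list())     # ['index', 'make']
print(dropped.columns.to_list())  # ['make']
print(dropped.index.to_list())    # [0, 1]
```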

## Concatenation of Dissimilar DataFrames Filled With NaN

Problem: How to concatenate two DataFrames if they have different columns?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Dataframe "df2"
# ----------
#      make   origin
# 0   skoda  Czechia
# 1  toyota    Japan
# 2    ford      USA
# ----------

try:
    result = pd.concat([df, df2], axis=0, ignore_index=True)
    print("Y")
except Exception:
    print("N")

# Y
```

Even if DataFrames have different columns, you can concatenate them.

If DataFrame 1 has columns A and B and DataFrame 2 has columns C and D, the result of concatenating DataFrames 1 and 2 is a DataFrame with columns A, B, C, and D. Missing values in the rows are filled with `NaN`.
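To see which cells are filled with `NaN`, here's a minimal sketch with two shortened, two-row DataFrames (the reduced data is an assumption made for brevity). It also shows the parameter `join="inner"`, which keeps only the columns that both DataFrames share:

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "dodge"], "price": [30000, 17000]})
df2 = pd.DataFrame({"make": ["skoda", "toyota"], "origin": ["Czechia", "Japan"]})

result = pd.concat([df, df2], axis=0, ignore_index=True)
inner = pd.concat([df, df2], axis=0, join="inner", ignore_index=True)

print(result.columns.to_list())           # ['make', 'price', 'origin']
print(int(result["price"].isna().sum()))  # 2 -> rows that came from df2
print(int(result["origin"].isna().sum())) # 2 -> rows that came from df
print(inner.columns.to_list())            # ['make']
```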

## Outer Merge

Problem: When merging (=joining) two DataFrames—what happens if there are missing values?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Dataframe "df2"
# ----------
#     make   origin
# 0  skoda  Czechia
# 1  mazda    Japan
# 2   ford      USA
# ----------

result = pd.merge(df, df2, how="outer", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 7
# 5
```

With the Pandas function `merge()` and the parameter `how` set to `outer`, you can perform an outer join.

The resulting DataFrame of an outer join contains all values from both input DataFrames; missing values are filled with `NaN`.

In addition, this puzzle shows how `NaN` values are counted by the `len()` function whereas the method `count()` does not include `NaN` values.
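If you want to know which input DataFrame each row of the outer join came from, the parameter `indicator=True` adds a column `_merge` to the result. A sketch with reduced versions of `df` and `df2` (only the columns needed here, which is an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["audi", "dodge", "mazda", "porsche", "volvo"],
    "fuel": ["gas", "gas", "diesel", "gas", "diesel"],
})
df2 = pd.DataFrame({
    "make": ["skoda", "mazda", "ford"],
    "origin": ["Czechia", "Japan", "USA"],
})

result = pd.merge(df, df2, how="outer", on="make", indicator=True)

left_only = int((result["_merge"] == "left_only").sum())
both = int((result["_merge"] == "both").sum())
right_only = int((result["_merge"] == "right_only").sum())

print(left_only, both, right_only)  # 4 1 2
print(int(result["fuel"].isna().sum()))  # 2 -> skoda and ford have no fuel value
```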

## Replacing NaN

Problem: How to replace all `NaN` values in a DataFrame with a given value?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df.fillna(2.0, inplace=True)
result = df["engine-size"].sum()
print(result)
# 13.8
```

The method `fillna()` replaces `NaN` values with a new value. Thus, the sum of all values in the column `engine-size` is 13.8.
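Instead of a single constant for the whole DataFrame, `fillna()` also accepts a dictionary that maps column labels to fill values. The following sketch fills the gap in `engine-size` with the column mean; both the choice of the mean and the reduced two-column DataFrame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [30000, 17000, 17000, 120000, 25000],
    "engine-size": [2.0, 1.8, np.nan, 6.0, 2.0],
})

mean_size = df["engine-size"].mean()  # mean() skips NaN: (2.0 + 1.8 + 6.0 + 2.0) / 4
filled = df.fillna({"engine-size": mean_size})  # returns a copy, df is unchanged

print(round(mean_size, 2))                      # 2.95
print(int(filled["engine-size"].isna().sum()))  # 0
print(int(df["engine-size"].isna().sum()))      # 1
```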

## Length vs. Count Difference — It’s NaN!

Problem: What’s the difference between the `len()` and the `count()` functions?

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

# Dataframe "df2"
# ----------
#     make   origin
# 0  skoda  Czechia
# 1  mazda    Japan
# 2   ford      USA
# ----------

result = pd.merge(df2, df, how="left", left_on="make", right_on="make")
print(len(result["fuel"]))
print(result["fuel"].count())
# 3
# 1
```

In a left join, the left DataFrame is the master, and all its values are included in the resulting DataFrame.

Therefore, the resulting DataFrame contains three rows. Since `skoda` and `ford` don’t appear in DataFrame `df`, only the row for `mazda` contains a value in the column `fuel`.

Again, we see the difference between using the function `len()` which also includes `NaN` values and the method `count()` which does not count `NaN` values.

## Equals() vs. == When Comparing NaN

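Problem: What’s the difference between the operator `==` and the method `equals()` when comparing columns that contain `NaN` values?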

```python
import pandas as pd

# Dataframe "df"
# ----------
#       make    fuel aspiration   body-style   price  engine-size
# 0     audi     gas      turbo        sedan   30000          2.0
# 1    dodge     gas        std        sedan   17000          1.8
# 2    mazda  diesel        std        sedan   17000          NaN
# 3  porsche     gas      turbo  convertible  120000          6.0
# 4    volvo  diesel        std        sedan   25000          2.0
# ----------

df["engine-size_copy"] = df["engine-size"]
check1 = (df["engine-size_copy"] == df["engine-size"]).all()
check2 = df["engine-size_copy"].equals(df["engine-size"])
print(check1 == check2)
# False
```

This code snippet shows how to compare columns or entire DataFrames regarding the shape and the elements.

The element-wise comparison using the operator `==` yields `False` at the position of the `NaN` value because comparing `NaN` values with `==` always yields `False`. As a result, the call of `all()` returns `False` for our DataFrame.

On the other hand, `df.equals()` allows comparing two Series or DataFrames. In this case, `NaN`-values in the same location are considered to be equal.

The column headers do not need to have the same type, but the elements within the columns must be of the same `dtype`.

Since the result of `check1` is `False` and the result of `check2` yields `True`, the final output is `False`.
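The behavior is easier to see on a tiny Series. The following sketch also shows one possible way to build a `NaN`-aware element-wise comparison by hand; the helper expression `same` is illustrative and not a built-in Pandas function:

```python
import numpy as np
import pandas as pd

a = pd.Series([2.0, np.nan])
b = a.copy()

print(np.nan == np.nan)   # False
print((a == b).tolist())  # [True, False]
print(a.equals(b))        # True

# Element-wise comparison that treats NaN in the same position as equal
same = (a == b) | (a.isna() & b.isna())
print(same.tolist())      # [True, True]
```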

## Where to Go From Here?

Enough theory. Let’s get some practice!

Coders get paid six figures and more because they can solve problems more effectively using machine intelligence and automation.

To become more successful in coding, solve more real problems for real people. That’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?

You build high-value coding skills by working on practical coding projects!

Do you want to stop learning with toy projects and focus on practical code projects that earn you money and solve real problems for people?

🚀 If your answer is YES!, consider becoming a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.

If you just want to learn about the freelancing opportunity, feel free to watch my free webinar “How to Build Your High-Income Skill Python” and learn how I grew my coding business online and how you can, too—from the comfort of your own home.
