The Ultimate Guide to Data Cleaning in Python and Pandas

What is Data Cleaning?

Data cleaning describes the process of turning messy data into clean datasets that can be used for research and data science purposes. Tidy data is in a wide format: every column contains a variable, and every row contains one case. Data cleaning also means getting rid of corrupt data that has little utility.

📄 Related Tutorial: Data Preparation in Python

How to Clean Unstructured Data in Python?

Most real-world data is messy and unstructured or semi-structured. When working in data science, you will spend most of your time cleaning and structuring data.

In research, data from surveys or experiments is mostly already structured into Excel or CSV tables.

In companies, data can be stored in databases, Excel files, or distributed all over the company. This data can be in emails, documents, folders, images, and note apps. 

This study shows that most companies have problems handling unstructured or semi-structured data, and almost half of them don’t even know where their data is located.

Unstructured data includes videos, images, and text or speech messages. Unstructured data from the web is mainly acquired by web scraping.

Semi-structured data is data found in documents, emails, social media posts, and if acquired from the web, it can be in HTML, JSON, or any other web format. 

Is Web Scraping Legal?

🛑 Important: Just because web scraping is possible does not mean it is always legal!

If the data is publicly available and not copyrighted, it is mostly safe to scrape. But also pay attention to data privacy laws and do not scrape personal data.

Scraping data from social media websites, for example, is mostly illegal, as it is not publicly available without logging in and contains personal data.

There are also many services to get data via an API. To be safe, refer to this guide about what is legal when web scraping:

What is the Goal of Data Cleaning?

The goal of data cleaning, and of cleaning unstructured or semi-structured data, is to create tidy data that you can work with. Tidy data is in a wide format: every column contains a variable, and every row contains one case.
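
For illustration, a minimal tidy data frame could look like this (the values here are placeholders, not real figures):

import pandas as pd

# every column is one variable, every row is one case (here: one city)
tidy_example = pd.DataFrame({
    'city': ['A', 'B', 'C'],
    'country': ['X', 'Y', 'Z'],
    'population': [1000000, 2000000, 3000000],
})
print(tidy_example)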

To demonstrate both perspectives, this article is divided into two parts:

  • First, we will scrape, load, and wrangle some semi-structured data from the web.
  • Second, we will clean this data. This second step also applies to structured data, as it is about finding missing values, outliers, and duplicates.

I recommend doing an exploratory data analysis before or while cleaning your data to get a good feel for the data you have. Exploring and cleaning are easy to combine.

Python HTML Scraping

First, we will scrape a table from Wikipedia of the largest cities in the world. Scraping from Wikipedia is legal because

  1. The data is publicly available 
  2. The data is released under a Creative Commons license, which means the content is free to copy, share, and adapt

First, we load our packages:

import pandas as pd # pandas for data wrangling, cleaning, and analysis
import requests # for http requests
from bs4 import BeautifulSoup # html reading

The wiki page that holds the table we are looking for can be found here:

We pass this URL into requests. The table in the HTML page sits inside a <table> tag, and tables on Wikipedia pages carry the class wikitable. We can check this by inspecting the HTML page in our browser, or by looking at the HTML text later, to confirm we pulled the right table.

With requests.get(url).text, we pull the HTML from the page. 

url = 'https://en.wikipedia.org/wiki/List_of_largest_cities'
url_response = requests.get(url).text

BeautifulSoup will pull the data table out of the HTML and save us time. We pass the url_response from our request into the html.parser. With soup.find() we tell it to look specifically for the wikitable. The output also shows us the name of the table.

soup = BeautifulSoup(url_response, 'html.parser')
html_listofcities = soup.find('table', {'class': 'wikitable'})
html_listofcities

Output: 

Then we use pandas to read the HTML table and turn it into a pandas data frame, just like we would load any other data into Python.

df = pd.read_html(str(html_listofcities))
cities = pd.DataFrame(df[0])
print(cities.head())

Output:

The data looks messy. Now we get to clean!

Python HTML Cleaning

Let us inspect our newfound data:

cities

Output:

I will drop the metropolitan area and urban area columns because I am only interested in the population of the actual city. This can be done in several ways.

Here are two:

cities.drop(cities.columns[[7, 8, 9, 10, 11, 12]], axis=1, inplace=True)
# or
cities = cities.drop(cities.columns[7:], axis=1)

Then we will drop the first header row, as it does not contain any useful information, and rename the remaining column headers.

# Drop the top level of the column MultiIndex
cities.columns = cities.columns.droplevel(0)

Output:

# rename the columns
cities.columns = ['city', 'country', 'UN_2018_population_estimates',
                  'city_definition', 'city_population',
                  'city_area_km2', 'city_density/km2']

Output:

The headers look clean. Now we explore the dataset to find what needs cleaning.

With df.info() and df.describe() we get a quick overview of the data we scraped. 

cities.info()

Output:

cities.describe()

Output:

It is immediately clear that the city_density/km2 is not a float even though it is supposed to be numerical.

Inspecting the data frame, you might already have noticed that the columns contain numbers followed by reference numbers in brackets, like [12]. These references turn the data into the object type, so we will have to get rid of them.

There are different ways to remove characters from a string in Python. We could just cut a fixed number of characters off the end of each string.
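
As a quick sketch, that naive approach could look like the following (shown only to illustrate the idea, not as the method we will use):

# cut the last 4 characters (a reference like '[12]') off the end of each entry
# this only works if every entry ends with a reference of exactly that length
print(cities['city_density/km2'].str[:-4])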

However, this would not work if some of our data points have no bracketed reference at the end, or a longer one. So we’ll use the string method str.partition() to cut the brackets off our numbers.

First, we make sure the column is of string type so we can apply string operations to it. Then we apply the str.partition() method and tell it to split at the first bracket [.

cities['city_density/km2'] = cities['city_density/km2'].astype('string')
city_density_str = cities['city_density/km2'].str.partition('[')
print(city_density_str)

Output:

What we want is the first column, so we select it with [0] and assign it to a variable.

city_density_str_col_1 = city_density_str[0]
print(city_density_str_col_1)

Output:

The commas in the values would prevent us from converting the strings into floats, so we remove them with str.replace() before turning the strings into floats with .astype('float') and assigning the result back to our data frame.

city_density_str_col_1 = city_density_str_col_1.str.replace(',','')
cities['city_density/km2'] = city_density_str_col_1.astype('float')
print(cities['city_density/km2'])

Output:

The variable now shows up when we look at df.describe(), and we round the results for better readability:

cities.describe().round(1)

Output:

Cleaning Structured Data in Python

Now that the scraped data is wrangled, we can work with it just like a structured data frame we collected or downloaded ourselves.

Structured data can also be cleaned of missing values, outliers, and duplicates, but it does not always need data wrangling. However, in a data frame with many strings, the cleaning process often involves a lot of string manipulation as well.

Important note:

If you want to apply machine learning algorithms to your data, split your dataset before feature engineering and data transformation, as doing these steps on the full dataset can create data leakage!
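
A minimal sketch of that workflow, assuming you use scikit-learn and a hypothetical target column named 'target':

from sklearn.model_selection import train_test_split

# 'df' and the 'target' column are hypothetical placeholders for your own dataset
X = df.drop(columns=['target'])
y = df['target']

# split first; fit any imputers, scalers, or encoders on the training set only,
# so no information from the test set leaks into the transformations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)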

Dropping duplicates is easy:

cities = cities.drop_duplicates()

There didn’t seem to be duplicates in our df, as the size remained the same.
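
If you want to verify this instead of comparing data frame sizes, you can count duplicated rows explicitly; a short sketch:

# number of rows that are exact duplicates of an earlier row
print(cities.duplicated().sum())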

When dealing with missing values, we must decide how to handle them based on our data. 

We can either

  • Drop missing values
  • Replace or impute the values
  • Leave missing values in the dataset 
  • Transform the information that they’re missing into a new variable

First, we inspect our missing data. The method df.isnull() returns a boolean data frame that tells us, for every cell, whether the value is missing or not.

We can sum it up to determine how many values are missing in each column.

cities.isnull().sum()

Output:

We could drop rows with missing values completely.

However, this would also cause us to lose useful information in other columns. But as the first row is completely empty anyway, we can drop that one.

The df.dropna() function has useful parameters that let us pick which missing data to remove. Here, I just want to remove the row, or all rows if there are more, in which every value is missing.

cities = cities.dropna(how='all')

The result looks like this:

What is left are the missing population, area, and density values for 8 cities. We will replace those.

Of course, you can look up the data on Wikipedia and reinsert them. For the sake of the exercise and because most of the time it is not possible to look up missing data, we will not do this.

The dataset now contains data on the 73 biggest cities in the world. Using their average to impute the missing values of the other 8 cities is the closest guess we have. This does not add much information, but it keeps us from losing the other information about these 8 cities.

The alternative option would be to drop those 8 cities completely.

So, we’ll replace the missing values in the area column with the average area of all the other cities. First, we compute the mean of the city areas, then we fill the missing values in the column with this value.

Pandas has the right function for this: df.fillna()

city_area_mean = cities['city_area_km2'].mean()
cities['city_area_km2'] = cities['city_area_km2'].fillna(value=city_area_mean)
cities.isnull().sum()

Output:

The output shows that we have now replaced several missing values.

For the population, we luckily have another column that shows the population estimates of the UN in 2018 for each city.

So, we can use these to impute the missing population data, as it is as close as we can get to replacing them accurately. 

cities['city_population'] = cities['city_population'].fillna(cities['UN_2018_population_estimates'])
cities['city_population'].describe().round(1)

Output:

We still have missing values in the city density column. These we can now calculate by dividing the population by the area.

So, we create a new variable to calculate the density with our new imputed data. Then we fill the missing values with this calculated density.

city_density = cities['city_population']/cities['city_area_km2']
cities['city_density/km2'] = cities['city_density/km2'].fillna(value=city_density)
cities['city_density/km2'].describe().round(1)

Output:

We can check our missing values again and look at the description of our dataset.

cities.isnull().sum()
cities.describe().round(1)

There is still one value missing in our city_definition column. Let’s have a look at its categories.

cities['city_definition'].value_counts()

Output:

As we don’t know if the missing city is a municipality or a capital, we could just replace the missing value with the generic description of “city”, as we know they all are cities.

If you wanted to analyze the differences between these categories, it would be useful to merge these single entries into bigger categories, as sketched below.
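
A possible sketch of such a merge, using a hypothetical mapping (the raw labels below are only examples and would have to match the actual values in your city_definition column):

# map several raw labels onto broader, hypothetical categories;
# values that are not in the mapping keep their original label
category_map = {
    'Municipality': 'municipality',
    'Capital city': 'capital',
    'Capital': 'capital',
}
cities['city_category'] = cities['city_definition'].replace(category_map)
print(cities['city_category'].value_counts())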

For now, we will just replace the missing value with “city”, as I am more interested in the size of the cities than the category.

cities['city_definition'] = cities['city_definition'].fillna('City')
cities.isnull().sum()

Great! We got rid of all the missing values.

💡 Info: Many statistical operations drop missing values by default, so they don’t create a problem there. For machine learning algorithms, missing values must be handled before modelling.

We can also create dummy variables (information missing / not missing), as the fact that data is missing might itself be useful information. This way, the missingness can be included in the data analysis process.
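
A minimal sketch of such a dummy variable, created before you fill in the missing values (here illustrated on the city population column):

# 1 if the value was missing, 0 otherwise; create this before imputing,
# so the information about missingness is preserved as its own variable
cities['city_population_missing'] = cities['city_population'].isnull().astype(int)
print(cities['city_population_missing'].sum())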

Visualization

Now we visualize our data and check for outliers with a seaborn scatterplot.

import seaborn as sns
sns.scatterplot(data=cities, x="city_population",
                y="city_area_km2", size="city_population")

Output:

The city in the top right corner is clearly an outlier, but not one we would want to remove or adjust, as it is not a measurement error. It is just the biggest city (or metropolitan area) in the world!
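
If you also want a numeric check in addition to the plot, one common rule of thumb is the IQR rule; a minimal sketch:

# flag values above Q3 + 1.5 * IQR as potential outliers (a heuristic, not a hard cutoff)
q1 = cities['city_population'].quantile(0.25)
q3 = cities['city_population'].quantile(0.75)
iqr = q3 - q1
outliers = cities[cities['city_population'] > q3 + 1.5 * iqr]
print(outliers[['city', 'city_population']])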

Let’s find out which one it is with df.sort_values(), using ascending=False to sort the city population from high to low.

cities.sort_values('city_population', ascending=False)

Output:

The biggest city on earth by size and city population is the municipality Chongqing in China with over 32 million inhabitants!

To visualize our cities more beautifully, we can use a sns.relplot() with color and different sizing.

sns.relplot(x="city_population", y="city_area_km2", hue="country", size="city_population",
            sizes=(40, 400), alpha=.5, palette="muted", height=6, data=cities)

Output:

Now keep on cleaning!