With Exploratory Data Analysis (EDA) functions in Python, it is easy to get a quick overview of a dataset. The goal of EDA is to summarize a dataset statistically and visualize it graphically. This helps to discover patterns and missing values and to extract further information for statistical modeling.
The first step in the data analysis process is to get an overview of the data and its structure. This is done by statistical summaries and graphical visualization like bar charts and plots.
Load the Iris Dataset
We will work with the famous Iris dataset, which is publicly available.
The dataset was published in 1936 by R.A. Fisher, based on measurements collected by Edgar Anderson, and contains data on three species of iris flowers. In this analysis, we want to know how the three species differ.
First, we will load the most important libraries for numerical data and data wrangling, NumPy and pandas. For visualization, we will load Matplotlib and Seaborn, and then the dataset. With df.head() we get a first glance at the first 5 rows of the dataset.
# Load important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
iris = pd.read_csv("iris.csv")
iris.head()
Basic Python Functions to Examine and Describe Data
With df.info() we will get an overview of the variables, their non-null counts, and their data types. df.dtypes will also show the data types of the variables in the dataset. However, df.info() also gives us the insight that there are no missing values in the dataset.
With df.describe() we get an overview of the basic descriptive statistics: the mean, standard deviation, minimum, maximum, and percentiles for every numerical variable. This is important for understanding the distribution of the values.
# Describe the dataset
iris.describe()
For a dataset with non-numerical data, you can get an overview including all variables with df.describe(include="all").
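To illustrate the difference, here is a minimal sketch on a hypothetical mini-sample (the values are made up for illustration and are not taken from the Iris file). Note that the include argument is passed as the string "all".

```python
import pandas as pd

# Hypothetical mini-sample with one numeric and one text column (illustrative values)
sample = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
        "Species": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)

# Default: only numeric columns are summarized
numeric_summary = sample.describe()

# include="all" adds count, unique, top, and freq for non-numeric columns
full_summary = sample.describe(include="all")
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of Species) are shown as NaN.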
Python Pandas Missing Values
Before cleaning the dataset, an important step is to look for outliers, the distribution and missing values.
df.info() already gives an insight into the missing values. With df.isnull().sum() we can also sum up the missing values in the dataset.
# Checking for missing values
iris.isnull().sum()
There are no missing values in our dataset.
If we find missing values in a dataset, we can decide in the data cleaning process to delete them, impute them, or leave them in.
In many statistical tests, missing values are dropped by default. However, for most machine learning algorithms it is essential to clean them up beforehand.
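Since the Iris data has no missing values, the following sketch uses a small hypothetical sample to show two of these options, deleting and imputing. Median imputation is only one possible choice and is used here as an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing measurements (not from the Iris data)
sample = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, np.nan, 6.3, 5.8],
        "PetalLengthCm": [1.4, 4.7, np.nan, 5.1],
    }
)

# Count the missing values per column
missing_per_column = sample.isnull().sum()

# Option 1: delete every row that contains a missing value
dropped = sample.dropna()

# Option 2: impute missing values with the median of each column
imputed = sample.fillna(sample.median())

print(missing_per_column)
print(dropped)
print(imputed)
```

Dropping rows is simple but loses data, while imputing keeps all rows at the cost of inserting estimated values.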
Python Pandas Outliers and Normal Distribution
The DataFrame method df.describe() already gave us some insight into the spread and potential outliers. But aside from the mean, standard deviation, and min and max, we’d like more descriptive statistics on the distribution. Pandas has many more functions for further displaying the descriptive statistics of our data.
We use skewness and kurtosis to determine if the data in a variable is normally distributed or skewed. Then we test the hypothesis of a normal or non-normal distribution with the Shapiro-Wilk test.
For the petal length, both the range between the min and the max and the standard deviation are bigger than for the other variables. That tells us that the values of this variable are spread more widely.
The mean, median, and mode differ considerably, which already suggests that this variable is not normally distributed; otherwise they would be roughly the same.
- The mean is just the average of the values.
- The median is the value exactly in the middle, where half of the values lie above and the other half beneath the median.
- The mode is the most frequent value in the variable.
The negative value for kurtosis (pandas reports excess kurtosis, for which the normal distribution has a value of 0) tells us that the distribution of the values in the variable is flatter, with lighter tails, than the normal distribution. If the distribution were more peaked with heavier tails, the value would be positive.
The negative value for the skewness tells us that the distribution is skewed to the left, but not a lot (a symmetric distribution such as the normal distribution has a skewness of 0).
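These statistics can be computed directly on a pandas Series. The sketch below uses hypothetical petal lengths chosen to be left-skewed (they are not the real measurements); with the actual data you would use iris["PetalLengthCm"] instead.

```python
import pandas as pd

# Hypothetical petal lengths in cm, chosen to be left-skewed (not the real data)
petal_length = pd.Series([1.4, 1.5, 1.5, 4.0, 4.5, 4.7, 5.0, 5.1, 5.6, 6.0])

summary = {
    "mean": petal_length.mean(),
    "median": petal_length.median(),
    "mode": petal_length.mode().tolist(),  # mode() can return several values
    "skewness": petal_length.skew(),       # 0 for a symmetric distribution
    "kurtosis": petal_length.kurt(),       # excess kurtosis, 0 for a normal distribution
}
print(summary)
```

In this sample the mean lies below the median, and both skewness and kurtosis come out negative, which is the same pattern described above.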
Normal Distribution Visualization
sns.displot() visualizes the distribution; the kde=True option draws a kernel density estimate (a smoothed curve of the data's own distribution, not a fitted normal curve) over the histogram.
The graph shows us that the variable is clearly not normally distributed, and we have some high values towards the end. We assume that all these are possible and within the variability of measurements of a petal length, so we do not need to remove any outliers.
Several statistical tests require a normal distribution. With the Shapiro-Wilk test, we can test our assumption that the values are not normally distributed.
from scipy.stats import shapiro

shapiro(iris["PetalLengthCm"])
The p-value is so small that we can reject the hypothesis that the data is normally distributed. If it were p > 0.05, we could not reject that hypothesis and could assume a normal distribution.
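The decision rule can be written out explicitly. This is a minimal sketch on a synthetic, strongly skewed sample (a stand-in, not the Iris data), assuming the conventional significance level of 0.05.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Synthetic, strongly right-skewed sample (stand-in for a non-normal variable)
skewed_sample = rng.exponential(scale=1.0, size=200)

# shapiro() returns the test statistic W and the p-value
statistic, p_value = shapiro(skewed_sample)

alpha = 0.05  # conventional significance level (an assumption, adjust as needed)
if p_value < alpha:
    print(f"p = {p_value:.2g}: reject the hypothesis of a normal distribution")
else:
    print(f"p = {p_value:.2g}: no evidence against a normal distribution")
```

The null hypothesis of the Shapiro-Wilk test is that the data is normally distributed, so a small p-value speaks against normality.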
If the test we chose (like regression or t-test) requires a normal distribution of the data, we have two options:
- we either choose a non-parametric test option that does not need a normal distribution or
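- we can transform the data (for example with a log or Box-Cox transformation) to bring it closer to a normal distribution. Standardizing (z-transforming) alone only rescales the data and does not change the shape of its distribution.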
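To see how rescaling and transforming affect the shape of a distribution, the following sketch compares the skewness of a synthetic right-skewed sample before and after a z-transformation and a log transformation. The lognormal sample is an assumption chosen for illustration; a log transformation is not guaranteed to help with every dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (lognormal), used only for illustration
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=500))

# z-transformation: subtract the mean, divide by the standard deviation
z_scores = (values - values.mean()) / values.std()

# Log transformation: changes the shape of the distribution
log_values = np.log(values)

skew_raw = values.skew()
skew_z = z_scores.skew()
skew_log = log_values.skew()

print(f"skewness raw: {skew_raw:.3f}")
print(f"skewness after z-transformation: {skew_z:.3f}")
print(f"skewness after log transformation: {skew_log:.3f}")
```

The z-transformed values have the same skewness as the raw values, while the log-transformed values are much closer to symmetric.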
Python EDA Visualization
There are many great ways to get a graphical overview of the data. We know that the three species of iris flowers are the main distinguishing variable of the dataset. With sns.displot() we can get a distribution plot of the species.
sns.countplot() is a great way to visualize the number of observations in each category of a pandas column.
These plots however do not give us much information except that there are 50 observations per species in the data set.
A scatterplot with sns.scatterplot() that differentiates the categories between the species will be better. So, we will look at the distribution of the sepal length and width and the petal length and width, colored by the variable “Species” to distinguish the difference in size among the three different kinds of iris.
sns.scatterplot(data=iris, x="SepalLengthCm", y="SepalWidthCm", hue="Species")
sns.scatterplot(data=iris, x="PetalLengthCm", y="PetalWidthCm", hue="Species")
sns.PairGrid() gives an easy and quick overview of every combination of variables in our dataset. It is just the right function for a graphical EDA.
eda = sns.PairGrid(iris, hue="Species")
eda.map_diag(sns.histplot)
eda.map_offdiag(sns.scatterplot)
The Pair Grid already shows us that we should consider regression and clustering techniques on the iris dataset. Especially petal length and petal width show a clear linear relationship that we can test with a correlation. A correlation heatmap will visualize this assumption.
The correlation heatmap can be constructed by a combination of a correlation matrix (easily done with df.corr()) and a Seaborn heatmap (sns.heatmap()).
This shows a strong correlation between the petal length and the petal width. There is, however, hardly any correlation between the sepal length and the sepal width.
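A correlation heatmap along these lines could look like the sketch below. It runs on synthetic stand-in data (the column values are generated, not measured) so that it is self-contained; with the real dataset you would pass the iris DataFrame instead. Selecting the numeric columns first is an assumption that avoids errors from text columns such as Species in newer pandas versions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data; with the real dataset, use the `iris` DataFrame instead
rng = np.random.default_rng(0)
petal_length = rng.uniform(1.0, 7.0, size=150)
demo = pd.DataFrame(
    {
        "PetalLengthCm": petal_length,
        # Constructed to depend on petal length, so the two correlate strongly
        "PetalWidthCm": 0.4 * petal_length + rng.normal(scale=0.2, size=150),
        "SepalWidthCm": rng.normal(loc=3.0, scale=0.4, size=150),
        "Species": ["hypothetical"] * 150,
    }
)

# Keep numeric columns only; text columns such as Species cannot be correlated
corr = demo.select_dtypes("number").corr()

# Draw the matrix as a heatmap with the coefficients written into the cells
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
```

Fixing vmin and vmax to -1 and 1 keeps the color scale comparable between datasets. In a script, add plt.show() to display the figure.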
Python Pandas profiling – All in one EDA
A really easy way to do EDA in one line is with pandas profiling. The profiling report produces one complete exploratory analysis of all the variables in the dataset, including the correlation heatmap.
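Pandas profiling can be installed with the command below. Note that the package has since been renamed to ydata-profiling, so newer environments install it with pip install ydata-profiling and import ydata_profiling instead.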
pip install pandas-profiling
Working in Jupyter, the report can be generated directly into the notebook.
import pandas_profiling as pp
pp.ProfileReport(iris)
Next steps and further techniques for data exploration
The next steps in the data analysis process can be clustering and dimension reduction techniques, regression, or hypothesis testing.
The Pair Grid already shows that we should consider regression and clustering techniques on the iris dataset.
All these techniques can also be used for exploratory data analysis if there is no clear research question or hypothesis for the modeling process. This is generally not recommended in research, but it is common in data mining to draw insights from existing data, e.g. from a company.