Normal distribution is a statistical prerequisite for parametric tests like Pearson’s correlation, t-tests, and regression.
- Testing for normal distribution can be done visually with a seaborn distribution plot, sns.displot(x, kde=True).
- The Shapiro-Wilk test for normality can be done quickest with pingouin’s pg.normality(x).
💡 Note: Several publications note that normal distribution is the least important prerequisite for parametric tests and with large sample sizes you can assume normal distribution. Check this paper for more details.
Python Libraries for Normal Distribution and Shapiro-Wilk
We import pandas, pingouin, seaborn, and SciPy. SciPy is the standard package for statistical tests, and pingouin is a package for quick one-line statistical tests.
import pandas as pd
import pingouin as pg
import seaborn as sns
import scipy.stats  # explicit submodule import so scipy.stats.shapiro is available
Method 1: Seaborn
We load the dataset about different species and sizes of penguins from seaborn.
penguins = sns.load_dataset('penguins')
penguins.head()
We’ll check out the bill length of the penguins more closely. With Seaborn, we can plot a distribution curve over our data.
A normal distribution has the shape of the Gaussian bell curve. That makes a distribution plot a good way to check for normality visually: you can see right away whether the data follow a bell curve or not.
This curve comes close to a bell shape, but it does not look normally distributed.
The Shapiro-Wilk test is a test for normal distribution and can confirm our assumption.
The hypotheses for the test are:
- H0: Our data is normally distributed.
- H1: Our data is not normally distributed.
If the test is significant, we’ll have to reject H0, meaning that we assume H1 is true, and the data is not normally distributed.
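The decision rule can be sketched in a few lines. This assumes a significance level of 0.05, which is a common convention but not something the article fixes, and it uses two hypothetical samples (one normal, one strongly skewed) purely for illustration. The helper `shapiro_decision` is an illustrative name, not a library function.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level; the article does not fix one


def shapiro_decision(values, alpha=ALPHA):
    """Run Shapiro-Wilk and report whether H0 (normality) is rejected at `alpha`."""
    statistic, p_value = stats.shapiro(values)
    return statistic, p_value, bool(p_value < alpha)


# Hypothetical samples for illustration only.
rng = np.random.default_rng(seed=42)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=200),
    "exponential": rng.exponential(scale=1.0, size=200),  # strongly right-skewed
}

for name, sample in samples.items():
    w, p, reject_h0 = shapiro_decision(sample)
    verdict = "reject H0" if reject_h0 else "fail to reject H0"
    print(f"{name}: W={w:.3f}, p={p:.4g} -> {verdict}")
```

Note that a non-significant result only means the test found no evidence against normality; it does not prove that the data are normal.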
Method 2: Shapiro-Wilk Test with Pingouin
With the package pingouin, we get a quick test output. For instance, the function call pg.normality(x) will give us the results of the Shapiro-Wilk test while automatically dropping missing values.
Here’s an example for testing normality on the penguins dataset loaded earlier:
The p-value is significant, so we will reject the H0 assumption that our data is normally distributed and confirm our visual assumption of non-normal distribution.
Method 3: Shapiro-Wilk Test in SciPy
The Shapiro-Wilk test can also be done with scipy.stats.shapiro(x). However, SciPy does not drop missing values by default, so a column that contains NaN entries gives an invalid result. Therefore, we must drop them beforehand.
bill_length = penguins["bill_length_mm"].dropna()
scipy.stats.shapiro(bill_length)
This delivers the same results and confirms our assumption that the variable is not normally distributed.
Normal Distribution on the Iris Dataset
A normally distributed variable would look more like the sepal width from the iris dataset:
iris = sns.load_dataset('iris')
sns.displot(iris["sepal_width"], kde=True)
Here, the Shapiro-Wilk test is not significant, so we cannot reject H0 and treat the data as normally distributed.
If you want to apply parametric tests to your data, like a Pearson correlation or a linear regression, you mostly still can: normal distribution is not a hard prerequisite, and with large samples the sampling distribution of the mean approaches a normal distribution (central limit theorem).
You can also z-transform (standardize) your data so that every variable has a mean of 0 and a standard deviation of 1. This is especially useful for machine learning algorithms.
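A minimal sketch of a z-transform with pandas, assuming the usual definition (subtract the mean, divide by the standard deviation). The helper `z_transform` and the synthetic `raw` column are illustrative assumptions, not part of the article's dataset.

```python
import numpy as np
import pandas as pd


def z_transform(values: pd.Series) -> pd.Series:
    """Standardize a numeric column: subtract its mean, divide by its standard deviation."""
    # pandas' std() defaults to the sample standard deviation (ddof=1).
    return (values - values.mean()) / values.std()


# Hypothetical measurements for illustration only.
rng = np.random.default_rng(seed=7)
raw = pd.Series(rng.normal(loc=44.0, scale=5.5, size=500), name="demo_values")
standardized = z_transform(raw)

print(round(standardized.mean(), 6), round(standardized.std(), 6))
```

Standardizing only shifts and rescales the values. The shape of the distribution stays the same, so a skewed variable remains skewed after the z-transform.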