Normal Distribution and Shapiro-Wilk Test in Python

Normally distributed data is a statistical prerequisite for parametric tests such as Pearson’s correlation, t-tests, and regression.

  • Normality can be checked visually with sns.displot(x, kde=True).
  • The quickest way to run the Shapiro-Wilk test for normality is pingouin's pg.normality(x).

💡 Note: Several publications note that normal distribution is the least important prerequisite for parametric tests and that, with large sample sizes, you can assume normal distribution. Check this paper for more details.

Python Libraries for Normal Distribution and Shapiro-Wilk

We import pingouin, seaborn and SciPy. SciPy is the standard package for statistical tests and pingouin is a package for quick one-line statistical tests.

import pandas as pd
import pingouin as pg
import seaborn as sns
import scipy.stats  # import the stats submodule explicitly so scipy.stats.shapiro is available

Method 1: Seaborn

We load the dataset about different species and sizes of penguins from seaborn. 

penguins = sns.load_dataset('penguins')
penguins.head() 

We’ll check out the bill length of the penguins more closely. With Seaborn, we can plot a distribution curve over our data.

A normal distribution has the shape of a Gaussian curve. That makes a distribution plot a good way to check normality visually: you can see right away whether the data forms a bell curve or not.

sns.displot(penguins["bill_length_mm"], kde=True)

Output:

This distribution comes close to a bell curve, but it does not look quite normal.

The Shapiro-Wilk test is a test for normal distribution and can confirm our assumption.

The hypotheses for the test are:

  • H0: Our data is normally distributed.
  • H1: Our data is not normally distributed.

If the test is significant, we’ll have to reject H0, meaning that we assume H1 is true, and the data is not normally distributed. 
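The decision rule above can be sketched in a few lines. This is only an illustration: the helper name `shapiro_decision`, the significance level `alpha=0.05` (a common convention), and the synthetic exponential sample are assumptions made for this example and are not part of the penguins walkthrough.

```python
import numpy as np
from scipy import stats


def shapiro_decision(x, alpha=0.05):
    """Run the Shapiro-Wilk test and report whether H0 (normality) is rejected.

    `alpha` is an assumed significance level; 0.05 is a common convention.
    """
    statistic, p_value = stats.shapiro(x)
    reject_h0 = bool(p_value < alpha)
    return statistic, p_value, reject_h0


# Synthetic, clearly skewed data (hypothetical, for illustration only).
rng = np.random.default_rng(seed=0)
skewed = rng.exponential(scale=1.0, size=500)

statistic, p_value, reject_h0 = shapiro_decision(skewed)
print(f"W = {statistic:.3f}, p = {p_value:.3g}, reject H0: {reject_h0}")
```

For a strongly skewed sample like this one, the test comes out significant, so H0 is rejected.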

Method 2: Shapiro-Wilk Test with Pingouin

With the package pingouin, we can have a quick test output. For instance, the function call pg.normality(x) will give us the results of the Shapiro-Wilk test while automatically dropping missing values.

Here’s an example of testing normality on the penguins dataset loaded above:

pg.normality(penguins["bill_length_mm"])

The p-value is significant, so we will reject the H0 assumption that our data is normally distributed and confirm our visual assumption of non-normal distribution.

Method 3: Shapiro-Wilk Test in SciPy

The Shapiro-Wilk test can also be run with scipy.stats.shapiro(x). However, SciPy does not drop missing values automatically, and a sample that contains NaN values does not produce a valid result. Therefore, we must drop them beforehand.

bill_length = penguins["bill_length_mm"].dropna()
scipy.stats.shapiro(bill_length)

Output:

This delivers the same result and confirms our assumption that the variable is not normally distributed.
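If the data lives in a plain NumPy array rather than a pandas Series, the same clean-up step can be done with a boolean mask. The following is a minimal sketch, assuming a hypothetical array `values` with a few missing entries. The numbers are synthetic and are not taken from the penguins dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with missing values (NaN), for illustration only.
rng = np.random.default_rng(seed=1)
values = rng.normal(loc=44.0, scale=5.0, size=200)
values[[3, 57, 120]] = np.nan

# Keep only the observed values before running the test.
observed = values[~np.isnan(values)]

statistic, p_value = stats.shapiro(observed)
print(f"n = {observed.size}, W = {statistic:.3f}, p = {p_value:.3f}")
```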

Normal Distribution on the Iris Dataset

A normally distributed variable would look more like the sepal width from the iris dataset:

iris = sns.load_dataset('iris')
sns.displot(iris["sepal_width"], kde=True)

Output:

pg.normality(iris["sepal_width"])

Output:

scipy.stats.shapiro(iris["sepal_width"])

Output:

Here, the Shapiro-Wilk test is not significant, so we cannot reject H0 and treat the data as normally distributed.

If you want to apply parametric tests such as Pearson’s correlation to your data, you mostly still can: normal distribution is not a hard prerequisite, and with large samples parametric tests are fairly robust to deviations from normality.

You can also z-transform (standardize) your data so that each variable has a mean of 0 and a standard deviation of 1. This is especially useful for machine learning algorithms.
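A z-transform subtracts the mean and divides by the standard deviation. Here is a minimal sketch using NumPy on made-up numbers. The array `x` and the choice of the population standard deviation (`ddof=0`) are assumptions for this example.

```python
import numpy as np

# Made-up measurements, for illustration only.
x = np.array([39.1, 39.5, 40.3, 36.7, 39.3, 38.9])

# z-transform: center on the mean, then scale by the standard deviation.
z = (x - x.mean()) / x.std(ddof=0)

print(z.round(3))
print(f"mean = {z.mean():.3f}, std = {z.std(ddof=0):.3f}")
```

Note that this only shifts and rescales the values. It does not change the shape of the distribution, so a skewed variable stays skewed.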

