Normal distribution is a statistical prerequisite for parametric tests like Pearson's correlation, t-tests, and regression.

- Testing for normal distribution can be done visually with `sns.displot(x, kde=True)`.
- The Shapiro-Wilk test for normality can be done quickest with `pingouin`'s `pg.normality(x)`.

💡 **Note**: Several publications note that normal distribution is the least important prerequisite for parametric tests and that, with large sample sizes, moderate departures from normality matter little. Check this paper for more details.

## Python Libraries for Normal Distribution and Shapiro-Wilk

We import pandas, pingouin, seaborn, and SciPy. SciPy is the standard package for statistical tests and `pingouin` is a package for quick one-line statistical tests.

    import pandas as pd
    import pingouin as pg
    import seaborn as sns
    import scipy.stats  # ensures scipy.stats.shapiro is available

## Method 1: Seaborn

We load the dataset about different species and sizes of penguins from seaborn.

    penguins = sns.load_dataset('penguins')
    penguins.head()

We'll check out the bill length of the penguins more closely. With Seaborn, we can plot a distribution curve over our data.

A normal distribution has the shape of the Gaussian curve. That makes a distribution plot a great way to check for normality visually: you can see right away whether it is a bell curve or not.

    sns.displot(penguins["bill_length_mm"], kde=True)

Output:

This curve comes close to a bell shape but does not look normally distributed.

The **Shapiro-Wilk test** is a test for normal distribution and can confirm our assumption.

The hypotheses for the test are:

- **H0**: Our data is normally distributed.
- **H1**: Our data is not normally distributed.

If the test is significant, we'll have to reject H0, meaning that we assume H1 is true, and the data is not normally distributed.
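
The decision rule above can be sketched as a small helper. The function name `interpret_normality` and the significance level of 0.05 are assumptions made for illustration; the article itself does not fix an alpha.

```python
def interpret_normality(p_value: float, alpha: float = 0.05) -> str:
    """Translate a Shapiro-Wilk p-value into a decision about H0.

    alpha=0.05 is a conventional choice, assumed here for illustration.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        # Significant result: reject H0 (normality)
        return "reject H0: data does not look normally distributed"
    # Non-significant result: H0 is not rejected, which is not proof of normality
    return "fail to reject H0: no evidence against normality"


print(interpret_normality(0.003))
print(interpret_normality(0.42))
```

Note that a non-significant result only means the test found no evidence against normality; it does not prove the data is normal.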

## Method 2: Shapiro-Wilk Test with Pingouin

With the package `pingouin`, we can get a quick test output. For instance, the function call `pg.normality(x)` will give us the results of the Shapiro-Wilk test while automatically dropping missing values.

Here's an example of testing normality on the `penguins` dataset loaded earlier:

    pg.normality(penguins["bill_length_mm"])

The p-value is significant, so we will reject the H0 assumption that our data is normally distributed and confirm our visual assumption of non-normal distribution.

## Method 3: Shapiro-Wilk Test in SciPy

The Shapiro-Wilk test can also be done with `scipy.stats.shapiro(x)`. However, SciPy does not automatically drop missing values, so a column that contains them will not produce a valid result. Therefore, we must drop them beforehand.

    bill_length = penguins["bill_length_mm"].dropna()
    scipy.stats.shapiro(bill_length)

Output:

This delivers the same results and confirms our assumption that the variable is not normally distributed.
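
To avoid forgetting the `dropna()` step, the two calls can be wrapped in one function. This is only a sketch: `shapiro_dropna` is a hypothetical helper name, and the sample below is synthetic data with two missing values inserted on purpose, not the penguins dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_dropna(values: pd.Series):
    """Run Shapiro-Wilk on a Series after removing missing values.

    Hypothetical helper: returns (W statistic, p-value, number of values used).
    """
    clean = values.dropna()
    if len(clean) < 3:
        # scipy.stats.shapiro needs at least three observations
        raise ValueError("need at least 3 non-missing values")
    statistic, p_value = stats.shapiro(clean)
    return float(statistic), float(p_value), len(clean)


# Synthetic data with two missing values (made up for illustration)
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=44.0, scale=5.0, size=50))
sample.iloc[[3, 17]] = np.nan

w, p, n_used = shapiro_dropna(sample)
print(f"W={w:.3f}, p={p:.3f}, n={n_used}")
```

Returning the number of values actually used makes it visible how many observations were dropped before testing.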

## Normal Distribution on the Iris Dataset

A normally distributed variable would look more like the sepal width from the iris dataset:

    iris = sns.load_dataset('iris')
    sns.displot(iris["sepal_width"], kde=True)

Output:

    pg.normality(iris["sepal_width"])

Output:

    scipy.stats.shapiro(iris["sepal_width"])

Output:

Here, the Shapiro-Wilk test is not significant, so we cannot reject H0 and treat the data as normally distributed.
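
The contrast between a normal-looking and a non-normal variable can also be reproduced without any dataset. The sketch below uses two synthetic samples (the sample size of 150, the seed, and the distribution parameters are arbitrary choices for illustration) and runs `scipy.stats.shapiro` on each.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two synthetic samples (not the article's data): one drawn from a normal
# distribution, one from a strongly right-skewed exponential distribution.
normal_sample = rng.normal(loc=3.0, scale=0.4, size=150)
skewed_sample = rng.exponential(scale=1.0, size=150)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    statistic, p_value = stats.shapiro(sample)
    print(f"{name}: W={statistic:.3f}, p={p_value:.4f}")
```

The skewed sample should yield a very small p-value, while the p-value for the normal sample will vary with the random draw.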

If you want to apply parametric tests such as a Pearson correlation or a regression to your data, you mostly still can, as normal distribution is not a hard prerequisite, especially with large samples.

You can also z-transform (standardize) your data so that every variable has a mean of 0 and a standard deviation of 1. This is especially useful for machine learning algorithms.
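
A z-transform subtracts the mean from each value and divides by the standard deviation. Here is a minimal sketch using only Python's standard library; the helper name `z_transform` and the example values are made up for illustration. With pandas, the equivalent for a Series `s` would be `(s - s.mean()) / s.std()`.

```python
from statistics import mean, stdev


def z_transform(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like pandas' default std()
    if sigma == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mu) / sigma for v in values]


# Made-up values for illustration
lengths = [39.1, 39.5, 40.3, 36.7, 39.3, 38.9]
z_scores = z_transform(lengths)
print([round(z, 2) for z in z_scores])
```

Standardizing rescales the data but does not change the shape of its distribution, so a skewed variable stays skewed after the transform.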
