# Normal Distribution and Shapiro-Wilk Test in Python

Normally distributed data is a common assumption of parametric methods such as Pearson’s correlation, t-tests, and regression.

• Normality can be checked visually with `sns.displot(x, kde=True)`.
• The quickest way to run the Shapiro-Wilk test for normality is `pingouin`'s `pg.normality(x)`.

💡 Note: Several publications argue that normality is the least important prerequisite for parametric tests, and that with large sample sizes the assumption matters little. Check this paper for more details.

## Python Libraries for Normal Distribution and Shapiro-Wilk

We import pandas, pingouin, seaborn, and SciPy. SciPy is the standard package for statistical tests, and `pingouin` is a package for quick one-line statistical tests.

```
import pandas as pd
import pingouin as pg
import seaborn as sns
import scipy.stats  # explicit submodule import so scipy.stats.shapiro is available
```

## Method 1: Seaborn

We load the dataset about different species and sizes of penguins from seaborn.

```
penguins = sns.load_dataset('penguins')
```

We’ll check out the bill length of the penguins more closely. With Seaborn, we can plot a distribution curve over our data.

A normal distribution has the shape of the Gaussian bell curve. That makes a distribution plot a quick way to judge normality visually: you can see right away whether the data form a bell curve or not.
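To make the bell shape concrete, here is a minimal sketch using only the standard library's `statistics.NormalDist`. The mean and standard deviation are hypothetical values picked for illustration, not estimates from the penguins data.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 44.0, 5.5
dist = NormalDist(mu=mu, sigma=sigma)

# The density is highest at the mean and symmetric around it.
for offset in (0.0, 0.5 * sigma, sigma, 2 * sigma):
    left = dist.pdf(mu - offset)
    right = dist.pdf(mu + offset)
    print(f"offset={offset:5.2f}  left={left:.5f}  right={right:.5f}")

# Share of values expected within one standard deviation of the mean.
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)
print(f"within one SD: {within_one_sd:.4f}")
```

For a normal distribution, about 68% of the values fall within one standard deviation of the mean. A histogram that is clearly lopsided or has two peaks does not match this shape.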

`sns.displot(penguins["bill_length_mm"], kde=True)`

Output:

The curve comes close to a bell shape, but it does not look quite normally distributed.

The Shapiro-Wilk test is a test for normal distribution and can check this visual impression.

The hypotheses of the test are:

• H0: Our data is normally distributed.
• H1: Our data is not normally distributed.

If the test is significant (conventionally, p < 0.05), we reject H0 and conclude that the data is not normally distributed.
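The decision rule above can be sketched in a few lines. This is only an illustration: the helper name `reject_normality`, the 0.05 threshold, and the synthetic exponential sample are assumptions made for the example and do not come from the penguins data.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # conventional significance level (an assumption for this sketch)

def reject_normality(sample, alpha=ALPHA):
    """Return True if the Shapiro-Wilk test rejects H0 (normality) at level alpha."""
    result = stats.shapiro(sample)
    p_value = result[1]  # the result unpacks as (statistic, p-value)
    return bool(p_value < alpha)

# Synthetic, strongly right-skewed data that is clearly not normal.
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)

skewed_rejected = reject_normality(skewed)
print("Reject H0 for skewed data:", skewed_rejected)
```

Note that a non-significant result does not prove normality. It only means the test found no evidence against H0.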

## Method 2: Shapiro-Wilk Test with Pingouin

With the package `pingouin`, we get the test output in a single call. For instance, the function call `pg.normality(x)` will give us the results of the Shapiro-Wilk test while automatically dropping missing values.

Here’s an example for testing normality on the `penguins` dataset loaded above:

`pg.normality(penguins["bill_length_mm"])`

The p-value is significant, so we reject H0 (that our data is normally distributed), which confirms our visual impression of a non-normal distribution.

## Method 3: Shapiro-Wilk Test in SciPy

The Shapiro-Wilk test can also be done with `scipy.stats.shapiro(x)`. However, SciPy does not automatically drop missing values, and missing values would invalidate the test. Therefore, we must drop them beforehand.

```
bill_length = penguins["bill_length_mm"].dropna()
scipy.stats.shapiro(bill_length)
```

Output:

This delivers the same result and confirms that the variable is not normally distributed.
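If you run the SciPy version often, the two steps (drop missing values, then test) can be wrapped in a small helper. The sketch below is hypothetical: the function name `shapiro_dropna` and the keys of the returned dictionary are choices made for this example, loosely modeled on the convenience that `pingouin` offers, and it is not pingouin's actual implementation. The data is synthetic.

```python
import numpy as np
from scipy import stats

def shapiro_dropna(values, alpha=0.05):
    """Drop NaN values, then run the Shapiro-Wilk test on what remains."""
    arr = np.asarray(values, dtype=float)
    clean = arr[~np.isnan(arr)]  # keep only non-missing observations
    w_stat, p_value = stats.shapiro(clean)
    return {
        "W": float(w_stat),
        "pval": float(p_value),
        "normal": bool(p_value > alpha),  # True means H0 was not rejected
        "n": int(clean.size),
    }

# Synthetic example: 100 draws with three values set to missing.
rng = np.random.default_rng(7)
data = rng.normal(loc=44.0, scale=5.5, size=100)
data[[3, 40, 77]] = np.nan

summary = shapiro_dropna(data)
print(summary)
```

Because the helper converts its input with `np.asarray`, it also accepts a pandas Series such as `penguins["bill_length_mm"]`.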

## Normal Distribution on the Iris Dataset

A normally distributed variable looks more like the sepal width from the iris dataset:

```
iris = sns.load_dataset('iris')
sns.displot(iris["sepal_width"], kde=True)
```

Output:

`pg.normality(iris["sepal_width"])`

Output:

`scipy.stats.shapiro(iris["sepal_width"])`

Output:

Here, the Shapiro-Wilk test is not significant, so we do not reject H0 and treat the data as normally distributed.

If you want to apply parametric tests such as a Pearson correlation or a regression to your data, you mostly still can: normal distribution is not a hard prerequisite, and with large samples the sampling distribution of statistics such as the mean tends toward normal.
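One way to read the large-sample remark is through the central limit theorem: even when the raw values are skewed, the means of repeated samples are distributed much more symmetrically. The simulation below is a rough sketch on synthetic exponential data. The number of samples (2000) and the sample size (50) are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 2000 synthetic samples of 50 strongly right-skewed values each.
raw = rng.exponential(scale=1.0, size=(2000, 50))

# One mean per sample: this is the sampling distribution of the mean.
sample_means = raw.mean(axis=1)

# Skewness near 0 indicates a symmetric, more bell-like distribution.
raw_skew = stats.skew(raw.ravel())
means_skew = stats.skew(sample_means)

print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The raw values stay skewed no matter how many you collect. It is the distribution of the sample means that moves toward a bell shape.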

You can also z-transform (standardize) your data so that every variable has a mean of 0 and a standard deviation of 1. This rescales the values without changing the shape of their distribution, and it is especially useful for machine learning algorithms.
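As a minimal sketch of the z-transform, the snippet below standardizes a synthetic skewed sample with NumPy and compares the Shapiro-Wilk statistic before and after. The data and variable names are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=200)  # synthetic, right-skewed sample

# z-transform: subtract the mean, divide by the sample standard deviation.
z = (x - x.mean()) / x.std(ddof=1)

print(f"mean of z: {z.mean():.6f}")
print(f"std of z:  {z.std(ddof=1):.6f}")

# Compare the Shapiro-Wilk statistic W before and after standardizing.
w_raw = stats.shapiro(x)[0]
w_z = stats.shapiro(z)[0]
print(f"W raw: {w_raw:.4f}  W standardized: {w_z:.4f}")
```

The two W values agree up to rounding, because standardizing only shifts and rescales the data. A skewed variable remains skewed after the z-transform.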
