5 Best Ways to Perform Grubbs’ Test in Python

💡 Problem Formulation:

Grubbs’ Test, also known as the maximum normalized residual test, is used to detect outliers in a univariate data set assumed to come from a normally distributed population. In Python, users often need to identify and remove outliers from datasets to ensure the accuracy of their statistical analyses. Our goal is to demonstrate effective methods to perform Grubbs’ test in Python, handling input such as a list of numbers and producing output that either flags or removes the outliers based on the test.

Method 1: Using the outliers library

This method relies on the outliers library (distributed on PyPI as outlier_utils), which provides a ready-made Grubbs’ test. Once the package is installed and imported, outlier detection comes down to a single function call.

Here’s an example:

from outliers import smirnov_grubbs as grubbs

data = [8, 9, 10, 10, 22, 10, 11]
filtered_data = grubbs.test(data, alpha=0.05)
print(filtered_data)

Output: [8, 9, 10, 10, 10, 11]

This code snippet demonstrates how to perform Grubbs’ test to detect and remove outliers from a dataset. It uses the smirnov_grubbs module from the outliers library to test our data list, returning a new list with the outlier(s) removed, assuming a significance level of 0.05.

Method 2: Using the scipy library

SciPy, a scientific computing library for Python, provides various statistical tools but does not directly offer a Grubbs’ test function. However, you can implement Grubbs’ test using SciPy’s statistical capabilities to calculate the critical values and compare them with the test statistic.

Here’s an example:

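from scipy import stats
import numpy as np

data = np.array([8, 9, 10, 10, 22, 10, 11])
n = len(data)
alpha = 0.05

# Grubbs' statistic uses the sample standard deviation, hence ddof=1
z_scores = np.abs(stats.zscore(data, ddof=1))

# Two-sided Grubbs' critical value, derived from the t-distribution
t_crit = stats.t.isf(alpha / (2 * n), n - 2)
threshold = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
outliers = np.where(z_scores > threshold)

print('Outliers at indices:', outliers)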

Output: Outliers at indices: (array([4]),)

This snippet standardizes the data and compares each absolute z-score against a critical value derived from the t-distribution, reporting the indices of the points that exceed it. Strictly speaking, Grubbs’ test examines only the single most extreme value at a time. The snippet does not remove the outliers; additional steps are needed for that.
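Going from flagged points to a cleaned array usually takes no more than a boolean mask. The following is a minimal sketch under stated assumptions: it uses sample z-scores (ddof=1), a two-sided Grubbs’ critical value at alpha = 0.05, and the illustrative names `mask` and `cleaned`, none of which come from a library.

```python
import numpy as np
from scipy import stats

data = np.array([8, 9, 10, 10, 22, 10, 11])
n = len(data)
alpha = 0.05  # assumed significance level

# Absolute sample z-scores (ddof=1 gives the sample standard deviation)
z_scores = np.abs(stats.zscore(data, ddof=1))

# Two-sided Grubbs' critical value derived from the t-distribution
t_crit = stats.t.isf(alpha / (2 * n), n - 2)
g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Keep only the points whose z-score does not exceed the critical value
mask = z_scores <= g_crit
cleaned = data[mask]

print(cleaned.tolist())  # [8, 9, 10, 10, 10, 11]
```

Masking keeps the original array untouched, so the flagged values remain available for inspection.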

Method 3: Direct implementation of Grubbs’ Test

A direct implementation involves calculating the Grubbs’ test statistic and comparing it with the critical value from Grubbs’ test distribution. This requires an understanding of the statistical formula and the ability to translate it into Python code.
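For reference, the two-sided test statistic and its critical value are usually stated as follows. The notation is chosen for this sketch: \bar{x} is the sample mean, s the sample standard deviation, N the sample size, and t_{\alpha/(2N), N-2} the upper critical value of the t-distribution with N - 2 degrees of freedom.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
G_{\text{crit}} = \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N-2+t_{\alpha/(2N),\,N-2}^{2}}}
```

The hypothesis that the data contain no outlier is rejected when G exceeds G_crit.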

Here’s an example:

import numpy as np
from scipy.stats import t

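data = np.array([8, 9, 10, 10, 22, 10, 11])
alpha = 0.05  # significance level for the two-sided test
mean = np.mean(data)
std_dev = np.std(data, ddof=1)  # sample standard deviation
N = len(data)

# Test statistic: largest absolute deviation in units of the sample standard deviation
G_calculated = np.max(np.abs(data - mean)) / std_dev
# Critical value built from the upper alpha / (2N) quantile of the t-distribution
t_critical = t.isf(alpha / (2 * N), N - 2)
G_critical = ((N - 1) / np.sqrt(N)) * np.sqrt(t_critical**2 / (N - 2 + t_critical**2))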

if G_calculated > G_critical:
    print("Outlier detected")
else:
    print("No outlier detected")

Output: Outlier detected

This snippet calculates the Grubbs’ test statistic and compares it against the critical value, which is derived from Student’s t-distribution. It reports whether the most extreme value in the data is an outlier.
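A single pass only answers whether the most extreme point is an outlier. One common way to remove several outliers is to repeat the test after each removal. The sketch below illustrates that idea; the helper names `grubbs_critical` and `remove_outliers_grubbs` are hypothetical, and the two-sided test at alpha = 0.05 is an assumption.

```python
import numpy as np
from scipy import stats


def grubbs_critical(n, alpha=0.05):
    """Two-sided Grubbs' critical value for a sample of size n."""
    t_crit = stats.t.isf(alpha / (2 * n), n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def remove_outliers_grubbs(values, alpha=0.05):
    """Repeatedly drop the most extreme point while Grubbs' test rejects."""
    x = np.asarray(values, dtype=float)
    # The test needs at least three points (n - 2 degrees of freedom)
    while x.size >= 3:
        deviations = np.abs(x - x.mean())
        std = x.std(ddof=1)
        if std == 0:
            break  # all remaining values are identical
        idx = int(np.argmax(deviations))
        if deviations[idx] / std <= grubbs_critical(x.size, alpha):
            break  # the most extreme point is not significant, so stop
        x = np.delete(x, idx)
    return x


cleaned = remove_outliers_grubbs([8, 9, 10, 10, 22, 10, 11])
print(cleaned.tolist())  # [8.0, 9.0, 10.0, 10.0, 10.0, 11.0]
```

Each round recomputes the mean, standard deviation, and critical value for the reduced sample, because all three depend on the current sample size.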

Method 4: Using the pyod library

PyOD is a scalable Python toolkit for detecting outliers in multivariate data. It integrates with scikit-learn and provides many different outlier detection algorithms, although an implementation of Grubbs’ test is not among them.

Here’s an example:

# At the time of writing, PyOD does not provide a direct implementation of Grubbs' test.
# Users can leverage PyOD's other anomaly detection methods as alternatives to Grubbs' Test.

Output: N/A

While PyOD does not offer a direct implementation of Grubbs’ test, users can explore the outlier detection algorithms provided by this library as an alternative. Please consult the PyOD documentation for the most up-to-date information.

Bonus One-Liner Method 5: Using a lambda function with filter

If you want a quick and dirty approximation of Grubbs’ test that needs nothing beyond NumPy, you can combine the built-in filter function with a lambda function, assuming you have already calculated the Grubbs’ critical value for your sample size.

Here’s an example:

import numpy as np

G_critical = 2.020  # Two-sided critical value for N = 7 at alpha = 0.05
data = [8, 9, 10, 10, 22, 10, 11]
mean, std_dev = np.mean(data), np.std(data, ddof=1)
filtered_data = list(filter(lambda x: abs(x - mean) / std_dev < G_critical, data))

print(filtered_data)

Output: [8, 9, 10, 10, 10, 11]

Passing a lambda function to filter gives a compact approximation of Grubbs’ test: it drops every data point whose standardized deviation exceeds the critical value. Unlike the formal test, which examines one extreme value at a time, this removes all such points in a single pass.
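The same filter can be written as a list comprehension using only the standard library. This is a sketch rather than a full test: the constant `G_CRITICAL` is an assumed two-sided critical value for N = 7 at alpha = 0.05, taken from standard Grubbs’ tables, and it has to be replaced whenever the sample size or significance level changes.

```python
import statistics

# Assumed two-sided critical value for N = 7 at alpha = 0.05;
# it must match the sample size of the data being tested
G_CRITICAL = 2.020

data = [8, 9, 10, 10, 22, 10, 11]
mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)

# Keep the points whose standardized deviation stays below the critical value
filtered_data = [x for x in data if abs(x - mean) / std_dev < G_CRITICAL]

print(filtered_data)  # [8, 9, 10, 10, 10, 11]
```

Computing the mean and standard deviation once, outside the comprehension, avoids recalculating them for every element.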

Summary/Discussion

  • Method 1: Using the outliers library. Easy to implement. Requires installation of an external library.
  • Method 2: Using the scipy library. Involves manual computation and interpretation of results. Highly customizable.
  • Method 3: Direct implementation of Grubbs’ Test. Provides an in-depth understanding of the test. More complex and prone to errors.
  • Method 4: Using the pyod library. Offers various algorithms for outlier detection. Does not directly implement Grubbs’ test.
  • Bonus Method 5: A one-liner using a lambda function. Quick solution. Not as robust or flexible as other methods.