5 Pythonic Ways to Detect Outliers in One-Dimensional Observation Data Using Matplotlib

💡 Problem Formulation: Detecting outliers in a dataset is a critical step in data pre-processing that can greatly influence the performance of statistical analyses or machine learning models. Given a one-dimensional array of numerical data, we aim to identify values that significantly deviate from the majority of the data distribution. The desired output is a visualization indicating the location of outliers in the context of the broader dataset. This article elucidates the Pythonic methodologies for outlier detection using the Matplotlib library.

Method 1: Boxplot Visualization

The boxplot is a standardized way of displaying a dataset based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. For outlier detection, any data point that lies more than 1.5 times the interquartile range (IQR = Q3 - Q1) below Q1 or above Q3 is traditionally considered an outlier. Matplotlib applies this rule by default (whis=1.5): the whiskers extend to the most extreme data points inside those limits, and anything beyond them is drawn as an individual point.

Here’s an example:

import matplotlib.pyplot as plt

data = [12, 2, 7, 4, 26, 6, 8, 100, 5, 9]

plt.boxplot(data, vert=False)
plt.title('Boxplot for Outlier Detection')
plt.show()

The output is a horizontal boxplot with the bulk of data represented by a box, the median by a line within the box, and potential outliers as individual points beyond the whiskers.

This plot clearly highlights the outliers. Here Q1 is 5.25 and Q3 is 11.25, so the IQR is 6 and the upper limit is 11.25 + 1.5 × 6 = 20.25. Both 26 and 100 lie beyond the upper whisker, so both are statistical outliers in this dataset according to the IQR method.
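The limits that the boxplot draws can also be computed directly, which is handy when the flagged values are needed as data rather than as a picture. The following is a minimal sketch, assuming NumPy's default linear interpolation for the quartiles; the helper name iqr_outliers and its parameter k are illustrative and not part of Matplotlib.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # A point is flagged when it lies strictly outside the fences.
    mask = (values < lower_fence) | (values > upper_fence)
    return lower_fence, upper_fence, values[mask]

data = [12, 2, 7, 4, 26, 6, 8, 100, 5, 9]
lower_fence, upper_fence, iqr_flagged = iqr_outliers(data)
print("fences:", lower_fence, upper_fence)
print("outliers:", iqr_flagged)
```

Raising k (for example to 3.0) widens the fences and flags only more extreme points.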

Method 2: Z-score Detection

The Z-score method measures how many standard deviations an observation lies from the mean. Observations whose absolute Z-score exceeds 3 are conventionally considered outliers.

Here’s an example:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 2, 7, 4, 26, 6, 8, 100, 5, 9])
mean, std = np.mean(data), np.std(data)

z_scores = (data - mean) / std
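is_outlier = np.abs(z_scores) > 3  # boolean mask of flagged points
outliers = data[is_outlier]
plt.plot(data, 'bo')
# np.flatnonzero gives a 1-D index array, so x and y always have matching shapes
plt.plot(np.flatnonzero(is_outlier), outliers, 'ro')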
plt.title('Outlier Detection using Z-score')
plt.show()

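The output displays a scatter plot in which any data points considered to be outliers are marked in red.

With this code, any data point whose Z-score is greater than 3 or less than -3 is plotted as a red dot. In our example, however, nothing is flagged: the value 100 has a Z-score of about 2.92, just under the threshold, because the outlier itself inflates both the mean and the standard deviation. With only 10 observations and NumPy's default population standard deviation, no absolute Z-score can exceed the square root of n - 1, which is exactly 3 here, so a lower threshold or a larger sample is needed before this method flags anything.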
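The Z-scores for the sample data can be inspected numerically to see how close the largest value comes to the cutoff. This is a small sketch under the same assumptions as the example above (population standard deviation, ddof=0); the alternative threshold of 2.5 is only an illustrative choice, not a recommendation.

```python
import numpy as np

data = np.array([12, 2, 7, 4, 26, 6, 8, 100, 5, 9], dtype=float)

# Population standard deviation (ddof=0), which is np.std's default.
z_scores = (data - data.mean()) / data.std()
max_abs_z = np.abs(z_scores).max()

# For n observations and ddof=0, no |z| can exceed sqrt(n - 1).
z_bound = np.sqrt(len(data) - 1)

flagged_at_3 = data[np.abs(z_scores) > 3]
flagged_at_2_5 = data[np.abs(z_scores) > 2.5]  # illustrative lower threshold

print("largest |z|:", round(max_abs_z, 2), "bound:", z_bound)
print("flagged at 3:", flagged_at_3)
print("flagged at 2.5:", flagged_at_2_5)
```

On small samples, the choice of threshold therefore matters more than the convention of 3 suggests.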

Method 3: Modified Z-score Detection

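The modified Z-score method is similar to the Z-score but replaces the mean and standard deviation with the median and the median absolute deviation (MAD), both of which are far less affected by outliers in the dataset. The constant 0.6745 rescales the MAD so that, for normally distributed data, the result is comparable to an ordinary Z-score. A data point whose absolute modified Z-score exceeds 3.5 is usually considered an outlier.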

Here’s an example:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 2, 7, 4, 26, 6, 8, 100, 5, 9])
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z_scores = 0.6745 * (data - median) / mad
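is_outlier = np.abs(modified_z_scores) > 3.5  # boolean mask of flagged points
outliers = data[is_outlier]
plt.plot(data, 'bo')
# np.flatnonzero gives a 1-D index array, so x and y always have matching shapes
plt.plot(np.flatnonzero(is_outlier), outliers, 'ro')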
plt.title('Outlier Detection using Modified Z-score')
plt.show()

The output shows a scatter plot with the potential outliers marked in red.

Because the median and MAD barely move when a few extreme values are present, this method is more robust than the ordinary Z-score. Here the median is 7.5 and the MAD is 3, so both 26 (modified Z-score of about 4.16) and 100 (about 20.8) are marked as outliers.
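One practical caveat is that the MAD is zero whenever more than half of the values are identical, which makes the division undefined. The sketch below wraps the calculation in a helper that guards against that case; the function name modified_z_scores and the choice to raise a ValueError are assumptions made for illustration.

```python
import numpy as np

def modified_z_scores(values):
    """Return modified Z-scores based on the median and MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # Happens when more than half of the values equal the median.
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return 0.6745 * (values - median) / mad

data = np.array([12, 2, 7, 4, 26, 6, 8, 100, 5, 9])
scores = modified_z_scores(data)
mz_flagged = data[np.abs(scores) > 3.5]
print("modified Z-scores:", np.round(scores, 2))
print("outliers:", mz_flagged)
```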

Method 4: Scatter Plot Visualization for Outlier Investigation

A simple scatter plot can provide a visual indication of outliers by plotting the individual data points. While not providing a statistical test, it allows for a quick visual analysis.

Here’s an example:

import matplotlib.pyplot as plt

data = [12, 2, 7, 4, 26, 6, 8, 100, 5, 9]
plt.plot(data, 'bo')
plt.title('Scatter Plot for Outlier Detection')
plt.show()

The output is a scatter plot where each point represents a value in the dataset, with potential outliers visible as points isolated from the cluster.

This code creates a scatter plot of all the points in the dataset, which may help to quickly identify outliers by sight, though it does not give a precise indication of statistical outlierness.

Bonus One-Liner Method 5: NumPy Percentile-based Outlier Detection

Outliers can also be detected by checking whether values fall outside the range bounded by a chosen pair of percentiles. For instance, values below the 1st percentile or above the 99th percentile might be considered outliers.

Here’s an example:

import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 2, 7, 4, 26, 6, 8, 100, 5, 9])
lower_bound, upper_bound = np.percentile(data, [1, 99])
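is_outlier = (data < lower_bound) | (data > upper_bound)  # boolean mask
outliers = data[is_outlier]
plt.plot(data, 'bo')
# np.flatnonzero gives a 1-D index array, so x and y always have matching shapes
plt.plot(np.flatnonzero(is_outlier), outliers, 'ro')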
plt.title('Percentile-based Outlier Detection')
plt.show()

The output is a scatter plot that marks the outliers in red, detected based on the percentile range.

The percentile bounds are computed in a single NumPy call, and any points outside them are plotted in red, signifying that they are considered outliers by this criterion. For this sample the bounds are about 2.18 and 93.34, so the values 2 and 100 are flagged.
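Because the cutoffs are taken from the data itself, this rule flags the extremes of any sample, whether or not they are unusual. The sketch below illustrates this by applying the same rule to the sample data and to an evenly spaced array with no obvious outliers; the helper name percentile_outliers and its parameters are hypothetical.

```python
import numpy as np

def percentile_outliers(values, lower_pct=1, upper_pct=99):
    """Return values strictly outside the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    lower_bound, upper_bound = np.percentile(values, [lower_pct, upper_pct])
    mask = (values < lower_bound) | (values > upper_bound)
    return values[mask]

sample = [12, 2, 7, 4, 26, 6, 8, 100, 5, 9]
evenly_spaced = np.arange(1, 11)  # 1, 2, ..., 10: no obvious outliers

sample_flagged = percentile_outliers(sample)
even_flagged = percentile_outliers(evenly_spaced)
print("sample:", sample_flagged)
print("evenly spaced:", even_flagged)
```

In both cases the minimum and maximum are flagged, so the result is best read as "the most extreme values" rather than as evidence that those values are anomalous.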

Summary/Discussion

Method 1: Boxplot Visualization. Strengths: Provides a clear visual indication of statistically-defined outliers. Weaknesses: The arbitrary selection of 1.5 times the IQR might not suit all datasets.
Method 2: Z-score Detection. Strengths: Can identify outliers based on deviation from the mean. Weaknesses: Outliers inflate the mean and the standard deviation and can thereby mask themselves, especially in small samples, as the example above shows.
Method 3: Modified Z-score Detection. Strengths: More robust to outliers than traditional Z-score due to medians. Weaknesses: Still requires the setting of a threshold, which may be somewhat arbitrary.
Method 4: Scatter Plot Visualization. Strengths: Simple and straightforward visualization of all data points for outlier identification. Weaknesses: Lacks statistical rigor for outlier determination and is subjective.
Bonus Method 5: NumPy Percentile-based Detection. Strengths: Establishes outliers based on the empirical distribution of data. Weaknesses: Percentile cutoffs are arbitrary, flag the extremes of any dataset even when nothing is unusual, and may not capture context-specific nuances.