Can you spot the outliers in the following sequence: 000000001000000001? Detecting outliers quickly can be mission-critical in fields such as the military, air transport, and energy production.
This article shows you the most basic outlier detection algorithm: if an observed value deviates from the mean by more than the standard deviation, it is considered an outlier.
The NumPy Basics of our Simple Outlier Detection Algorithm
In order to solve the outlier detection problem, let us first study a few basics required to understand the one-liner solution at the end of this article.
First, let’s study what exactly an outlier is. In this article, we make the basic assumption that all observed data is normally distributed around a mean value. For example, consider the following sequence:
[ 8.78087409 10.95890859 8.90183201 8.42516116 9.26643393 12.52747974 9.70413087 10.09101284 9.90002825 10.15149208 9.42468412 11.36732294 9.5603904 9.80945055 10.15792838 10.13521324 11.0435137 10.06329581 ... 10.74304416 10.47904781]
If you plot this sequence, you’ll get the following figure:
The sequence seems to resemble a normal distribution with a mean value of 10 and a standard deviation of 1. The mean is the average of all sequence values. The standard deviation measures how far the values typically spread around the mean: for normally distributed data, approximately 68% of all sample values lie within one standard deviation of the mean. In the following, we simply assume that any observed value outside the interval of one standard deviation around the mean is an outlier.
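The 68% figure can be checked empirically. The following is a minimal sketch, assuming a fixed random seed (42, chosen arbitrarily here so the run is reproducible) and using the sample's own mean and standard deviation rather than the theoretical values of 10 and 1:

```python
import numpy as np

# Assumption: a seeded generator, used only to make the run reproducible
rng = np.random.default_rng(42)
sequence = rng.normal(10.0, 1.0, 500)

mean, std = sequence.mean(), sequence.std()

# True where a value lies within one standard deviation of the mean
within = np.abs(sequence - mean) <= std

fraction_within = within.mean()
num_outliers = int((~within).sum())
print(f"{fraction_within:.0%} within, {num_outliers} flagged as outliers")
```

The fraction should land near 0.68, which also means that this simple rule flags roughly a third of normally distributed values as outliers.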
Before we move on, let’s quickly explore the simple code snippet I used to generate the plot. Can you find the location in the code where I defined the mean and standard deviation?
    import numpy as np
    import matplotlib.pyplot as plt

    sequence = np.random.normal(10.0, 1.0, 500)
    print(sequence)

    plt.xkcd()
    plt.hist(sequence)
    plt.annotate(r"$\omega_1=9$", (9, 70))
    plt.annotate(r"$\omega_2=11$", (11, 70))
    plt.annotate(r"$\mu=10$", (10, 90))
    plt.savefig("plot.jpg")
    plt.show()
Second, the following NumPy function creates a new NumPy array with each negative value made positive (absolute values):
    import numpy as np

    a = np.array([1, -1, 2, -2])

    print(a)
    # [ 1 -1  2 -2]

    print(np.abs(a))
    # [1 1 2 2]
You can see that the function “np.abs()” simply converts the negative values in a NumPy array into their positive counterparts.
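The absolute value is what lets us measure how far a value lies from the mean regardless of direction. As a small sketch, here is the rule applied to the sequence from the introduction (the variable names are chosen for illustration only):

```python
import numpy as np

# The sequence from the introduction, one digit per array element
sequence = np.array([int(c) for c in "000000001000000001"])

# Distance of every value from the mean, regardless of direction
distance = np.abs(sequence - sequence.mean())

# A value is an outlier if its distance exceeds the standard deviation
is_outlier = distance > sequence.std()

print(np.flatnonzero(is_outlier))
# [ 8 17]
```

The two ones at indices 8 and 17 are flagged, while the zeros stay within one standard deviation of the mean.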
Third, the following NumPy function performs an element-wise “logical and” operation on two Boolean arrays:
    import numpy as np

    a = np.array([True, True, True, False])
    b = np.array([False, True, True, False])

    print(np.logical_and(a, b))
    # [False  True  True False]
Each element at index i of array a is combined with element i of array b using the logical and operation (which only returns “True” if both operands are already “True”).
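For Boolean arrays, the same result can also be written with the & operator or with element-wise multiplication, which is the form the one-liner in the next section relies on. A brief sketch comparing the three spellings:

```python
import numpy as np

a = np.array([True, True, True, False])
b = np.array([False, True, True, False])

print(np.logical_and(a, b))  # [False  True  True False]
print(a & b)                 # [False  True  True False]
print(a * b)                 # [False  True  True False]
```

Multiplying two Boolean arrays keeps the Boolean data type, so the product can be used directly as a mask for indexing.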
With this information, you are now equipped to fully understand the following one-liner code snippet.
Detect Outliers in Website Analytics (One-Liner)
The one-liner examines the following problem: “Find all outlier days whose statistics (columns) deviate by more than one standard deviation from their mean.”
    ## Dependencies
    import numpy as np

    ## Website analytics data:
    ## (row = day), (col = users, bounce, duration)
    a = np.array([[815, 70, 115],
                  [767, 80, 50],
                  [912, 74, 77],
                  [554, 88, 70],
                  [1008, 65, 128]])
    mean, stdev = np.mean(a, axis=0), np.std(a, axis=0)
    # mean:  [811.2  75.4  88. ]
    # stdev: [152.97764543   7.98999374  29.04479299]

    ## One-liner
    ## (each column is compared against its own mean and standard deviation)
    outliers = ((np.abs(a[:,0] - mean[0]) > stdev[0])
                * (np.abs(a[:,1] - mean[1]) > stdev[1])
                * (np.abs(a[:,2] - mean[2]) > stdev[2]))

    ## Result
    print(a[outliers])
Can you guess the output of this code snippet?
Discussion of the Results
Imagine you are the administrator of an online application and you need to analyze the website traffic on a continuous basis. As the administrator of the Python web application Finxter.com, this is one of my daily activities.
The data set consists of multiple rows and columns. Each row holds the statistics of one day in three columns (daily active users, bounce rate, and average session duration in seconds).
For each column (statistically tracked metric), we calculate the mean value and the standard deviation. For example, the mean value of the “daily active users” column is 811.2 and its standard deviation is 152.97. Note that we use the axis argument to calculate the mean and standard deviation of each column separately.
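To make the role of the axis argument concrete, here is a short sketch on the same data; the names col_mean, row_mean, and all_mean are introduced only for this illustration:

```python
import numpy as np

a = np.array([[815, 70, 115],
              [767, 80, 50],
              [912, 74, 77],
              [554, 88, 70],
              [1008, 65, 128]])

col_mean = np.mean(a, axis=0)  # one value per column (metric)
row_mean = np.mean(a, axis=1)  # one value per row (day)
all_mean = np.mean(a)          # a single value over all 15 entries

print(col_mean.shape, row_mean.shape)
# (3,) (5,)
```

With axis=0 the rows are collapsed, so we get one mean per metric, which is what the outlier check needs. Averaging along axis=1 would mix users, bounce rate, and duration into one number per day, which is not meaningful here.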
Recall that our goal is to detect outliers. But how do we do this for our website analytics? The code simply assumes that every observed value that does not fall within one standard deviation of the mean of its column is an outlier. It’s that simple.
For example, the average value of the “daily active users” column is 811.2 and its standard deviation is 152.97. Thus, every observed value for the “daily active users” metric that is smaller than 811.2-152.97=658.23 or larger than 811.2+152.97=964.17 is considered an outlier for this column.
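The interval check for this single column can be sketched as follows. The array users simply repeats the first column of the data set, and the names lower and upper are chosen for illustration:

```python
import numpy as np

# The "daily active users" column of the data set
users = np.array([815, 767, 912, 554, 1008])

mean, stdev = users.mean(), users.std()
lower, upper = mean - stdev, mean + stdev

# Keep only the values that fall outside the interval
print(users[(users < lower) | (users > upper)])
# [ 554 1008]
```

Only the days with 554 and 1008 users fall outside the interval, so they are the outliers for this metric.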
However, we consider a day to be an outlier only if all three observed columns are outliers. The one-liner achieves this by multiplying the three Boolean arrays: for Boolean NumPy arrays, the * operator behaves like the element-wise “logical and” operation (np.logical_and).
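As a cross-check, the same rule can be written more compactly with broadcasting. This is an alternative sketch rather than the one-liner itself; the names mask and outlier_days are chosen for illustration:

```python
import numpy as np

a = np.array([[815, 70, 115],
              [767, 80, 50],
              [912, 74, 77],
              [554, 88, 70],
              [1008, 65, 128]])

mean, stdev = np.mean(a, axis=0), np.std(a, axis=0)

# Broadcasting compares every column against its own mean and stdev
mask = np.abs(a - mean) > stdev  # Boolean array of shape (5, 3)

# A day counts only if all three metrics are outliers
outlier_days = mask.all(axis=1)

print(a[outlier_days])
# [[1008   65  128]]
```

Only the last day, with 1008 users, a bounce rate of 65, and a session duration of 128 seconds, is an outlier in all three metrics at once.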
Where to go from here?
This article gave you a practical example of simple outlier detection with NumPy.
I have written a free NumPy tutorial on my blog. Check it out!