Can you spot the outliers in the following sequence: 000000001000000001? Detecting outliers quickly can be mission-critical for many applications in the military, air transport, and energy production.
This article shows you the most basic outlier detection algorithm: if an observed value deviates from the mean by more than the standard deviation, it is considered an outlier.
What is an Outlier Anyways?
First, let’s study what exactly an outlier is. In this article, we make the basic assumption that all observed data is normally distributed around a mean value. For example, consider the following sequence:
[ 8.78087409 10.95890859 8.90183201 8.42516116 9.26643393 12.52747974 9.70413087 10.09101284 9.90002825 10.15149208 9.42468412 11.36732294 9.5603904 9.80945055 10.15792838 10.13521324 11.0435137 10.06329581 ... 10.74304416 10.47904781]
If you plot this sequence, you’ll get the following figure:
Here’s the code used to generate this plot:
    import numpy as np
    import matplotlib.pyplot as plt

    sequence = np.random.normal(10.0, 1.0, 500)
    print(sequence)

    plt.xkcd()
    plt.hist(sequence)
    plt.annotate(r"$\omega_1=9$", (9, 70))
    plt.annotate(r"$\omega_2=11$", (11, 70))
    plt.annotate(r"$\mu=10$", (10, 90))
    plt.savefig("plot.jpg")
    plt.show()
The sequence seems to resemble a normal distribution with a mean value of 10 and a standard deviation of 1.
The mean is the average value of all sequence values.
The standard deviation measures how far the values typically spread around the mean. For normally distributed data, approximately 68% of all sample values lie within one standard deviation of the mean.
In the following, we simply assume that any observed value that is outside of the interval marked by the standard deviation around the mean is an outlier.
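As a minimal sketch of this rule on a one-dimensional sequence, the snippet below flags every value that lies more than one standard deviation away from the mean. The seeded generator and the variable names `is_outlier` and `outliers` are assumptions made for illustration; they are not part of the original plotting code.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
sequence = rng.normal(10.0, 1.0, 500)

mean = np.mean(sequence)
std = np.std(sequence)

# A value is flagged if its distance from the mean exceeds one standard deviation
is_outlier = np.abs(sequence - mean) > std
outliers = sequence[is_outlier]

print(f"mean={mean:.2f}, std={std:.2f}")
print(f"{outliers.size} of {sequence.size} values flagged as outliers")
```

Because roughly 68% of normally distributed values fall inside the interval, this rule flags close to a third of the data. It is a deliberately simple criterion rather than a strict one.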
Method 1: Detect Outliers in Website Analytics (One-Liner)
Imagine you are the administrator of an online application and you need to analyze the website traffic on a continuous basis. As the administrator of the Python web application Finxter.com, this is one of my daily activities.
This one-liner examines the following problem: “Find all outlier days whose statistics (columns) deviate by more than the standard deviation from their mean statistics.”
    ## Dependencies
    import numpy as np

    ## Website analytics data:
    ## (row = day), (col = users, bounce, duration)
    a = np.array([[815, 70, 115],
                  [767, 80, 50],
                  [912, 74, 77],
                  [554, 88, 70],
                  [1008, 65, 128]])

    mean, stdev = np.mean(a, axis=0), np.std(a, axis=0)
    # Mean: [811.2  75.4  88. ]
    # Std (rounded): [152.98   7.99  29.04]

    ## Find Outliers
    outliers = ((np.abs(a[:,0] - mean[0]) > stdev[0])
                * (np.abs(a[:,1] - mean[1]) > stdev[1])
                * (np.abs(a[:,2] - mean[2]) > stdev[2]))

    ## Result
    print(a[outliers])
    # [[1008   65  128]]
The data set consists of multiple rows and columns. Each row comprises daily statistics consisting of three columns (daily active users, bounce rate, and average session duration in seconds).
For each column (statistically tracked metric), we calculate the mean value and the standard deviation. For example, the mean value of the “daily active users” column is 811.2 and its standard deviation is 152.97. Note that we use the axis argument to calculate the mean and standard deviation of each column separately.
Recall that our goal is to detect outliers. But how to do this for our website analytics? The code simply assumes that every observed value that does not fall within the standard deviation around the mean of each specific column is an outlier. It’s that simple.
For example, the average value of the “daily active users” column is 811.2 and its standard deviation is 152.97. Thus, every observed value for the “daily active users” metric that is smaller than 811.2-152.97=658.23 or larger than 811.2+152.97=964.17 is considered an outlier for this column.
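The interval bounds for each column can also be computed directly. The following sketch reuses the analytics array from the one-liner; the names `lower`, `upper`, and `users_outside` are introduced here for illustration only.

```python
import numpy as np

a = np.array([[815, 70, 115],
              [767, 80, 50],
              [912, 74, 77],
              [554, 88, 70],
              [1008, 65, 128]])

mean, stdev = np.mean(a, axis=0), np.std(a, axis=0)

# One lower and one upper bound per column
lower, upper = mean - stdev, mean + stdev
print(np.round(lower, 2))
print(np.round(upper, 2))

# Days whose "daily active users" value falls outside the interval
users = a[:, 0]
users_outside = (users < lower[0]) | (users > upper[0])
print(users[users_outside])
# [ 554 1008]
```

Only the days with 554 and 1008 users fall outside the interval for this single column.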
However, we consider a day to be an outlier only if all three observed columns are outliers. It’s easy to achieve this by combining the three Boolean arrays using the “logical and” operation of NumPy. The logical and can be replaced with a simple multiplication scheme as True is represented by an integer 1 and False by an integer 0.
We use np.abs() in the code snippet, which simply converts the negative values in a NumPy array into their positive counterparts.
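To illustrate why multiplication works as a “logical and” here, the sketch below builds the Boolean mask in three equivalent ways. The variable names (`deviates`, `by_product`, `by_and`, `by_all`) are assumptions for this example and do not appear in the original one-liner.

```python
import numpy as np

a = np.array([[815, 70, 115],
              [767, 80, 50],
              [912, 74, 77],
              [554, 88, 70],
              [1008, 65, 128]])

mean, stdev = np.mean(a, axis=0), np.std(a, axis=0)

# Broadcasting compares every column against its own mean and standard deviation
deviates = np.abs(a - mean) > stdev  # Boolean array of shape (5, 3)

# Three equivalent ways to require that all columns deviate
by_product = deviates[:, 0] * deviates[:, 1] * deviates[:, 2]
by_and = deviates[:, 0] & deviates[:, 1] & deviates[:, 2]
by_all = np.all(deviates, axis=1)

print(np.array_equal(by_product, by_and), np.array_equal(by_and, by_all))
print(a[by_all])
```

All three masks select the same rows. `np.all(..., axis=1)` has the advantage that it keeps working unchanged if more columns are added.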
This article is based on my book—I’ll show you the next method to detect outliers in a moment.
Check out my new Python book Python One-Liners (Amazon Link).
If you like one-liners, you’ll LOVE the book. It’ll teach you everything there is to know about a single line of Python code. But it’s also an introduction to computer science, data science, machine learning, and algorithms. The universe in a single line of Python!
The book was released in 2020 with the world-class programming book publisher NoStarch Press (San Francisco).
Publisher Link: https://nostarch.com/pythononeliners
Method 2: IQR
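This method, taken from a GitHub code base, uses the interquartile range (IQR) to remove outliers from the data x. The IQR is the distance between the 25th percentile (lower quartile) and the 75th percentile (upper quartile). Any value that lies more than outlierConstant times the IQR below the lower quartile or above the upper quartile is treated as an outlier and removed.
The following code snippet removes outliers using NumPy: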
    import numpy as np

    def removeOutliers(x, outlierConstant):
        a = np.array(x)
        upper_quartile = np.percentile(a, 75)
        lower_quartile = np.percentile(a, 25)
        IQR = (upper_quartile - lower_quartile) * outlierConstant
        quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
        resultList = []
        for y in a.tolist():
            if y >= quartileSet[0] and y <= quartileSet[1]:
                resultList.append(y)
        return resultList
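The same idea can be written without an explicit loop. The following is a sketch, not the code from the referenced repository: the function name `remove_outliers_iqr`, the default constant of 1.5 (a common convention), and the sample data are all assumptions made for illustration.

```python
import numpy as np

def remove_outliers_iqr(x, outlier_constant=1.5):
    """Return the values of x that lie within the scaled IQR fence."""
    a = np.asarray(x)
    lower_quartile, upper_quartile = np.percentile(a, [25, 75])
    # Width of the fence beyond each quartile
    margin = (upper_quartile - lower_quartile) * outlier_constant
    keep = (a >= lower_quartile - margin) & (a <= upper_quartile + margin)
    return a[keep]

# Hypothetical sample data with one obvious outlier
data = [10, 12, 11, 13, 12, 11, 95, 12]
print(remove_outliers_iqr(data))
# [10 12 11 13 12 11 12]
```

Here the quartiles are 11 and 12.25, so the fence with a constant of 1.5 spans 9.125 to 14.125 and only the value 95 is dropped. Unlike the loop version, this sketch returns a NumPy array instead of a list.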
Method 3: Remove Outliers From NumPy Array Using np.mean() and np.std()
This method is based on the useful code snippet provided here.
To remove an outlier from a NumPy array, use these five basic steps:
- Create an array with outliers
- Determine mean and standard deviation
- Normalize array around 0
- Define the maximum number of standard deviations
- Access only non-outliers using Boolean Indexing
    import numpy as np

    # 1. Create an array with outliers
    a = np.array([1, 1, 1, 1, 1, 1, 42, 1, 1])

    # 2. Determine mean and standard deviation
    mean = np.mean(a)
    std_dev = np.std(a)

    # 3. Normalize array around 0
    zero_based = abs(a - mean)

    # 4. Define maximum number of standard deviations
    max_deviations = 2

    # 5. Access only non-outliers using Boolean Indexing
    no_outliers = a[zero_based < max_deviations * std_dev]

    print(no_outliers)
    # [1 1 1 1 1 1 1 1]
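The five steps can be folded into a small helper to see how the choice of `max_deviations` affects the result. The function name `remove_outliers_std` is hypothetical and is used only for this sketch.

```python
import numpy as np

def remove_outliers_std(a, max_deviations=2):
    """Keep values closer to the mean than max_deviations standard deviations."""
    a = np.asarray(a)
    zero_based = np.abs(a - np.mean(a))
    return a[zero_based < max_deviations * np.std(a)]

a = np.array([1, 1, 1, 1, 1, 1, 42, 1, 1])

print(remove_outliers_std(a, max_deviations=2))
# [1 1 1 1 1 1 1 1]

print(remove_outliers_std(a, max_deviations=3))
# [ 1  1  1  1  1  1 42  1  1]
```

With a threshold of three standard deviations, the value 42 survives: the outlier itself inflates the standard deviation to about 12.9, so its distance of about 36.4 from the mean stays below the limit of about 38.7. In a small array, a single extreme value can therefore hide itself if the threshold is set too generously.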