You may have read about the “V”s in Big Data: Volume, Velocity, Variety, Veracity, Value, Volatility.
Variance is yet another important “V” (it measures Volatility of a data set). In practice, variance is a key measure with application domains in financial services, weather forecasting, and image processing. Variance measures how much the data spreads around its average in one- or multi-dimensional space. You’ll see a graphical example in a moment.
In fact, variance is one of the most important properties in machine learning. It captures the patterns of the data in a generalized manner, and machine learning is all about pattern recognition.
Many machine learning algorithms rely on variance in one form or another. For instance, the bias-variance trade-off is a well-known problem in machine learning: sophisticated models risk overfitting the data (high variance), but they represent the training data very accurately (low bias). Simple models, on the other hand, often generalize well (low variance) but do not represent the data accurately (high bias).
The Basics
Variance is a simple statistical property that captures how much the data set spreads from its mean.
Here is an example plot with two data sets: one with low variance, and one with high variance.
The figure exemplifies the stock prices of two companies. The stock price of the tech startup fluctuates heavily around its average. The stock price of the food company is quite stable and fluctuates only in minor ways around the average. In other words, the tech startup has high variance, the food company has low variance.
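The same idea can be sketched with a few numbers. The prices below are invented for illustration and are not the data behind the figure; the variable names are hypothetical as well.

```python
import numpy as np

# Hypothetical closing prices, invented for illustration only.
startup_prices = np.array([10, 30, 5, 35])
food_prices = np.array([19, 21, 20, 20])

# Both series share the same average price ...
print(startup_prices.mean(), food_prices.mean())  # 20.0 20.0

# ... but they spread around it very differently.
print(np.var(startup_prices))  # 162.5
print(np.var(food_prices))     # 0.5
```

Both series have the same mean, so the mean alone cannot tell them apart; the variance can.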
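In mathematical terms, you can calculate the variance var(X) of a set of numerical values X using the following formula:

$$\mathrm{var}(X) = \frac{1}{|X|} \sum_{x \in X} (x - \bar{x})^2$$

Here, $\bar{x}$ is the average of the values in X, and |X| is the number of values.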
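In words, variance is the mean of the squared deviations from the average. The following minimal sketch computes it by hand and compares the result with NumPy. The helper name `variance_by_hand` is made up for illustration; the sample values are the fourth row of the stock data used in the code section.

```python
import numpy as np

def variance_by_hand(values):
    """Population variance: mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

prices = [1, 1, 2, 2]
manual = variance_by_hand(prices)

print(manual)                              # 0.25
print(np.isclose(manual, np.var(prices)))  # True
```

Note that `np.var()` divides by the number of values by default (`ddof=0`). If you need the sample variance, which divides by one less than the number of values, pass `ddof=1`.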
The Code
As you get older, it’s usually good advice to reduce the overall risk of your investment portfolio. According to traditional investment advice, you should consider stocks with lower variance as less risky investment vehicles: you can lose less money when investing in the large company than in the small tech startup. (Let’s not debate the validity of this advice here.)
The goal of this one-liner is to identify the stock in your portfolio that has minimal variance. By investing more money into this stock, you can expect a lower overall variance of your portfolio.
```python
## Dependencies
import numpy as np

## Data (rows: stocks / cols: stock prices)
X = np.array([[25,27,29,30],
              [1,5,3,2],
              [12,11,8,3],
              [1,1,2,2],
              [2,6,2,2]])

## One-liner
# Find the stock with smallest variance
min_row = min([(i,np.var(X[i,:])) for i in range(len(X))], key=lambda x: x[1])

## Result & puzzle
print("Row with minimum variance: " + str(min_row[0]))
print("Variance: " + str(min_row[1]))
```
Puzzle: What’s the output of this code snippet?
The Results
As usual, we first define the data on which we run the one-liner. The NumPy array X contains five rows (one row per stock in your portfolio) with four values per row (stock prices).
The goal is to find the id and variance of the stock with minimal variance. Hence, the outermost function of the one-liner is the min() function. We execute the min() function on a sequence of tuples (a,b) where the first tuple value a is the row index (stock index) and the second tuple value b is the variance of the row.
You may ask: what’s the minimal value of a sequence of tuples? Of course, we need to properly define this operation before using it. To this end, we use the key argument of the min() function. The key argument takes a function that, given an element of the sequence, returns a comparable value. Again, our sequence values are tuples, and we need to find the tuple with minimal variance (the second tuple value). Hence, we return the second tuple value x[1] as the basis for comparison. In other words, the tuple with the minimal second tuple value wins. This is the tuple with minimal variance.
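To see what the key argument changes, here is a small sketch with made-up tuples. The values are hypothetical and not taken from the stock data. It also shows `operator.itemgetter` from the standard library as one possible alternative to the lambda.

```python
from operator import itemgetter

# Hypothetical (row index, variance) tuples, invented for illustration.
pairs = [(0, 3.5), (1, 0.8), (2, 2.1)]

# Without a key, tuples are compared element by element, so the
# smallest row index wins, which is not what we want here.
print(min(pairs))  # (0, 3.5)

# With a key, min() compares only the second tuple value (the variance).
by_lambda = min(pairs, key=lambda x: x[1])
print(by_lambda)  # (1, 0.8)

# itemgetter(1) builds the same "take element 1" function.
by_itemgetter = min(pairs, key=itemgetter(1))
print(by_itemgetter == by_lambda)  # True
```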
Let’s have a look at how we create the sequence of tuple values.
We use list comprehension to create a tuple for every row index (stock). The first tuple element is simply the row index i. The second tuple element is the variance of this row. We use the NumPy var() function in combination with slicing to calculate the row variance.
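If the list comprehension feels dense, it may help to unroll it into an ordinary loop. The sketch below is an equivalent long form, not part of the one-liner itself; the names `row` and `pairs` are introduced only for illustration, and `float()` is used just to keep the printed values plain.

```python
import numpy as np

X = np.array([[25, 27, 29, 30],
              [1, 5, 3, 2],
              [12, 11, 8, 3],
              [1, 1, 2, 2],
              [2, 6, 2, 2]])

# Slicing: X[i, :] selects all columns of row i.
print(X[3, :])  # [1 1 2 2]

# The list comprehension, written as an explicit loop.
pairs = []
for i in range(len(X)):
    row = X[i, :]
    pairs.append((i, float(np.var(row))))

print(pairs)
# [(0, 3.6875), (1, 2.1875), (2, 12.25), (3, 0.25), (4, 3.0)]
```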
Let’s get back to the code. The result of the one-liner is:
```python
"""
Row with minimum variance: 3
Variance: 0.25
"""
```
I would like to add that there is an alternative way of solving this problem. If this article wasn’t about Python one-liners, I would prefer the following solution instead of the one-liner:
```python
var = np.var(X, axis=1)
min_row = (np.where(var==min(var))[0][0], min(var))
```
In the first line, we calculate the variance of the NumPy array X along axis 1 (axis=1), that is, across the columns of each row, which yields one variance value per row. In the second line, we create the tuple. The first tuple value is the index of the minimal element in the variance array. The second tuple value is the minimal element in the variance array.
This solution is more readable and makes use of existing implementations that are usually more efficient.
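As a further variation, not taken from the original solution, NumPy's `np.argmin()` returns the index of the smallest element directly, which avoids the `np.where()` lookup. A minimal sketch on the same data follows; the names `variances` and `best` are chosen here for illustration.

```python
import numpy as np

X = np.array([[25, 27, 29, 30],
              [1, 5, 3, 2],
              [12, 11, 8, 3],
              [1, 1, 2, 2],
              [2, 6, 2, 2]])

variances = np.var(X, axis=1)     # one variance per row (stock)
best = int(np.argmin(variances))  # index of the smallest variance

print(best, float(variances[best]))  # 3 0.25
```

If several rows tie for the minimum, `np.argmin()` returns the first one, which matches the behavior of taking `[0][0]` from the `np.where()` result.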