You may have read about the ‘V’s in Big Data: Volume, Velocity, Variety, Veracity, Value, Volatility.
Variance is yet another important ‘V’ (it measures the Volatility of a data set). In practice, variance is a key measure with applications in financial services, weather forecasting, and image processing. Variance measures how much the data spreads around its average in one- or multi-dimensional space. You’ll see a graphical example in a moment.
In fact, variance is one of the most important properties in machine learning. It captures the patterns of the data in a generalized manner – and machine learning is all about pattern recognition.
Many machine learning algorithms rely on variance in one form or another. For instance, the bias-variance trade-off is a well-known problem in machine learning: sophisticated machine learning models risk overfitting the data (high variance), but they represent the training data very accurately (low bias). On the other hand, simple models often generalize well (low variance) but do not represent the data accurately (high bias).
Variance is a simple statistical property that captures how much the data set spreads from its mean.
Here is an example plot with two data sets: one with low variance, and one with high variance.
The figure exemplifies the stock prices of two companies. The stock price of the tech startup fluctuates heavily around its average. The stock price of the food company is quite stable and fluctuates only in minor ways around the average. In other words, the tech startup has high variance, the food company has low variance.
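In mathematical terms, you can calculate the variance var(X) of a set X of n numerical values using the following formula: var(X) = (1/n) * Σ (x - x̄)², where the sum runs over all values x in X and x̄ is the average of X.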
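Variance is the average squared distance of the values from their mean. The following is a minimal sketch that computes it step by step and compares the result with NumPy's var() function. The variable names (values, manual_var, numpy_var) are chosen for illustration only, and the sketch assumes the population variance, which divides by n and is what np.var() computes by default (ddof=0).

```python
import numpy as np

# Small example values, for illustration only
values = np.array([1.0, 1.0, 2.0, 2.0])

# Step 1: the mean of the values
mean = values.sum() / len(values)

# Step 2: the average squared deviation from the mean
manual_var = ((values - mean) ** 2).sum() / len(values)

# NumPy's built-in (population variance by default)
numpy_var = np.var(values)

print(manual_var, numpy_var)  # 0.25 0.25
```

If you need the sample variance instead (dividing by n - 1), np.var() accepts ddof=1.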
As you get older, it’s usually good advice to reduce the overall risk of your investment portfolio. According to traditional investment advice, you should consider stocks with lower variance as less risky investment vehicles. You can lose less money when investing in the large company than in the small tech startup. (Let’s not debate the validity of this advice here.)
The goal of this one-liner is to identify the stock in your portfolio that has minimal variance. By investing more money into this stock, you can expect a lower overall variance of your portfolio.
    ## Dependencies
    import numpy as np

    ## Data (rows: stocks / cols: stock prices)
    X = np.array([[25,27,29,30],
                  [1,5,3,2],
                  [12,11,8,3],
                  [1,1,2,2],
                  [2,6,2,2]])

    ## One-liner
    # Find the stock with smallest variance
    min_row = min([(i,np.var(X[i,:])) for i in range(len(X))], key=lambda x: x[1])

    ## Result & puzzle
    print("Row with minimum variance: " + str(min_row[0]))
    print("Variance: " + str(min_row[1]))
What’s the output of this code snippet?
As usual, we first define the data on which we run the one-liner. The NumPy array X contains five rows (one row per stock in your portfolio) with four values per row (stock prices).
The goal is to find the id and variance of the stock with minimal variance. Hence, the outermost function of the one-liner is the min() function. We execute the min function on a sequence of tuples (a,b) where the first tuple value a is the row index (stock index) and the second tuple value b is the variance of the row.
You may ask: what’s the minimal value of a sequence of tuples? Of course, we need to properly define this operation before using it. To this end, we use the key argument of the min() function. The key argument takes a function that maps each sequence element to a comparable value. Again, our sequence elements are tuples, and we need to find the tuple with minimal variance (the second tuple value). Hence, we return the second tuple value x[1] as the basis for comparison. In other words, the tuple with the minimal second tuple value wins. This is the tuple with minimal variance.
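To see what the key argument changes, here is a small sketch with made-up (index, variance) tuples. The list pairs and its values are hypothetical and serve only to illustrate the comparison. Without a key, Python compares tuples element by element, starting with the first value.

```python
# Hypothetical (index, variance) tuples, for illustration only
pairs = [(0, 3.5), (1, 0.25), (2, 2.0)]

# Without key: tuples are compared by their first value first
smallest_default = min(pairs)

# With key: only the second tuple value (the variance) is compared
smallest_by_var = min(pairs, key=lambda x: x[1])

print(smallest_default)  # (0, 3.5)
print(smallest_by_var)   # (1, 0.25)
```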
Let’s look at how we create the sequence of tuple values.
We use list comprehension to create a tuple for each row index (stock). The first tuple element is simply the row index i. The second tuple element is the variance of this row. We use the NumPy var() function in combination with slicing to calculate the row variance.
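To make the intermediate result visible, the following sketch evaluates only the list comprehension on the array X from the example and prints each tuple. The name pairs and the float() conversion are additions for readable output and are not part of the one-liner.

```python
import numpy as np

# Data from the example (rows: stocks / cols: stock prices)
X = np.array([[25, 27, 29, 30],
              [1, 5, 3, 2],
              [12, 11, 8, 3],
              [1, 1, 2, 2],
              [2, 6, 2, 2]])

# One (row index, row variance) tuple per stock
pairs = [(i, float(np.var(X[i, :]))) for i in range(len(X))]

for index, variance in pairs:
    print(index, variance)
# 0 3.6875
# 1 2.1875
# 2 12.25
# 3 0.25
# 4 3.0
```

Row 3 has the smallest second value, which is why min() with key=lambda x: x[1] selects it.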
The result of the one-liner is therefore:
    """
    Row with minimum variance: 3
    Variance: 0.25
    """
I would like to add that there is an alternative way of solving this problem. If this article wasn’t about Python one-liners, I would prefer the following solution instead of the one-liner:
    var = np.var(X, axis=1)
    min_row = (np.where(var==min(var)), min(var))
In the first line, we calculate the variance of each row of the NumPy array X, that is, across the columns (axis=1). In the second line, we create the tuple. The first tuple value holds the index of the minimal element in the variance array (np.where() returns it wrapped in a tuple of index arrays). The second tuple value is the minimal element in the variance array.
This solution is more readable and makes use of existing implementations that are usually more efficient.
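As a variation on this idea, and not the approach used above, np.argmin() returns the position of the smallest element directly as a plain index. The sketch below assumes the same array X as in the example. The names variances and idx are chosen for illustration.

```python
import numpy as np

# Data from the example (rows: stocks / cols: stock prices)
X = np.array([[25, 27, 29, 30],
              [1, 5, 3, 2],
              [12, 11, 8, 3],
              [1, 1, 2, 2],
              [2, 6, 2, 2]])

# Variance of each row (one value per stock)
variances = np.var(X, axis=1)

# Position of the smallest variance as a plain integer
idx = int(np.argmin(variances))
min_row = (idx, float(variances[idx]))

print(min_row)  # (3, 0.25)
```

If several rows share the minimal variance, np.argmin() reports only the first one, whereas np.where() would list all of them.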