This article explains how to calculate basic statistics (average, standard deviation, and variance)
Here’s what you want to achieve:
Extracting basic statistics from matrices (e.g. average, variance, standard deviation) is a critical component for analyzing a wide range of data sets such as financial data, health data, or social media data. With the rise of machine learning and data science, your proficient education of linear algebra operators with NumPy becomes more and more valuable to the marketplace
Here is how you can accomplish this task in NumPy:
import numpy as np x = np.array([[1, 3, 5], [1, 1, 1], [0, 2, 4]]) print(np.average(x, axis=1)) # [3. 1. 2.] print(np.var(x, axis=1)) # [2.66666667 0. 2.66666667] print(np.std(x, axis=1)) # [1.63299316 0. 1.63299316]
NumPy internally represents data using NumPy arrays (np.array). These arrays can have an arbitrary number of dimensions. In the figure above, we show a two-dimensional NumPy array but in practice, the array can have much higher dimensionality. You can quickly identify the dimensionality of a NumPy array by counting the number of opening brackets “[“ when creating the array. (The more formal alternative would be to use the ndim property.)
Each dimension has its own axis identifier. As a rule of thumb: the outermost dimension has the identifier “0”, the second-outermost dimension has the identifier “1”, and so on.
By default, the NumPy average, variance, and standard deviation functions aggregate all the values in a NumPy array to a single value:
Simple Average, Variance, Standard Deviation
What happens if you don’t specify any additional argument apart from the NumPy array on which you want to perform the operation (average, variance, standard deviation)?
import numpy as np x = np.array([[1, 3, 5], [1, 1, 1], [0, 2, 4]]) print(np.average(x)) # 2.0 print(np.var(x)) # 2.4444444444444446 print(np.std(x)) # 1.5634719199411433
For example, the simple average of a NumPy array is calculated as follows:
(1+3+5+1+1+1+0+2+4)/9 = 18/9 = 2.0
Calculating Average, Variance, Standard Deviation Along an Axis
However, sometimes you want to calculate these functions along an axis.
For example, you may work at a large financial corporation and want to calculate the average value of a stock price — given a large matrix of stock prices (rows = different stocks, columns = daily stock prices).
Here is how you can do this by specifying the keyword “axis” as an argument to the average, variance, and standard deviation functions:
import numpy as np ## Stock Price Data: 5 companies # (row=[price_day_1, price_day_2, ...]) x = np.array([[8, 9, 11, 12], [1, 2, 2, 1], [2, 8, 9, 9], [9, 6, 6, 3], [3, 3, 3, 3]]) avg, var, std = np.average(x, axis=1), np.var(x, axis=1), np.std(x, axis=1) print("Averages: " + str(avg)) print("Variances: " + str(var)) print("Standard Deviations: " + str(std)) """ Averages: [10. 1.5 7. 6. 3. ] Variances: [2.5 0.25 8.5 4.5 0. ] Standard Deviations: [1.58113883 0.5 2.91547595 2.12132034 0. ] """
Note that you want to perform these three functions along the axis=1, i.e., this is the axis that is aggregated to a single value. Hence, the resulting NumPy arrays have a reduced dimensionality.
High-dimensional Averaging Along An Axis
Of course, you can also perform this averaging along an axis for high-dimensional NumPy arrays. Conceptually, you’ll always aggregate the axis you specify as an argument.
Here is an example:
import numpy as np x = np.array([[[1,2], [1,1]], [[1,1], [2,1]], [[1,0], [0,0]]]) print(np.average(x, axis=2)) print(np.var(x, axis=2)) print(np.std(x, axis=2)) """ [[1.5 1. ] [1. 1.5] [0.5 0. ]] [[0.25 0. ] [0. 0.25] [0.25 0. ]] [[0.5 0. ] [0. 0.5] [0.5 0. ]] """
Where to go from here?
Solid programming skills are the foundation of your thorough education as a data scientist and machine learning expert. Master Python first!
Join more than 5,000 email subscribers and download your personal Python cheat sheets as high-resolution PDFs. Print them, study them, and keep consulting them daily until you master every bit of Python syntax by heart.