This article explains how to calculate basic statistics such as average, standard deviation, and variance along an axis of a NumPy array.
TL;DR
To average a NumPy array x along an axis, call np.average() with arguments x and the axis identifier. For example, np.average(x, axis=1) averages along axis 1. The outermost dimension has axis identifier "0", the second-outermost dimension has identifier "1". NumPy collapses the identified axis and replaces it with the axis average, which reduces the dimensionality of the resulting array by one.
Graphical Explanation
Here’s what you want to achieve:
Extracting basic statistics such as average, variance, and standard deviation from NumPy arrays and 2D matrices is a critical component of analyzing a wide range of data sets such as financial data, health data, or social media data. With the rise of machine learning and data science, proficiency with NumPy's linear algebra operators becomes more and more valuable in the marketplace.
Code Solution
Here is how you can accomplish this task in NumPy:
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

print(np.average(x, axis=1))
# [3. 1. 2.]

print(np.var(x, axis=1))
# [2.66666667 0.         2.66666667]

print(np.std(x, axis=1))
# [1.63299316 0.         1.63299316]
Slow Explanation
Next, I’ll explain how this works step by step.
NumPy internally represents data using NumPy arrays (np.array). These arrays can have an arbitrary number of dimensions. In the figure above, we show a two-dimensional NumPy array but in practice, the array can have much higher dimensionality. You can quickly identify the dimensionality of a NumPy array by counting the number of consecutive opening brackets "[" at the start of the array literal. (The more formal alternative would be to use the ndim property.)
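As a quick check of the bracket-counting rule, the following sketch compares it with the ndim property. The array names a1, a2, and a3 are made up for illustration.

```python
import numpy as np

# Hypothetical example arrays with one, two, and three leading brackets.
a1 = np.array([1, 2, 3])
a2 = np.array([[1, 2, 3], [4, 5, 6]])
a3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for a in (a1, a2, a3):
    # ndim is the number of dimensions; shape lists the length of each axis.
    print(a.ndim, a.shape)
# 1 (3,)
# 2 (2, 3)
# 3 (2, 2, 2)
```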
Each dimension has its own axis identifier.
Rule of thumb: The outermost dimension has the identifier "0", the second-outermost dimension has the identifier "1", and so on.
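To see the rule of thumb in action, here is a small sketch that reuses the 3x3 array from the code solution and averages it along each axis in turn. The variable names col_avg and row_avg are chosen for illustration only.

```python
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

# axis=0 is the outermost dimension: the rows are collapsed,
# leaving one average per column.
col_avg = np.average(x, axis=0)
print(col_avg)  # [0.66666667 2.         3.33333333]

# axis=1 is the second-outermost dimension: the columns are collapsed,
# leaving one average per row.
row_avg = np.average(x, axis=1)
print(row_avg)  # [3. 1. 2.]
```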
By default, the NumPy average, variance, and standard deviation functions aggregate all the values in a NumPy array to a single value.
Simple Average, Variance, Standard Deviation
What happens if you don’t specify any additional argument apart from the NumPy array on which you want to perform the operation (average, variance, standard deviation)?
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

print(np.average(x))
# 2.0

print(np.var(x))
# 2.4444444444444446

print(np.std(x))
# 1.5634719199411433
For example, the simple average of a NumPy array is calculated as follows:
(1+3+5+1+1+1+0+2+4)/9 = 18/9 = 2.0
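The variance and standard deviation can be checked by hand in the same way. The sketch below relies on the default behavior of np.var and np.std, which divide by the number of elements N (population variance); passing ddof=1 divides by N - 1 instead. The names manual_var, manual_std, and sample_var are illustrative.

```python
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

mean = x.sum() / x.size                          # 18 / 9 = 2.0
manual_var = ((x - mean) ** 2).sum() / x.size    # 22 / 9
manual_std = manual_var ** 0.5

print(np.isclose(manual_var, np.var(x)))  # True
print(np.isclose(manual_std, np.std(x)))  # True

# Sample variance: divide by N - 1 instead of N.
sample_var = np.var(x, ddof=1)            # 22 / 8
print(sample_var)
```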
Calculating Average, Variance, Standard Deviation Along an Axis
However, sometimes you want to compute these statistics along an axis.
For example, you may work at a large financial corporation and want to calculate the average price of each stock, given a large matrix of stock prices (rows = different stocks, columns = daily stock prices).
Here is how you can do this by specifying the keyword "axis" as an argument to the average, variance, and standard deviation functions:
import numpy as np

## Stock Price Data: 5 companies
# (row=[price_day_1, price_day_2, ...])
x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

avg, var, std = np.average(x, axis=1), np.var(x, axis=1), np.std(x, axis=1)

print("Averages: " + str(avg))
print("Variances: " + str(var))
print("Standard Deviations: " + str(std))

"""
Averages: [10.   1.5  7.   6.   3. ]
Variances: [2.5  0.25 8.5  4.5  0.  ]
Standard Deviations: [1.58113883 0.5        2.91547595 2.12132034 0.        ]
"""
Note that you apply these three functions along axis=1, i.e., this is the axis that is aggregated to a single value. Hence, each resulting NumPy array has one dimension fewer than the input: the 5x4 matrix becomes a one-dimensional array with one value per stock.
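The change in dimensionality can be inspected through the shape attribute. As a sketch, the example below also uses the keepdims argument of np.std, which is not covered above: it keeps the aggregated axis with length 1 instead of dropping it. The name std_kept is illustrative.

```python
import numpy as np

x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

print(x.shape)                      # (5, 4)
print(np.average(x, axis=1).shape)  # (5,)  axis 1 is gone
print(np.average(x, axis=0).shape)  # (4,)  axis 0 is gone

# keepdims=True keeps the aggregated axis with length 1.
std_kept = np.std(x, axis=1, keepdims=True)
print(std_kept.shape)               # (5, 1)
```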
High-Dimensional Averaging Along An Axis
Of course, you can also perform this averaging along an axis for high-dimensional NumPy arrays. Conceptually, you’ll always aggregate the axis you specify as an argument.
Here is an example:
import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

print(np.average(x, axis=2))
print(np.var(x, axis=2))
print(np.std(x, axis=2))

"""
[[1.5 1. ]
 [1.  1.5]
 [0.5 0. ]]
[[0.25 0.  ]
 [0.   0.25]
 [0.25 0.  ]]
[[0.5 0. ]
 [0.  0.5]
 [0.5 0. ]]
"""
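To make the effect of each axis on this three-dimensional array concrete, the following sketch prints the result shape for every axis. It also passes a tuple of axes, which np.average accepts to aggregate over several axes at once; the name per_block is illustrative.

```python
import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

print(x.shape)  # (3, 2, 2)

# Averaging along any single axis removes exactly that axis.
for axis in range(x.ndim):
    print(axis, np.average(x, axis=axis).shape)
# 0 (2, 2)
# 1 (3, 2)
# 2 (3, 2)

# A tuple of axes aggregates over both inner axes,
# leaving one average per 2x2 block.
per_block = np.average(x, axis=(1, 2))
print(per_block)  # [1.25 1.25 0.25]
```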
Where to Go From Here?
Solid programming skills are the foundation of your thorough education as a data scientist and machine learning expert. Master Python first!