This article explains how to calculate basic statistics such as the average, variance, and standard deviation of NumPy arrays.
To average a NumPy array x along an axis, call np.average() with arguments x and the axis identifier. For example, np.average(x, axis=1) averages along axis 1. The outermost dimension has axis identifier “0”, the second-outermost dimension has identifier “1”. NumPy collapses the identified axis and replaces it with the axis average, which reduces the dimensionality of the resulting array by one.
Here’s what you want to achieve:
Extracting basic statistics such as the average, variance, and standard deviation from NumPy arrays and 2D matrices is a critical component of analyzing a wide range of data sets such as financial data, health data, or social media data. With the rise of machine learning and data science, proficiency with NumPy's array operations is becoming more and more valuable in the marketplace.
Here is how you can accomplish this task in NumPy:
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

print(np.average(x, axis=1))
# [3. 1. 2.]

print(np.var(x, axis=1))
# [2.66666667 0.         2.66666667]

print(np.std(x, axis=1))
# [1.63299316 0.         1.63299316]
NumPy internally represents data using NumPy arrays (np.array). These arrays can have an arbitrary number of dimensions. In the figure above, we show a two-dimensional NumPy array, but in practice the array can have much higher dimensionality. You can quickly identify the dimensionality of a NumPy array by counting the number of opening brackets “[” at the start of the array literal. (The more formal alternative would be to use the
Each dimension has its own axis identifier.
💡 Rule of thumb: The outermost dimension has the identifier “0”, the second-outermost dimension has the identifier “1”, and so on.
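The rule of thumb can be checked directly on an array. The following is a minimal sketch that uses the ndim and shape attributes of NumPy arrays; the second array y is a made-up example that is not part of the original article.

```python
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

# ndim is the number of dimensions (axes) of the array.
print(x.ndim)   # 2

# shape lists the length of each axis, outermost axis (identifier 0) first.
print(x.shape)  # (3, 3)

# Hypothetical three-dimensional array: axis 0 has length 2,
# axis 1 has length 3, and axis 2 (the innermost) has length 4.
y = np.zeros((2, 3, 4))
print(y.ndim, y.shape)  # 3 (2, 3, 4)
```

The position of a length inside shape is the axis identifier you pass as the axis argument.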
By default, the NumPy average, variance, and standard deviation functions aggregate all the values in a NumPy array to a single value.
Simple Average, Variance, Standard Deviation
What happens if you don’t specify any additional argument apart from the NumPy array on which you want to perform the operation (average, variance, standard deviation)?
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

print(np.average(x))
# 2.0

print(np.var(x))
# 2.4444444444444446

print(np.std(x))
# 1.5634719199411433
For example, the simple average of a NumPy array is calculated as follows:
(1+3+5+1+1+1+0+2+4)/9 = 18/9 = 2.0
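The variance and standard deviation can be reproduced by hand in the same way. The sketch below assumes the default behavior of np.var() and np.std(), which divide by the number of values (the population variance, ddof=0); the variable names are chosen here for illustration only.

```python
import numpy as np

x = np.array([[1, 3, 5],
              [1, 1, 1],
              [0, 2, 4]])

values = x.flatten()           # all nine values in one flat array
n = values.size                # 9

mean = values.sum() / n        # 18 / 9 = 2.0

# Variance: the average squared distance from the mean.
# The sum of squared deviations is 22, so the variance is 22 / 9.
variance = ((values - mean) ** 2).sum() / n

# Standard deviation: the square root of the variance.
std = variance ** 0.5

print(mean, variance, std)
```

The three printed values should agree with np.average(x), np.var(x), and np.std(x) up to floating-point rounding.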
Calculating Average, Variance, Standard Deviation Along an Axis
However, sometimes you want to apply these functions along a single axis rather than over the whole array.
For example, you may work at a large financial corporation and want to calculate the average value of a stock price — given a large matrix of stock prices (rows = different stocks, columns = daily stock prices).
Here is how you can do this by specifying the keyword “axis” as an argument to the average, variance, and standard deviation functions:
import numpy as np

## Stock Price Data: 5 companies
# (row=[price_day_1, price_day_2, ...])
x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

avg, var, std = np.average(x, axis=1), np.var(x, axis=1), np.std(x, axis=1)

print("Averages: " + str(avg))
print("Variances: " + str(var))
print("Standard Deviations: " + str(std))

"""
Averages: [10.   1.5  7.   6.   3. ]
Variances: [2.5  0.25 8.5  4.5  0.  ]
Standard Deviations: [1.58113883 0.5        2.91547595 2.12132034 0.        ]
"""
Note that all three functions are applied along axis=1, i.e., this is the axis that is aggregated to a single value, so each row (one stock) yields one number. Hence, the resulting NumPy arrays have a reduced dimensionality: they are one-dimensional, with one entry per stock.
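To see what the choice of axis changes, the same stock matrix can be aggregated along axis 0 instead. This is an illustrative sketch, not part of the original example: axis=0 collapses the rows, so you get one average per day across all five companies.

```python
import numpy as np

# Same stock price data: rows = companies, columns = days.
x = np.array([[8, 9, 11, 12],
              [1, 2, 2, 1],
              [2, 8, 9, 9],
              [9, 6, 6, 3],
              [3, 3, 3, 3]])

per_stock = np.average(x, axis=1)  # collapses the columns: one value per row
per_day = np.average(x, axis=0)    # collapses the rows: one value per column

print(x.shape)          # (5, 4)
print(per_stock.shape)  # (5,)
print(per_day.shape)    # (4,)
print(per_day)
```

The axis you pass is the one that disappears from the shape of the result.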
High-Dimensional Averaging Along An Axis
Of course, you can also perform this averaging along an axis for high-dimensional NumPy arrays. Conceptually, you’ll always aggregate the axis you specify as an argument.
Here is an example:
import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

print(np.average(x, axis=2))
print(np.var(x, axis=2))
print(np.std(x, axis=2))

"""
[[1.5 1. ]
 [1.  1.5]
 [0.5 0. ]]
[[0.25 0.  ]
 [0.   0.25]
 [0.25 0.  ]]
[[0.5 0. ]
 [0.  0.5]
 [0.5 0. ]]
"""
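A quick way to keep track of which axis is aggregated is to compare shapes before and after. The sketch below reuses the three-dimensional array from the example and records the result shape for each axis; the shapes dictionary is a helper introduced here for illustration.

```python
import numpy as np

x = np.array([[[1, 2], [1, 1]],
              [[1, 1], [2, 1]],
              [[1, 0], [0, 0]]])

print(x.shape)  # (3, 2, 2)

# Average along each axis in turn and record the shape of the result.
shapes = {axis: np.average(x, axis=axis).shape for axis in range(x.ndim)}
for axis, shape in shapes.items():
    print(axis, shape)
# 0 (2, 2)
# 1 (3, 2)
# 2 (3, 2)

# A negative axis counts from the innermost dimension, so axis=-1
# refers to the same axis as axis=2 for this array.
last = np.average(x, axis=-1)
```

In each case the aggregated axis is dropped from the shape, and the remaining axes keep their order.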
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com. He is the author of the popular programming book Python One-Liners (NoStarch 2020), coauthor of the Coffee Break Python series of self-published books, a computer science enthusiast, a freelancer, and the owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills.