Want to calculate the standard deviation of a column in your Pandas DataFrame?
In case you attended your last statistics course a few years ago, let’s quickly recap the definitions: the variance is the average squared deviation of the list elements from the average value, and the standard deviation is the square root of the variance.
You can do this by using the df.std() method, which calculates the standard deviation of each numeric column. You can then get the column you’re interested in after the computation.
import pandas as pd

# Create your Pandas DataFrame
d = {'username': ['Alice', 'Bob', 'Carl'],
     'age': [18, 22, 43],
     'income': [100000, 98000, 111000]}
df = pd.DataFrame(d)
print(df)
Your DataFrame looks like this:
| | username | age | income |
| 0 | Alice | 18 | 100000 |
| 1 | Bob | 22 | 98000 |
| 2 | Carl | 43 | 111000 |
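Here’s how you can calculate the standard deviation of all numeric columns. Passing numeric_only=True skips the text column username, which recent pandas versions would otherwise refuse to process:
print(df.std(numeric_only=True))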
The output is the standard deviation of each numeric column:
age          13.428825
income     7000.000000
dtype: float64
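To see where the age value comes from, here is a minimal sketch that recomputes it by hand in plain Python. It assumes the sample standard deviation, which divides by n - 1 and is the default in pandas (ddof=1). The variable names are chosen for illustration only.

```python
import math

ages = [18, 22, 43]

# Average value of the list
mean = sum(ages) / len(ages)

# Squared deviation of each element from the average
squared_devs = [(x - mean) ** 2 for x in ages]

# Sample variance: divide by n - 1, matching the pandas default (ddof=1)
sample_var = sum(squared_devs) / (len(ages) - 1)

# Standard deviation is the square root of the variance
sample_std = math.sqrt(sample_var)

print(round(sample_var, 6))  # 180.333333
print(round(sample_std, 6))  # 13.428825
```

Note that 180.33 is the variance of the age column, while 13.43 is its standard deviation.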
To get the standard deviation of an individual column, access it using simple indexing:
print(df.std(numeric_only=True)['age'])
# approximately 13.4288
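As an alternative, you can select the column first and call std() on the resulting Series, which avoids computing the statistic for columns you don’t need. The sketch below rebuilds the same DataFrame and also shows the ddof parameter: pandas divides by n - 1 by default (ddof=1), and ddof=0 gives the population standard deviation instead. The variable names age_std and age_std_pop are chosen for illustration.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({'username': ['Alice', 'Bob', 'Carl'],
                   'age': [18, 22, 43],
                   'income': [100000, 98000, 111000]})

# Select the column first, then compute its standard deviation
age_std = df['age'].std()            # sample std, ddof=1 by default
age_std_pop = df['age'].std(ddof=0)  # population std, divides by n

print(round(age_std, 6))      # 13.428825
print(round(age_std_pop, 2))  # 10.96
```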
Standard Deviation in NumPy Library
NumPy, Python’s package for data science computation, also has great statistics functionality. You can calculate all basic statistics such as average, median, variance, and standard deviation on NumPy arrays. Simply import the NumPy library and use the np.std(a) function to calculate the standard deviation of the NumPy array a.
Here’s the code:
import numpy as np

a = np.array([1, 2, 3])
print(np.std(a))
# 0.816496580927726
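One detail worth knowing when you compare results: np.std() divides by n by default (population standard deviation, ddof=0), while the pandas std() method divides by n - 1 (sample standard deviation, ddof=1). The short sketch below shows how the ddof argument makes NumPy match the pandas default. The variable names are chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

# NumPy default: population standard deviation (divide by n)
pop_std = np.std(a)

# ddof=1 switches to the sample standard deviation (divide by n - 1),
# which is what pandas computes by default
sample_std = np.std(a, ddof=1)

print(pop_std)     # 0.816496580927726
print(sample_std)  # 1.0
```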
Where to Go From Here?
Before you can become a data science master, you first need to master Python. Join my free Python email course and receive your daily Python lesson directly in your INBOX. It’s fun!