Z-scores can be used to compare data measured on different scales and to normalize data for machine learning algorithms.
💡 Note: There are different methods to calculate the z-score. The quickest and easiest one is scipy.stats.zscore().
What is the z-score?
The z-score is used for standardization: it makes variables with different scales and different means comparable.
The formula for the z-score is simple, so it is not a complicated transformation:
z-score = (datapoint − mean) / standard deviation
The statistical expression is
z = (X − μ) / σ
where X is the data point, μ is the mean, and σ is the standard deviation.
The z-score then tells us how many standard deviations a value lies above or below the mean. After the transformation, the values have a mean of 0 and a standard deviation (and therefore a variance) of 1. This way, differently scaled variables end up on a common scale and become comparable.
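To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy. The sample values are made up for illustration and are not part of the example used later in this article.

```python
import numpy as np

# Hypothetical sample, chosen so that the mean is 5 and the standard deviation is 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()     # mean of the data: 5.0
sigma = values.std()   # population standard deviation (NumPy default, ddof=0): 2.0

z = (values - mu) / sigma   # z = (X - mu) / sigma, applied element-wise

print(z)          # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(z.mean())   # 0.0
print(z.std())    # 1.0
```

The value 9, for example, gets a z-score of 2 because it lies two standard deviations above the mean.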
This is useful when the same kind of item is measured in different ways, for example when comparing measurements in mm and inches or test results with different maximum scores.
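A short sketch of the mm and inch case, assuming three hypothetical lengths: the same items measured in either unit end up with identical z-scores, because the unit cancels out in the formula. The helper name zscore is just for this illustration.

```python
import numpy as np

# Hypothetical lengths of the same three items, measured in millimetres
lengths_mm = np.array([120.0, 150.0, 180.0])
lengths_inch = lengths_mm / 25.4   # 1 inch = 25.4 mm

def zscore(values):
    """Return (x - mean) / std using the population standard deviation."""
    return (values - values.mean()) / values.std()

z_mm = zscore(lengths_mm)
z_inch = zscore(lengths_inch)

print(z_mm)
print(z_inch)
print(np.allclose(z_mm, z_inch))   # True: both units give the same z-scores
```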
So we'll try this on an example.
Example z-score
This term, Frank has reached 48, 33 and 41 points on the tests in math and 82, 98 and 75 points on the tests in English.
💬 Question: Is Frank better in English than in math?
We don't know, because the maximum is 50 points for the math tests and 100 points for the English tests, so we cannot compare these results directly.
But we can test our question with the z-score by normalizing and comparing the means.
First, we load our packages and create a data frame with the test results.
import pandas as pd
import numpy as np
import scipy.stats as stats

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)
The data frame with the test results looks like this:
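Printing the data frame is a quick way to inspect it. A minimal sketch that rebuilds the data frame and prints it; the commented output is what pandas should display for these values.

```python
import pandas as pd

# One column per subject, one row per test
test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)

print(test_scores)
#    math  english
# 1    48       82
# 2    33       98
# 3    41       75
```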
How to Calculate z-scores with Pandas?
To calculate the z-scores in pandas, we just apply the formula to our data.
z_test_scores = (test_scores-test_scores.mean())/(test_scores.std())
We have now standardized each column and can tell, for each test result, how many standard deviations it lies from its column mean.
z_test_scores.apply(stats.zscore)
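As a quick sanity check on the pandas formula, the sketch below recomputes the z-scores and confirms that each column ends up with a mean of 0 and a sample standard deviation of 1. It only assumes the test_scores data frame from this example.

```python
import pandas as pd

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)

# Subtract the column mean and divide by the column standard deviation.
# pandas uses ddof=1 (division by N-1) for std() by default.
z_test_scores = (test_scores - test_scores.mean()) / test_scores.std()

print(z_test_scores.round(3))
print(z_test_scores.mean().round(10))   # each column mean is numerically 0
print(z_test_scores.std())              # each column has a sample std of 1
```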
⚡ Important: By default, pandas and NumPy calculate the standard deviation differently: pandas uses the sample standard deviation and NumPy the population standard deviation. This can be changed with the delta degrees of freedom parameter: pass ddof=0 in pandas to match NumPy, or ddof=1 in NumPy to match pandas.
In pandas, the default is ddof=1, so the standard deviation is normalized by N-1.
For NumPy and scipy.stats.zscore, which is based on NumPy, the default is ddof=0, so the divisor is N.
Just be aware of where this difference comes from.
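The difference can be seen directly on the math scores. This is a small sketch using only the std() functions of both libraries and their ddof parameter.

```python
import numpy as np
import pandas as pd

math_scores = pd.Series([48, 33, 41])

std_pandas = math_scores.std()               # ddof=1 by default, divides by N-1
std_numpy = np.std(math_scores.to_numpy())   # ddof=0 by default, divides by N

print(std_pandas)   # about 7.51
print(std_numpy)    # about 6.13

# Setting ddof explicitly makes the two libraries agree
print(np.isclose(math_scores.std(ddof=0), std_numpy))                   # True
print(np.isclose(np.std(math_scores.to_numpy(), ddof=1), std_pandas))   # True
```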
How to z-transform in Python with SciPy.Stats?
SciPy offers the quickest option in its stats module: scipy.stats.zscore(data). We'll use this on our test scores.
stats.zscore(test_scores)
This will standardize each column. The output shows slightly different values than in pandas, because SciPy uses ddof=0 by default.
Applying the zscore() function to a pandas data frame column by column with apply() will deliver the same results.
test_scores.apply(stats.zscore)
If we set the delta degrees of freedom to 1 (normalization by N-1, as in pandas), we receive the same results as with the pandas formula above.
stats.zscore(test_scores, ddof=1)
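The following sketch compares the SciPy results with the manual pandas formula. It wraps the SciPy output in np.asarray(), which is an addition here so that the comparison works whether SciPy returns an array or a data frame.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)

# Manual pandas formula, which uses ddof=1 for std()
z_pandas = ((test_scores - test_scores.mean()) / test_scores.std()).to_numpy()

z_scipy_default = np.asarray(stats.zscore(test_scores))         # ddof=0
z_scipy_ddof1 = np.asarray(stats.zscore(test_scores, ddof=1))   # ddof=1

print(np.allclose(z_scipy_default, z_pandas))   # False
print(np.allclose(z_scipy_ddof1, z_pandas))     # True

# The two versions differ only by the constant factor sqrt(N / (N - 1))
n = len(test_scores)
print(np.allclose(z_scipy_default, z_pandas * np.sqrt(n / (n - 1))))   # True
```

With only three tests per subject the factor is about 1.22, so the difference is clearly visible. For large N it becomes negligible.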
To answer the question (in which subject was Frank better this term?), we take the mean score per subject and pass the two means into the same function.
stats.zscore(test_scores.mean())
This tells us that Frank was better in English than in math!
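A sketch of this step is shown below. One thing to keep in mind: with only two values, the z-scores are always -1 and 1 no matter how far apart the means are, so the sign only shows which mean is larger. As a cross-check that is not part of the z-score approach, the sketch also expresses each mean as a share of the maximum points stated above (50 for math, 100 for English).

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)

mean_scores = test_scores.mean()   # math: about 40.67, english: 85.0
z_means = stats.zscore(mean_scores.to_numpy())

print(z_means)   # approximately [-1.  1.], in the order math, english

# Cross-check: mean score as a share of the maximum points per subject
max_points = pd.Series({"math": 50, "english": 100})
share_of_max = mean_scores / max_points
print(share_of_max.round(3))   # math 0.813, english 0.850
```

Both views point in the same direction for this data: Frank's English results are somewhat better than his math results.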
How to Calculate z-scores with NumPy?
The z-transformation in NumPy works similarly to pandas.
First, we turn our data frame into a NumPy array and apply the same formula. We have to pass axis=0 to receive the same results as with stats.zscore(), because by default NumPy's mean() and std() are computed over the flattened array rather than per column.
test_scores_np = test_scores.to_numpy()
z_test_scores_np = (test_scores_np - np.mean(test_scores_np, axis=0)) / np.std(test_scores_np, axis=0)
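To illustrate why the axis argument matters, this sketch computes the z-scores once per column and once without an axis, and compares the per-column version with stats.zscore(). The variable names z_columns and z_flat are chosen for this illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)
test_scores_np = test_scores.to_numpy()

# axis=0: one mean and one standard deviation per column (per subject)
z_columns = (test_scores_np - np.mean(test_scores_np, axis=0)) / np.std(test_scores_np, axis=0)

# No axis: a single mean and standard deviation over all six values,
# which mixes the two subjects
z_flat = (test_scores_np - np.mean(test_scores_np)) / np.std(test_scores_np)

print(np.allclose(z_columns, stats.zscore(test_scores_np)))   # True
print(np.allclose(z_columns, z_flat))                         # False
```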
How to Calculate z-scores with sklearn Standard Scaler?
For standardization in machine learning pipelines, scikit-learn also offers a z-transform in the form of the StandardScaler() class.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(test_scores)
This will also return an array with the same values.
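A minimal check of that claim, assuming scikit-learn is installed: StandardScaler divides by the population standard deviation, so its output should match scipy.stats.zscore() with the default ddof=0.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler

test_scores = pd.DataFrame(
    {"math": [48, 33, 41], "english": [82, 98, 75]},
    index=[1, 2, 3],
)

scaler = StandardScaler()
z_sklearn = scaler.fit_transform(test_scores)   # one column per subject

print(scaler.mean_)    # per-column means learned during fit
print(scaler.scale_)   # per-column population standard deviations

z_scipy = np.asarray(stats.zscore(test_scores))
print(np.allclose(z_sklearn, z_scipy))   # True
```

Because the scaler stores the fitted mean and scale, it can later standardize new data with the same parameters via scaler.transform().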
Summary
We have now looked at four different ways to standardize data in Python with the z-score: pandas, SciPy, NumPy, and scikit-learn. One of them will surely work for you.