In this post, we will go through the basics of the Min-Max scaler. We will also focus on how to scale specific columns in a pandas DataFrame.
What is a Min-Max Scaler?
Min-Max scaling is a normalization technique that enables us to scale data in a dataset to a specific range using each feature’s minimum and maximum value.
Unlike standard scaling, which rescales each feature to have a mean of 0 and a standard deviation of 1, the min-max scaler uses each column’s minimum and maximum value to scale the data series.
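To make the difference concrete, here is a minimal sketch of both formulas applied to a small made-up series (the values are illustrative and not taken from any dataset). Min-max scaling computes (x - min) / (max - min), whereas standard scaling computes (x - mean) / std.

```python
import numpy as np

# Hypothetical values, for illustration only
x = np.array([10.0, 20.0, 30.0, 50.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: centre on the mean, divide by the standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)                              # values 0, 0.25, 0.5 and 1
print(x_standard.mean(), x_standard.std())   # approximately 0 and 1
```

Note that min-max scaling fixes the end points of the range, while standard scaling fixes the mean and spread and leaves the range unbounded.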
But why is this even required?
- The scale of some features may differ significantly from that of others, which may harm the performance of our models. This is especially the case with algorithms that rely on a measure of distance, such as KNN and k-means, and with models trained by gradient descent, such as neural networks.
- It is also helpful for optimising machine learning processes like gradient descent and enables convergence to happen faster.
- It can help improve the performance and speed of algorithms. When all features share a similar, small range, the numerical optimisation that many algorithms rely on tends to behave better and can finish sooner.
- It can also be helpful when comparing different datasets or models in terms of their performances.
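The first point in the list can be illustrated with a small sketch. The three patients and the feature ranges below are made up for illustration; the helper names `euclidean` and `minmax` are hypothetical. On the raw scale, the glucose feature dominates the Euclidean distance, and the nearest neighbour changes once both features are scaled.

```python
import numpy as np

# Hypothetical patients: [number of pregnancies, plasma glucose]
a = np.array([1.0, 100.0])
b = np.array([9.0, 105.0])   # very different pregnancies, similar glucose
c = np.array([1.0, 120.0])   # same pregnancies, somewhat higher glucose

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((u - v) ** 2))

# Assumed per-feature minimum and maximum, for illustration only
mins = np.array([0.0, 50.0])
maxs = np.array([10.0, 200.0])

def minmax(v):
    """Min-max scale a feature vector using the assumed ranges."""
    return (v - mins) / (maxs - mins)

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab = euclidean(minmax(a), minmax(b))
scaled_ac = euclidean(minmax(a), minmax(c))

print(raw_ab < raw_ac)        # True: on raw data, b looks closer to a
print(scaled_ab > scaled_ac)  # True: after scaling, c is closer to a
```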
The Min-Max scaler, implemented in the sklearn library, has been used in many Machine Learning applications such as computer vision, natural language processing, and speech recognition.
We will use sklearn’s MinMaxScaler to apply this technique to all columns of a pandas DataFrame.
We will also show how to implement this on specific columns in a dataframe using two methods in this tutorial. I will describe all these below with examples from the Pima Indian diabetes dataset.
Method 1: sklearn.preprocessing MinMaxScaler()
We will use a popular diabetes dataset, the Pima Indian diabetes dataset from UCI, to show various ways we can implement min-max scaling.
By far, the easiest way is to use the sklearn library and its preprocessing module.
But first, let’s get the data into our dataframe using the pandas library and perform some EDA.
import pandas as pd

columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.data.csv', names=columns)
data.head()
We have nine columns, with the last being the class that we are trying to predict with our model.
The rows with class 1 indicate that the patient has diabetes, and those with class 0 indicate that the patient tested negative for diabetes. The features do not share the same unit or scale.
Take, for example, the first two columns (preg and plas): preg, which indicates how many times the patient has been pregnant, is typically a single-digit number, while plas, the patient’s plasma glucose, is in the tens or hundreds.
Let’s describe the data to see the distribution of each column.
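A summary like this can be produced with the pandas `describe()` method. Here is a minimal sketch that uses a few made-up rows in place of the CSV file, which may not be at hand; the frame name `sample` and its values are assumptions for illustration.

```python
import pandas as pd

# Stand-in rows using two of the column names from above; values are made up
sample = pd.DataFrame({
    'preg': [6, 1, 8, 1, 0],
    'plas': [148, 85, 183, 89, 137],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = sample.describe()

# The min and max rows are the ones min-max scaling relies on
print(summary.loc[['min', 'max']])
```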
Graphically, we can see how the data are dispersed below.
data[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2));
The graphs above clearly show that the features are not on the same scale. However, with sklearn’s min-max scaler, we can ensure the columns use the same scale.
Let’s separate the data into input and output first.
# Separating the data into input and output components
X = data.drop('class', axis=1)
Y = data['class']  # class is the output
X.head()
Let us scale all the features to a common range from 0 to 1 using sklearn:
from sklearn.preprocessing import MinMaxScaler

# We create a copy so we can still refer to the original dataframe later
X_copy = X.copy()
scaler = MinMaxScaler()
X_columns = X.columns
X_scaled = pd.DataFrame(scaler.fit_transform(X_copy), columns=X_columns)
X_scaled.head()
We can describe the data in X_scaled to show each column’s minimum and maximum values.
They are now 0 and 1, respectively, for all columns, so every feature is on the same scale.
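One way to run that check is sketched below. It uses a small made-up frame, `X_demo`, as a stand-in for X (an assumption so that the snippet runs without the CSV file) and pulls out only the minimum and maximum of each scaled column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for X; the values are made up for illustration
X_demo = pd.DataFrame({
    'preg': [6, 1, 8, 1, 0],
    'plas': [148, 85, 183, 89, 137],
    'age': [50, 31, 32, 21, 33],
})

scaler = MinMaxScaler()  # the default feature_range is (0, 1)
X_demo_scaled = pd.DataFrame(
    scaler.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

# One row of minima and one row of maxima, per column
extremes = X_demo_scaled.agg(['min', 'max'])
print(extremes)
```

Calling `X_demo_scaled.describe()` would show the same two rows alongside the other summary statistics.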
Now let’s assume only a subset of the columns is to be scaled. For example, consider a situation in which we only need to adjust the columns preg and plas while the other columns retain their scales; how do we do that?
Again, we can use the min-max scaler of the sklearn package to do that as follows:
from sklearn.preprocessing import MinMaxScaler

X_copy = X.copy()
scaler = MinMaxScaler()
# Fit and transform only the two selected columns
X_copy[['preg', 'plas']] = scaler.fit_transform(X_copy[['preg', 'plas']])
X_copy.head()
We can see that only preg and plas are scaled. We can also show that both columns’ minimum and maximum values are now 0 and 1, respectively.
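A minimal sketch of that check follows, again on a made-up stand-in frame (`X_demo` and `cols_to_scale` are hypothetical names). It confirms that the selected columns span 0 to 1 and that an unselected column is left untouched.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for X; values are made up and stored as floats
X_demo = pd.DataFrame({
    'preg': [6.0, 1.0, 8.0, 1.0, 0.0],
    'plas': [148.0, 85.0, 183.0, 89.0, 137.0],
    'age': [50.0, 31.0, 32.0, 21.0, 33.0],
})

cols_to_scale = ['preg', 'plas']
X_part = X_demo.copy()
X_part[cols_to_scale] = MinMaxScaler().fit_transform(X_part[cols_to_scale])

# The scaled columns now span 0 to 1
print(X_part[cols_to_scale].agg(['min', 'max']))

# The column we did not select is unchanged
print(X_part['age'].equals(X_demo['age']))
```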
Method 2: Explicit Calculation
We can write our own function to calculate the scaled values of a column as follows. The same calculation is essentially what the min-max scaler of sklearn does under the hood.
def scale_column(df, column):
    # Min-max scale one column in place: (x - min) / (max - min)
    column_min = df[column].min()
    column_max = df[column].max()
    # Vectorised, so it works for any index and avoids a slow row-by-row loop
    df[column] = (df[column] - column_min) / (column_max - column_min)
We create a copy of our dataframe again (we want to keep the original dataframe to show more examples later on).
We then use our function to scale specific columns in the dataframe as follows:
X_copy2 = X.copy()
scale_column(X_copy2, 'preg')
scale_column(X_copy2, 'plas')
X_copy2.head()
We can see the output is the same as what we got using the sklearn package above.
We can also describe the dataframe and show the values in both columns that we scaled are now between 0 and 1.
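To check the agreement between the two methods numerically rather than by eye, a comparison along the following lines could be used. This is a sketch on made-up data (`X_demo` is a hypothetical stand-in for X); it applies the min-max formula column-wise with pandas and compares the result with the sklearn output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for X; the values are made up for illustration
X_demo = pd.DataFrame({
    'preg': [6.0, 1.0, 8.0, 1.0, 0.0],
    'plas': [148.0, 85.0, 183.0, 89.0, 137.0],
})
cols = ['preg', 'plas']

# Explicit calculation: (x - min) / (max - min), column by column
col_min = X_demo[cols].min()
col_max = X_demo[cols].max()
manual = (X_demo[cols] - col_min) / (col_max - col_min)

# The same columns scaled with sklearn
via_sklearn = MinMaxScaler().fit_transform(X_demo[cols])

# allclose tolerates tiny floating-point differences between the two routes
same = np.allclose(manual.to_numpy(), via_sklearn)
print(same)
```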
In this post, we have reviewed min-max scaling, why we need it to optimize algorithms, and how to apply the min-max scaler to an entire dataset in a pandas DataFrame.
We also explored scaling specific columns in such a dataframe using a min-max scaler.
We discussed two approaches for this, one based on the sklearn package and the other using a function we wrote ourselves.