The Complete Guide to Min-Max Scaler in Machine Learning with Ease

In this post, we will go through the basics of the Min-Max scaler. We will also focus on how to scale specific columns in a pandas DataFrame.

What is a Min-Max Scaler?

Min-Max scaling is a normalization technique that enables us to scale data in a dataset to a specific range using each feature’s minimum and maximum value.
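As a small illustration of the definition above, the usual min-max formula maps each value x to (x - min) / (max - min), which places the result between 0 and 1. The sketch below applies that formula to a made-up list of numbers. The helper name min_max_scale and the sample readings are illustrative assumptions, not part of any library.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to new_min by convention
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


# Hypothetical glucose-like readings, chosen only for illustration
readings = [50, 100, 150, 200]
scaled = min_max_scale(readings)
print(scaled)
```

The smallest reading maps to 0, the largest to 1, and everything else falls proportionally in between.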

Unlike standard scaling, which rescales each feature to have a mean of 0 and a standard deviation of 1, the min-max scaler uses each column’s minimum and maximum values to rescale the data.
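To make the contrast concrete, here is a minimal sketch that runs both scalers on the same single feature. The values are made up for illustration; MinMaxScaler and StandardScaler are the scikit-learn classes for the two techniques.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature; the values are made up for illustration
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

x_minmax = MinMaxScaler().fit_transform(x)
x_standard = StandardScaler().fit_transform(x)

# Min-max output is bounded by the chosen range (0 to 1 by default)
print(x_minmax.min(), x_minmax.max())
# Standardised output has mean ~0 and standard deviation ~1, but no fixed bounds
print(x_standard.mean(), x_standard.std())
```

Note that the standardised values are not confined to a fixed interval, whereas the min-max values are.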

But why is this even required?

The scale of some features may differ significantly from that of others, which can harm the performance of our models. This is especially the case with algorithms that rely on distance measures, such as KNN, and with models trained by gradient-based methods, such as neural networks.
It also helps optimisation procedures such as gradient descent converge faster.
It can help improve the performance and execution speed of algorithms, since optimisation over features with similar, bounded ranges tends to be better behaved numerically.
It can also be helpful when comparing different datasets or models in terms of their performance.
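The point about distance-based algorithms can be sketched with three hypothetical patients described by two features. All values, and the feature ranges used for scaling, are assumptions chosen for illustration.

```python
import math

# Hypothetical patients described by (preg, plas); values are made up
a = (1, 100)
b = (10, 105)   # very different pregnancy count, similar glucose
c = (1, 130)    # same pregnancy count, moderately different glucose

# On raw values, the glucose column dominates because its numbers are larger
print(math.dist(a, b), math.dist(a, c))

# Assumed feature ranges used for scaling: preg in [0, 17], plas in [0, 199]
def scale(p):
    return (p[0] / 17, p[1] / 199)

# After scaling, both features contribute on a comparable footing
print(math.dist(scale(a), scale(b)), math.dist(scale(a), scale(c)))
```

On the raw values, b is the nearer neighbour of a. After scaling, c is nearer, so a distance-based model such as KNN could give a different answer depending on whether the features were scaled.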

The Min-Max scaler, implemented in the scikit-learn (sklearn) library, is used in many machine learning applications, such as computer vision, natural language processing, and speech recognition.

We will use the following sklearn method to apply this technique to all columns of a pandas DataFrame.

sklearn.preprocessing.MinMaxScaler().fit_transform()

In this tutorial, we will also show two ways to apply it to specific columns of a dataframe. I will describe all of these below with examples from the Pima Indian diabetes dataset.

Method 1: sklearn.preprocessing MinMaxScaler()

We will use a popular diabetes dataset, the Pima Indian diabetes dataset from UCI, to show various ways we can implement min-max scaling.

By far the easiest way is to use the sklearn library and its preprocessing module.

But first, let’s get the data into our dataframe using the pandas library and perform some EDA.

import pandas as pd
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima-indians-diabetes.data.csv', names=columns)
data.head()

We have nine columns, with the last being the class that we are trying to predict with our model.

Rows with class 1 indicate that the patient has diabetes, and rows with class 0 indicate that the patient tested negative for diabetes. The features do not share the same units or scale.

Take, for example, the first two columns (preg and plas). preg, which indicates how many times the patient has been pregnant, takes mostly single-digit values, while plas, the patient’s plasma glucose, is in the tens or hundreds.

Let’s describe the data to see the distribution of each column.

data.describe()

Graphically, we can see how the data are dispersed below.

data[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2));

The graphs above clearly show that the features are not of the same scale. However, with sklearn min-max scaler, we can ensure the columns use the same scale.

Let’s separate the data into input and output first.

# Separating the data into input and output components
X = data.drop('class', axis=1)
Y = data['class'] # class is the output
X.head()

Let us scale all the features to a common range of 0 to 1 using sklearn’s MinMaxScaler below:

from sklearn.preprocessing import MinMaxScaler
X_copy = X.copy()  # We create a copy so we can still refer to the original dataframe later
scaler = MinMaxScaler()
X_columns = X.columns
X_scaled = pd.DataFrame(scaler.fit_transform(X_copy), columns=X_columns)
X_scaled.head()

We can describe the data in X_scaled to show each column’s minimum and maximum values.

They are now 0 and 1 respectively for all columns, and they are now also of the same scale.

X_scaled.describe()
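The range does not have to be 0 to 1. As a minimal sketch, MinMaxScaler accepts a feature_range argument that sets the target interval. The small dataframe below is a stand-in with made-up values rather than the actual dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in frame with made-up values; the column names mirror the tutorial's
df = pd.DataFrame({'preg': [0, 2, 5, 10], 'plas': [80, 120, 150, 199]})

# feature_range sets the target interval; (0, 1) is the default
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.min().tolist(), scaled.max().tolist())
```

Each column's minimum now maps to -1 and its maximum to 1, up to floating-point rounding.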

Now let’s assume only a subset of the columns needs to be scaled. For example, suppose we only need to scale the columns preg and plas while the other columns retain their original scales; how do we do that?

Again we can use the min-max scaler of the sklearn package to do that as follows:

from sklearn.preprocessing import MinMaxScaler
X_copy = X.copy()
scaler = MinMaxScaler()
X_copy[['preg', 'plas']] = scaler.fit_transform(X_copy[['preg', 'plas']])
X_copy.head()

We can see only preg and plas are scaled. We can also show that both columns’ minimum and maximum values are 0 and 1, respectively, below.

X_copy.describe()
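As an alternative sketch for the same task, scikit-learn's ColumnTransformer can apply a scaler to chosen columns and pass the rest through unchanged. This is not the approach used above, just another option. The dataframe is a stand-in with made-up values, and the names to_scale and other are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in frame with made-up values
df = pd.DataFrame({
    'preg': [0, 2, 5, 10],
    'plas': [80, 120, 150, 200],
    'age': [21, 35, 50, 62],
})

to_scale = ['preg', 'plas']
ct = ColumnTransformer(
    [('minmax', MinMaxScaler(), to_scale)],
    remainder='passthrough',  # leave every other column untouched
)

# The output puts the transformed columns first, then the passthrough ones
other = [c for c in df.columns if c not in to_scale]
result = pd.DataFrame(ct.fit_transform(df), columns=to_scale + other)
print(result)
```

One thing to watch is column order: the transformed columns come first in the output, so the column labels have to be rebuilt accordingly, as done here.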

Method 2: Explicit Calculation

We can write our own function to calculate the scaled value of X as follows. This is essentially the same calculation that sklearn’s min-max scaler performs under the hood.

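def scale_column(df, column):
    """Min-max scale df[column] in place to the range 0 to 1."""
    column_max = df[column].max()
    column_min = df[column].min()
    # Vectorised over the whole column: no row loop, works with any index,
    # and avoids writing floats one by one into an integer column
    df[column] = (df[column] - column_min) / (column_max - column_min)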

We create a copy of our dataframe again (we want to keep the original dataframe to show more examples later on).

We then use our function to scale specific columns in the dataframe as follows:

X_copy2 = X.copy()
scale_column(X_copy2,'preg')
scale_column(X_copy2,'plas')
X_copy2.head()

We can see the output is the same as what we got using the sklearn package above.
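Rather than comparing the outputs by eye, the agreement can be checked programmatically. The sketch below does this on a small stand-in dataframe with made-up values, comparing the explicit formula with MinMaxScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny stand-in frame with made-up values
df = pd.DataFrame({'preg': [0, 2, 5, 10], 'plas': [80, 120, 150, 200]})

# Explicit formula, applied to each column at once
manual = (df - df.min()) / (df.max() - df.min())

# scikit-learn's scaler on the same data
with_sklearn = MinMaxScaler().fit_transform(df)

# allclose tolerates tiny floating-point differences between the two routes
print(np.allclose(manual.to_numpy(), with_sklearn))
```

A tolerance-based comparison is used because the two routes may differ in the last few decimal places.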

We can also describe the dataframe and show the values in both columns that we scaled are now between 0 and 1.

X_copy2.describe()

Conclusion

In this post, we have reviewed min-max scaling, why we need it to optimize algorithms, and how to apply min-max scaler to an entire dataset in a pandas data frame.

We also explored scaling specific columns in such a dataframe using a min-max scaler.

We discussed two approaches for this: one using the sklearn package and the other using a function we wrote ourselves.