💡 Problem Formulation: When dealing with numerical data in machine learning, certain algorithms can perform poorly if the feature values are on vastly different scales. Feature scaling normalizes the range of variables, leading to better performance during model training. For instance, consider an input dataset where the age feature ranges from 18 to 90, while the salary feature ranges from 20,000 to 120,000. Our goal is to scale these features to a consistent range, such as 0 to 1, to facilitate the algorithm’s ability to learn effectively.
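As a quick illustration of the idea (the helper below is purely illustrative, not part of any library), min-max scaling maps a value x to (x - min) / (max - min), so both age and salary end up between 0 and 1 regardless of their original units:

def min_max(x, lo, hi):
    # Map x from the interval [lo, hi] onto [0, 1].
    return (x - lo) / (hi - lo)

print(min_max(45, 18, 90))            # age 45     -> 0.375
print(min_max(50000, 20000, 120000))  # salary 50k -> 0.3

The methods below automate exactly this kind of transformation, and a few variations of it, across all rows and columns of a dataset.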
Method 1: Standardization with StandardScaler
Standardization transforms data to have a mean of 0 and a standard deviation of 1. The StandardScaler class in Python’s sklearn.preprocessing module is a widely used tool for this. It works well when the features are roughly normally distributed and is especially helpful for scale-sensitive algorithms such as Support Vector Machines and Logistic Regression.
Here’s an example:
from sklearn.preprocessing import StandardScaler

data = [[19, 20000], [35, 120000], [26, 35000]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Running this code prints a NumPy array of scaled values; rounded to four decimal places, it looks like this:
[[-1.1707 -0.8706]
 [ 1.2725  1.4005]
 [-0.1018 -0.5299]]
In this example, StandardScaler computes the mean and standard deviation of each feature and applies the transformation column by column. Each resulting value is the number of standard deviations the original data point lies from its feature’s mean, which puts both features on a comparable scale.
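To make the arithmetic concrete, here is a minimal sketch that reproduces the same result with plain NumPy. StandardScaler uses the population standard deviation (ddof=0), which is also NumPy’s default:

import numpy as np

data = np.array([[19, 20000], [35, 120000], [26, 35000]], dtype=float)

# Column-wise mean and population standard deviation (ddof=0).
mean = data.mean(axis=0)
std = data.std(axis=0)

manual = (data - mean) / std
print(manual)  # matches scaler.fit_transform(data) above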
Method 2: Normalization with MinMaxScaler
Normalization rescales the data to a fixed range, usually 0 to 1. The MinMaxScaler from Python’s sklearn.preprocessing module scales each feature by subtracting its minimum value and dividing by its range. It is a good choice when you want to preserve the shape of the original distribution and are using algorithms that make no assumption about the data’s distribution, such as k-Nearest Neighbors and neural networks.
Here’s an example:
from sklearn.preprocessing import MinMaxScaler

data = [[19, 20000], [35, 120000], [26, 35000]]
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The output of this code snippet is:
[[0.     0.    ]
 [1.     1.    ]
 [0.4375 0.15  ]]
This example shows how MinMaxScaler transforms the data so that the minimum value of each feature becomes 0 and the maximum becomes 1. This is particularly useful for algorithms that are sensitive to the scale of the data; for sparse inputs, where zero entries should stay zero, see MaxAbsScaler below.
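If you need a target range other than 0 to 1, MinMaxScaler accepts a feature_range argument. A minimal sketch using the same data as above:

from sklearn.preprocessing import MinMaxScaler

data = [[19, 20000], [35, 120000], [26, 35000]]

# Rescale each feature to [-1, 1] instead of the default [0, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))  # ages: -1, 1, -0.125; salaries: -1, 1, -0.7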
Method 3: MaxAbsScaler for Scaling Sparse Data
MaxAbsScaler, also found in the sklearn.preprocessing module, scales each feature by its maximum absolute value, so every value ends up in the range [-1, 1]. Because it neither shifts nor centers the data, it preserves the sparsity of a dataset, which makes it well suited to sparse inputs for algorithms like Support Vector Machines. It is meant for data that is already centered at zero; note that, unlike RobustScaler below, it is sensitive to large outliers, since a single extreme value determines the scale of its feature.
Here’s an example:
from sklearn.preprocessing import MaxAbsScaler

data = [[19, 20000], [35, 120000], [26, 35000]]
scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[0.54285714 0.16666667]
 [1.         1.        ]
 [0.74285714 0.29166667]]
By dividing each feature by its maximum absolute value, MaxAbsScaler maps the data into the range -1 to 1, which is particularly beneficial for sparse data where we want to maintain the data distribution and keep zero entries at zero.
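Because MaxAbsScaler only divides each column by a constant, it can work directly on SciPy sparse matrices without densifying them. A minimal sketch, assuming SciPy is installed:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A small sparse matrix with mostly zero entries.
sparse_data = csr_matrix([[0, 2.0], [-4.0, 0], [0, 0.5]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(sparse_data)

print(type(scaled))      # still a SciPy sparse matrix
print(scaled.toarray())  # zeros stay zero; values lie in [-1, 1]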
Method 4: RobustScaler for Data with Outliers
Outliers can significantly affect the mean and variance of data. RobustScaler, also available in the sklearn.preprocessing module, scales features using statistics that are robust to outliers: it removes the median and scales the data according to the interquartile range (IQR). This makes it particularly effective for datasets that contain outliers.
Here’s an example:
from sklearn.preprocessing import RobustScaler

data = [[19, 20000], [35, 120000], [26, 35000]]
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[-0.875 -0.3  ]
 [ 1.125  1.7  ]
 [ 0.     0.   ]]
In this code, RobustScaler mitigates the influence of outliers by centering on the median and scaling by the IQR, both of which are far less sensitive to extreme values than the mean and standard deviation. This matters for datasets where a handful of outliers should not be allowed to distort the scale of the remaining data.
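To see the difference an outlier makes, here is an illustrative comparison (not part of the original example) that scales a single feature containing one extreme value with both StandardScaler and RobustScaler:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme outlier (1000).
data = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(StandardScaler().fit_transform(data).ravel())
print(RobustScaler().fit_transform(data).ravel())

With StandardScaler, the outlier inflates the standard deviation so much that the four ordinary points are squashed into a narrow band near -0.5, while RobustScaler, working from the median (3.0) and the IQR, keeps them evenly spread and pushes only the outlier far away.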
Bonus One-Liner Method 5: Quick Rescaling with pandas
For a simple rescaling task, pandas lets you do it in one line with apply and a lambda function. This is handy when preprocessing is minimal and you don’t need a fitted transformer that can be reused on new data, so Scikit-learn isn’t required.
Here’s an example:
import pandas as pd

df = pd.DataFrame({'Age': [19, 35, 26], 'Salary': [20000, 120000, 35000]})
df_scaled = df.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
print(df_scaled)
Output:
      Age  Salary
0  0.0000    0.00
1  1.0000    1.00
2  0.4375    0.15
This code snippet uses pandas to apply a lambda function that performs min-max normalization to each column in the DataFrame. While this method doesn’t offer the learnable parameters and persistence of sklearn’s transformers, it’s a quick and convenient way to scale your data on the fly.
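The same pattern works for standardization. One caveat worth flagging (a general pandas fact, not specific to this example): pandas’ std() uses the sample standard deviation (ddof=1) by default, whereas StandardScaler uses the population standard deviation (ddof=0), so pass ddof=0 if you want the results to match:

import pandas as pd

df = pd.DataFrame({'Age': [19, 35, 26], 'Salary': [20000, 120000, 35000]})

# Z-score standardization in one line; ddof=0 matches StandardScaler.
df_standardized = df.apply(lambda x: (x - x.mean()) / x.std(ddof=0))
print(df_standardized)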
Summary/Discussion
- Method 1: StandardScaler. Well-suited for algorithms expecting data with a Gaussian distribution. Not robust to outliers.
- Method 2: MinMaxScaler. Binds data within a fixed range, preserving the shape of each feature’s distribution. Can be skewed by outliers.
- Method 3: MaxAbsScaler. Preserves zero entries in sparse datasets and scales data to the [-1, 1] range. Best for data already centered at zero; sensitive to outliers.
- Method 4: RobustScaler. Robust to outliers using median and IQR for scaling. Ideal for datasets with many outliers.
- Bonus Method 5: Pandas scaling. Useful for quick-and-dirty scaling. Lacks the advanced features of sklearn’s scalers.