💡 Problem Formulation: When working with diverse datasets, the varying ranges of features can negatively impact the performance of machine learning models. Data scaling is essential to ensure that no single feature dominates simply because of its units. For instance, consider a dataset where age ranges from 18 to 90 while salaries run into the hundreds of thousands. The objective is to transform this data so that all features have comparable scales, improving model accuracy.
Method 1: StandardScaler
StandardScaler is a scaling technique that subtracts the mean value from the feature and then scales it to unit variance. This results in a distribution with a standard deviation equal to 1 and a mean of 0. It is particularly useful in algorithms that assume all features are centered around zero and have variance in the same order.
Here’s an example:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[30, 100000], [35, 200000], [40, 300000]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
The code snippet creates a StandardScaler object, fits it to the data, and transforms the data. The output has zero mean and unit variance in each column, as expected from standard scaling.
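As a sanity check, the same numbers can be reproduced by hand with NumPy. This is a minimal sketch of the underlying formula, reusing the data array from above:

import numpy as np

data = np.array([[30, 100000], [35, 200000], [40, 300000]])

# StandardScaler computes z = (x - mean) / std per column,
# using the population standard deviation (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(manual)  # matches scaler.fit_transform(data)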
Method 2: MinMaxScaler
MinMaxScaler rescales the dataset so that all feature values fall in the range [0, 1], or another range [a, b] when specified via the feature_range parameter. This is achieved by subtracting the minimum value and dividing by the range (max minus min). It is a good option when the distribution is not Gaussian or when the standard deviation is very small.
Here’s an example:
from sklearn.preprocessing import MinMaxScaler

# data is the same array defined in Method 1
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
This snippet instantiates a MinMaxScaler, fits it to the data, and transforms the data. The results show each feature scaled to the range 0 to 1.
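If a range other than [0, 1] is required, the feature_range parameter sets the output bounds. Here is a short sketch scaling the same data to (-1, 1):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))
# [[-1. -1.]
#  [ 0.  0.]
#  [ 1.  1.]]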
Method 3: MaxAbsScaler
MaxAbsScaler scales each feature by its maximum absolute value, so that the maximal absolute value of each feature in the training set becomes 1.0. It does not shift or center the data, which makes it well suited to data that is already centered at zero, as well as to sparse data, though it remains sensitive to outliers.
Here’s an example:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[0.75       0.33333333]
 [0.875      0.66666667]
 [1.         1.        ]]
This code demonstrates MaxAbsScaler in action: each feature is divided by its maximum absolute value, mapping features into the range [-1, 1] (here [0, 1], since all values are positive).
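The transformation can also be written out directly, which makes it clear that no centering is involved; a minimal sketch, reusing the data array from Method 1:

import numpy as np

data = np.array([[30, 100000], [35, 200000], [40, 300000]])

# MaxAbsScaler divides each column by its maximum absolute value
manual = data / np.abs(data).max(axis=0)
print(manual)  # matches scaler.fit_transform(data)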
Method 4: RobustScaler
RobustScaler is useful when the data contains many outliers. It centers and scales using the median and the interquartile range (IQR), statistics that are robust to outliers. Because the centering and scaling statistics are based on percentiles, they are not influenced by a small number of very large marginal outliers.
Here’s an example:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]
The example uses RobustScaler to scale the dataset. Each feature is centered on its median and divided by its interquartile range, making the result far less sensitive to outliers.
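To see the robustness in action, here is a sketch that appends a hypothetical extreme salary to the data and rescales; the original rows keep a modest scale because the median and IQR barely move:

import numpy as np
from sklearn.preprocessing import RobustScaler

# hypothetical outlier row appended for illustration
data_out = np.array([[30, 100000],
                     [35, 200000],
                     [40, 300000],
                     [38, 5000000]])

print(RobustScaler().fit_transform(data_out))
# the outlier's salary maps far from the rest (about 3.65),
# while the original salaries stay close to zero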
Bonus One-Liner Method 5: Normalizer
Normalizer scales individual observations (rows) to have unit norm. This type of scaling matters when only the direction of each sample vector is relevant, for example when computing cosine similarities between vectorized texts or when preparing inputs for some neural networks.
Here’s an example:
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Output:
[[0.0003     1.        ]
 [0.000175   1.        ]
 [0.00013333 1.        ]]
The code example shows the Normalizer transforming the data so that each row vector has Euclidean (L2) length 1.
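A quick check confirms the unit length of each row, assuming scaled_data from the snippet above:

import numpy as np

# every row of the Normalizer output has Euclidean (L2) norm 1
print(np.linalg.norm(scaled_data, axis=1))  # [1. 1. 1.]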
Summary/Discussion
- Method 1: StandardScaler. Ideal for roughly normally distributed data. Centers each feature to zero mean and unit variance. Not robust to outliers.
- Method 2: MinMaxScaler. Best when the approximate bounds of the data are known. Scales data to a specific range. Sensitive to outliers.
- Method 3: MaxAbsScaler. Designed for data that is already centered at zero; preserves sparsity. Scales features into [-1, 1]. Sensitive to outliers.
- Method 4: RobustScaler. Most suitable for datasets with outliers. Uses the median and IQR for scaling, so a few extreme values barely influence the transform.
- Bonus Method 5: Normalizer. Useful for similarity comparisons and as input preparation for some models. Scales rows (samples) to unit norm rather than scaling columns (features).