💡 Problem Formulation: High-dimensional data is difficult to visualize, store, and process efficiently. Random projection is a dimensionality reduction method that projects the original data onto a lower-dimensional space while approximately preserving the pairwise distances between points. This article explores how to perform random projection in Python using the scikit-learn library, transforming a high-dimensional dataset into a lower-dimensional one while keeping its pairwise distances as close as possible to those of the original.
Method 1: Using GaussianRandomProjection
GaussianRandomProjection is a linear dimensionality reduction technique in the scikit-learn library that projects the data to a lower dimension using a matrix whose entries are drawn from a Gaussian distribution. The strength of this method lies in its theoretical backing, the Johnson-Lindenstrauss lemma, which guarantees that pairwise distances between points are preserved up to a controllable distortion.
Here’s an example:
from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import load_digits

# Load example data
digits = load_digits()
X = digits.data

# Create a Gaussian Random Projector
transformer = GaussianRandomProjection(n_components=2)
X_projected = transformer.fit_transform(X)

# See the shape of the projected data
print(X_projected.shape)
The output:
(1797, 2)
This code snippet starts by importing the GaussianRandomProjection class and a sample dataset from scikit-learn. We then instantiate GaussianRandomProjection and use it to fit and transform the dataset into two dimensions, reducing the dataset from 64 features to 2 while attempting to preserve its structure.
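To see how well distances survive the projection, a quick check can compare pairwise distances before and after. The following is a minimal sketch that continues from the snippet above (it assumes X and X_projected are still in scope); note that with only 2 components the distortion will be large, far beyond what the JL lemma would tolerate:

import numpy as np
from sklearn.metrics import pairwise_distances

# Pairwise distances before and after projection (first 100 samples)
d_orig = pairwise_distances(X[:100])
d_proj = pairwise_distances(X_projected[:100])

# Ratio of projected to original distances, ignoring the zero diagonal
mask = d_orig > 0
ratios = d_proj[mask] / d_orig[mask]
print(f"mean distance ratio: {ratios.mean():.3f}, std: {ratios.std():.3f}")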
Method 2: Using SparseRandomProjection
SparseRandomProjection is an alternative to GaussianRandomProjection that uses a sparse random matrix, which leads to faster computation and lower memory usage while still preserving pairwise distances to a similar degree. This approach is particularly useful for very large or very high-dimensional datasets.
Here’s an example:
from sklearn.random_projection import SparseRandomProjection
from sklearn.datasets import load_digits

# Load dataset
digits = load_digits()
X = digits.data

# Create a Sparse Random Projector
transformer = SparseRandomProjection(n_components=3)
X_projected = transformer.fit_transform(X)

# Output the projected data shape
print(X_projected.shape)
The output:
(1797, 3)
This code snippet demonstrates the usage of SparseRandomProjection from the scikit-learn library. By specifying n_components=3, it reduces the feature space from the original 64 dimensions down to three (the sparsity lives in the projection matrix, not in the output), and prints the new shape of the dataset.
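Continuing from the snippet above, a minimal sketch for inspecting the fitted projection matrix: components_ is stored as a SciPy sparse matrix, and density_ reports the fraction of non-zero entries (with the default density='auto', roughly 1/sqrt(n_features)):

# Inspect the fitted sparse projection matrix
print(type(transformer.components_))
print(f"density: {transformer.density_:.4f}")
print(f"non-zeros: {transformer.components_.nnz} out of "
      f"{transformer.components_.shape[0] * transformer.components_.shape[1]}")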
Method 3: Preserving Class Distribution
To ensure that the train/test split used alongside random projection maintains the class distribution, one can combine the dimensionality reduction with stratified sampling. This can be crucial for maintaining accuracy in predictive modeling when random projection is applied as a preprocessing step.
Here’s an example:
from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the dataset
digits = load_digits()
X = digits.data
y = digits.target

# Stratified sampling to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Apply Gaussian Random Projection
transformer = GaussianRandomProjection(n_components=2)
X_train_projected = transformer.fit_transform(X_train)
X_test_projected = transformer.transform(X_test)

# Output shapes of the projected data
print(X_train_projected.shape)
print(X_test_projected.shape)
The output:
(1437, 2)
(360, 2)
The snippet begins with a stratified split of the dataset to ensure class proportions are maintained between the training and test sets. GaussianRandomProjection is then fitted on the training set and applied to both sets, reducing the feature space while approximately maintaining the relative distances between points. The shapes of the projected training and test sets are printed to verify the result.
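To confirm that the stratification worked, a short check (continuing from the snippet above) can compare the class frequencies in the two splits:

import numpy as np

# Class proportions should be nearly identical in both splits
print(np.round(np.bincount(y_train) / len(y_train), 3))
print(np.round(np.bincount(y_test) / len(y_test), 3))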
Method 4: Hyperparameter Tuning of Random Projection
Hyperparameter tuning can significantly influence the performance of random projection. Scikit-learn's random projection classes expose parameters that can be adjusted, such as eps (controlling the quality of the embedding) and n_components (specifying the dimension of the projected space directly). Note that eps is only consulted when n_components='auto', so the two are alternatives rather than independent knobs.
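As a brief aside before tuning, here is a minimal sketch of eps in action on synthetic data: with n_components='auto', the transformer derives the output dimension from the JL lemma for the given eps, and a smaller eps (a tighter distance guarantee) demands more components. The exact shape printed depends on n_samples and eps:

from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import make_classification

# With n_components='auto', the dimension is chosen from the JL lemma
X, _ = make_classification(n_samples=500, n_features=10000, random_state=42)
transformer = GaussianRandomProjection(n_components='auto', eps=0.5, random_state=42)
X_projected = transformer.fit_transform(X)
print(X_projected.shape)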
Here’s an example:
from sklearn.random_projection import SparseRandomProjection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Create a synthetic high-dimensional dataset
X, y = make_classification(n_samples=500, n_features=1000, random_state=42)

# A random projection has no score of its own, so we tune it inside a
# pipeline ending in a classifier and let GridSearchCV optimize the
# classifier's cross-validated accuracy. (eps is ignored when
# n_components is set explicitly, so only n_components is tuned here.)
pipe = Pipeline([
    ('projection', SparseRandomProjection(random_state=42)),
    ('classifier', KNeighborsClassifier()),
])
param_grid = {'projection__n_components': [5, 10, 50, 100]}
gsearch = GridSearchCV(pipe, param_grid, cv=5)
gsearch.fit(X, y)

# Best pipeline and its projection settings
best_pipeline = gsearch.best_estimator_
print(f"Best parameters found: {gsearch.best_params_}")
A possible output (the selected value depends on the data and the CV folds):
Best parameters found: {'projection__n_components': 50}
By wrapping SparseRandomProjection in a pipeline and employing GridSearchCV over its n_components, the code snippet demonstrates how to optimize the random projection for the best downstream result, as measured by the cross-validated accuracy of the classifier that follows it.
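A short follow-up, continuing from the snippet above, can pull the fitted projection out of the winning pipeline and reuse it on its own:

# Extract and reuse the winning projection step
best_projection = best_pipeline.named_steps['projection']
X_small = best_projection.transform(X)
print(X_small.shape)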
Bonus One-Liner Method 5: Quick Random Projection
For rapidly deciding how far the dimensionality can safely be reduced, without tuning anything by hand or optimizing for a specific dataset, scikit-learn offers a one-liner that computes a target dimension straight from the Johnson-Lindenstrauss lemma.
Here’s an example:
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Choose 'eps' and find the minimum dimension 'n_components' to guarantee the JL lemma
n_components = johnson_lindenstrauss_min_dim(300, eps=0.1)
print(f"Minimum dimensions required: {n_components}")
The output:
Minimum dimensions required: 4888
This quick method uses the johnson_lindenstrauss_min_dim function to calculate the minimum number of dimensions required to respect the Johnson-Lindenstrauss lemma, given the number of samples and the desired quality of the projection (the eps parameter). Note that the bound is conservative: for just 300 samples at eps=0.1 it already demands thousands of components.
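A hypothetical follow-up, continuing from the snippet above with synthetic data that has more features than the computed bound, simply feeds the result into a projector (the sparse variant is used here to keep memory modest):

import numpy as np
from sklearn.random_projection import SparseRandomProjection

# Project 300 samples of 10000-dimensional data down to the JL-safe dimension
X = np.random.RandomState(42).rand(300, 10000)
transformer = SparseRandomProjection(n_components=n_components)
X_projected = transformer.fit_transform(X)
print(X_projected.shape)  # (300, n_components)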
Summary/Discussion
- Method 1: GaussianRandomProjection. Offers a balance between simplicity and strong theoretical guarantees. It can be more computationally taxing for very large datasets because its projection matrix is dense.
- Method 2: SparseRandomProjection. This is faster and more memory-efficient than GaussianRandomProjection, best suited for very high-dimensional datasets or sparse data.
- Method 3: Preserving Class Distribution. While not a different method of random projection, using stratified sampling ensures that the underlying class distribution is preserved, beneficial for maintaining model performance.
- Method 4: Hyperparameter Tuning of Random Projection. Enables optimization of the random projection method for specific applications or dataset requirements.
- Bonus One-Liner Method 5: Quick Random Projection. Provides a fast heuristic to decide the dimensions for random projection while considering the desired projection quality.