Implementing Random Projection in Python with scikit-learn

💡 Problem Formulation: When working with high-dimensional data, it becomes challenging to visualize, store, and process such data efficiently. Random projection is a dimensionality reduction method that projects the original data onto a lower-dimensional space while approximately preserving the distances between points. This article explores how to perform random projection in Python using the scikit-learn library, transforming a high-dimensional dataset into a lower-dimensional one while keeping its pairwise distances as close as possible to those of the original data.

Method 1: Using GaussianRandomProjection

GaussianRandomProjection is a linear dimensionality reduction technique in the scikit-learn library that projects the data onto a lower-dimensional space using a random matrix whose entries are drawn from a Gaussian distribution. The strength of this method lies in its theoretical backing, which guarantees that distances between points are preserved within a controllable error.
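That guarantee comes from the Johnson-Lindenstrauss lemma: for any distortion 0 < eps < 1, projecting n points into at least 4·ln(n) / (eps²/2 − eps³/3) dimensions ensures, with high probability, that every squared pairwise distance is preserved within a factor of (1 ± eps), i.e. (1 − eps)·||u − v||² ≤ ||f(u) − f(v)||² ≤ (1 + eps)·||u − v||². This is the same bound evaluated by scikit-learn's johnson_lindenstrauss_min_dim helper, shown in Method 5.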

Here’s an example:

from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import load_digits

# Load example data
digits = load_digits()
X = digits.data

# Create a Gaussian Random Projector
transformer = GaussianRandomProjection(n_components=2)
X_projected = transformer.fit_transform(X)

# See the shape of the projected data
print(X_projected.shape)

The output:

(1797, 2)

This code snippet starts by importing the GaussianRandomProjection class and a sample dataset from scikit-learn. We then instantiate GaussianRandomProjection and use it to fit and transform the dataset, reducing it from 64 features to 2 while attempting to preserve its structure.
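To gauge how well distances survive, you can compare pairwise distances before and after the projection. The sketch below continues from the snippet above; note that n_components=2 is far below what the JL lemma requires for 1797 points, so noticeable distortion is expected at this extreme setting.

import numpy as np
from sklearn.metrics import pairwise_distances

# Compare pairwise distances before and after projection on a small sample
d_original = pairwise_distances(X[:100])
d_projected = pairwise_distances(X_projected[:100])

# Ratio of projected to original distances, skipping zero distances
mask = d_original > 0
ratios = d_projected[mask] / d_original[mask]
print(ratios.mean(), ratios.std())

A mean ratio roughly near 1 with a small spread indicates good preservation; at two components the spread will be large.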

Method 2: Using SparseRandomProjection

SparseRandomProjection is an alternative to GaussianRandomProjection that uses a sparse random matrix, which leads to faster computation and lower memory usage while still approximately preserving pairwise distances. This approach is particularly useful for very large or very high-dimensional datasets.

Here’s an example:

from sklearn.random_projection import SparseRandomProjection
from sklearn.datasets import load_digits

# Load dataset
digits = load_digits()
X = digits.data

# Create a Sparse Random Projector
transformer = SparseRandomProjection(n_components=3)
X_projected = transformer.fit_transform(X)

# Output the projected data shape
print(X_projected.shape)

The output:

(1797, 3)

This code snippet demonstrates the usage of SparseRandomProjection from the scikit-learn library. By specifying n_components=3, it reduces the feature space from the original 64 dimensions down to three and prints the new shape of the dataset. Note that it is the projection matrix that is sparse; the projected output itself is a dense array.
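Continuing from the snippet above, you can inspect the fitted projection matrix to see where the memory savings come from. By default scikit-learn picks a density of 1/sqrt(n_features) (following Ping Li et al.), so most entries of the matrix are exactly zero:

import numpy as np

# The projection matrix is stored as a SciPy sparse matrix
components = transformer.components_
print(components.shape)                             # (3, 64)
print(transformer.density_)                         # density chosen by the 'auto' heuristic
print(components.nnz / np.prod(components.shape))   # observed fraction of non-zero entries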

Method 3: Preserving Class Distribution

Random projection itself is label-agnostic, so when it is used as a preprocessing step for a classifier, it pays to combine it with steps such as stratified sampling that keep the class distribution consistent between training and test sets. This can be crucial for maintaining accuracy in predictive modeling.

Here’s an example:

from sklearn.random_projection import GaussianRandomProjection
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the dataset
digits = load_digits()
X = digits.data
y = digits.target

# Stratified sampling to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y
)

# Apply Gaussian Random Projection
transformer = GaussianRandomProjection(n_components=2)
X_train_projected = transformer.fit_transform(X_train)
X_test_projected = transformer.transform(X_test)

# Output shapes of the projected data
print(X_train_projected.shape)
print(X_test_projected.shape)

The output:

(1437, 2)
(360, 2)

The snippet begins with a stratified split of the dataset so that class proportions are maintained between the training and test sets. GaussianRandomProjection is then fitted on the training data and the same learned projection is applied to the test data, reducing the feature space while approximately preserving the relative distances between points. The shapes of the projected training and test sets are printed to verify the result.
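A simple way to check that class structure survives the projection is to fit a basic classifier on the projected features. The sketch below continues from the code above; with only two components, expect accuracy well below what the full 64 features would give, yet still far above the 10% chance level for ten digit classes.

from sklearn.neighbors import KNeighborsClassifier

# Train on the projected training set and score on the projected test set
clf = KNeighborsClassifier()
clf.fit(X_train_projected, y_train)
print(clf.score(X_test_projected, y_test))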

Method 4: Hyperparameter Tuning of Random Projection

Hyperparameter tuning can significantly influence how useful random projection is as a preprocessing step. Scikit-learn's random projection classes expose parameters such as n_components (the dimension of the projected space) and eps (which controls the embedding quality, but only takes effect when n_components='auto'). Note that the projector is unsupervised and has no score method, so GridSearchCV cannot evaluate it in isolation; the idiomatic approach is to place it in a Pipeline with a supervised estimator and tune against cross-validated accuracy.

Here’s an example:

from sklearn.random_projection import SparseRandomProjection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification

# Create a synthetic dataset
X, y = make_classification(n_samples=500, n_features=1000, random_state=42)

# Random projection is unsupervised and has no score method, so it is
# tuned inside a pipeline with a downstream classifier
pipeline = Pipeline([
    ('projection', SparseRandomProjection(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# eps only takes effect when n_components='auto', so tune n_components directly
param_grid = {'projection__n_components': [5, 10, 50, 100]}
gsearch = GridSearchCV(pipeline, param_grid, cv=5)
gsearch.fit(X, y)

print(f"Best parameters found: {gsearch.best_params_}")

A typical output:

Best parameters found: {'projection__n_components': 50}

By wrapping SparseRandomProjection in a Pipeline with a classifier and handing the pipeline to GridSearchCV, the snippet selects the number of components that yields the best cross-validated classification accuracy. The exact winner depends on the data and the CV splits.
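Once the search is finished, the fitted projection step can be pulled out of the best pipeline and reused on its own, for example to project data without running the classifier:

# Retrieve the fitted projection step from the best pipeline
best_projection = gsearch.best_estimator_.named_steps['projection']
X_projected = best_projection.transform(X)
print(X_projected.shape)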

Bonus One-Liner Method 5: Quick Random Projection

For a quick sanity check before projecting, without tuning hyperparameters for a specific dataset, a one-liner can tell you how low a target dimension you can safely choose.

Here’s an example:

from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Choose 'eps' and find the minimum dimension 'n_components' that satisfies the JL lemma
n_components = johnson_lindenstrauss_min_dim(300, eps=0.1)
print(f"Minimum dimensions required: {n_components}")

The output:

Minimum dimensions required: 4888

This quick method uses the johnson_lindenstrauss_min_dim function to compute the minimum number of dimensions required to satisfy the Johnson-Lindenstrauss lemma, given the number of samples and the desired maximum distortion (the eps parameter).
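In practice you rarely need to call the helper yourself: passing n_components='auto' to a projector makes it evaluate the same bound at fit time. A minimal sketch on synthetic data (the original space must have at least as many features as the computed minimum, or fitting raises an error):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Hypothetical high-dimensional data: 300 samples, 10,000 features
rng = np.random.RandomState(42)
X = rng.rand(300, 10000)

# With n_components='auto', the target dimension is derived from the JL bound for eps
transformer = GaussianRandomProjection(n_components='auto', eps=0.1)
X_projected = transformer.fit_transform(X)
print(X_projected.shape)  # the second dimension equals the JL minimum for 300 samples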

Summary/Discussion

  • Method 1: GaussianRandomProjection. Offers a balance between efficiency and accuracy, but its dense Gaussian projection matrix can make it more computationally and memory intensive on very large datasets.
  • Method 2: SparseRandomProjection. This is faster and more memory-efficient than GaussianRandomProjection, best suited for very high-dimensional datasets or sparse data.
  • Method 3: Preserving Class Distribution. While not a different method of random projection, using stratified sampling ensures that the underlying class distribution is preserved, beneficial for maintaining model performance.
  • Method 4: Hyperparameter Tuning of Random Projection. Enables optimization of the random projection method for specific applications or dataset requirements.
  • Bonus One-Liner Method 5: Quick Random Projection. Provides a fast heuristic to decide the dimensions for random projection while considering the desired projection quality.