5 Best Ways to Perform Dimensionality Reduction Using Python’s Scikit-Learn

💡 Problem Formulation: In machine learning, dealing with high-dimensional data can be problematic due to increased computational costs and the curse of dimensionality. Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the meaningful information as possible. For instance, you might have a dataset with 100 features (input) and wish to simplify it to 10 features (desired output) without losing critical patterns that affect predictions.
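
As a minimal sketch of the idea (using synthetic data and PCA purely for illustration; the feature counts are the hypothetical 100 and 10 from above), the snippet below compresses a 100-feature matrix down to 10 components:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic dataset with 500 samples and 100 input features
X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# Reduce the feature space from 100 to 10 dimensions
reducer = PCA(n_components=10)
X_reduced = reducer.fit_transform(X)

print(X.shape)          # (500, 100)
print(X_reduced.shape)  # (500, 10)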

Method 1: Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The PCA class from scikit-learn's decomposition module achieves this in Python.

Here’s an example:

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create PCA object and fit transform the dataset
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)

# Show the reduced data
print(X_pca[:5])

Output:

[[-2.68412563  0.31939725]
 [-2.71414169 -0.17700123]
 [-2.88899057 -0.14494943]
 [-2.74534286 -0.31829898]
 [-2.72871654  0.32675451]]

The code above imports PCA from scikit-learn, loads the Iris dataset, which consists of 4 features, and reduces its dimensionality to 2 principal components. It then prints the first 5 rows of the transformed data. The resulting output represents the dataset in a 2D feature space, ready for further analysis or visualization.
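
A common follow-up, sketched below reusing the fitted pca object, is to check how much of the original variance the two components retain via the explained_variance_ratio_ attribute:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)        # roughly [0.92, 0.05] for the Iris data
print(pca.explained_variance_ratio_.sum())  # roughly 0.98 retained with two components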

Method 2: t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton. It is a non-linear technique particularly well-suited for embedding high-dimensional data into a space of two or three dimensions. The TSNE class from scikit-learn's manifold module provides an easy way to perform t-SNE.

Here’s an example:

from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

# Load digits dataset
digits = load_digits()
X_digits = digits.data

# Perform t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_digits)

# Show the embedded data
print(X_tsne[:5])

Output:

[[ -5.281551 -28.952768],
 [ -26.105896  -68.06932 ],
 [ -42.503582   35.58039 ],
 [  55.55176   -31.489283],
 [   6.563     -44.572716]]

This example demonstrates the use of t-SNE on the digits dataset. The TSNE() object is created with a target dimensionality of two. After fit_transform(), it outputs a 2D representation of each high-dimensional digit data point, ready for visualization.
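
Since t-SNE is mainly a visualization tool, the embedding is usually plotted rather than fed into a model. A minimal sketch, assuming matplotlib is installed and reusing X_tsne and digits from above, colours each point by its digit label:

import matplotlib.pyplot as plt

# Scatter plot of the 2D embedding, coloured by digit class
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar(label='digit')
plt.title('t-SNE embedding of the digits dataset')
plt.show()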

Method 3: Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is another linear technique for dimensionality reduction; it differs from PCA by attempting to maximize separability among known categories. The LinearDiscriminantAnalysis class (imported below as LDA) can be used in scikit-learn to reduce the dimensions of a dataset while retaining the class-discriminatory information.

Here’s an example:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine

# Load the wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# Perform LDA
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_wine, y_wine)

# Show the reduced data
print(X_lda[:5])

Output:

[[ 3.31675081 -1.44346263],
 [ 2.20946492  0.33339289],
 [ 2.51674015 -1.0311513 ],
 [ 3.75706561 -2.75637191],
 [ 1.00890849 -0.86983082]]

LDA is applied to the wine dataset, which consists of 13 different chemical constituents found in three types of wines. The goal is to reduce the dimensions such that the separation between the three wine classes is maximized. The output shows the new two-dimensional space in which the classes are expected to be more easily separable.
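
Because LDA uses the class labels, a quick way to confirm this separation is to plot the projection coloured by class. The following sketch assumes matplotlib is installed and reuses X_lda, y_wine, and wine from above:

import matplotlib.pyplot as plt

# Plot the 2D LDA projection, one colour per wine class
for label in range(3):
    mask = y_wine == label
    plt.scatter(X_lda[mask, 0], X_lda[mask, 1], label=wine.target_names[label])
plt.legend()
plt.title('LDA projection of the wine dataset')
plt.show()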

Method 4: Feature Agglomeration

Feature Agglomeration is a clustering-based approach to dimensionality reduction. It uses hierarchical clustering to group features that are similar, thus reducing the dimensionality of the data. The FeatureAgglomeration class from scikit-learn allows us to apply this method.

Here’s an example:

from sklearn.cluster import FeatureAgglomeration
import numpy as np

# Generate synthetic data with 100 samples and 10 features
X_synth = np.random.rand(100, 10)

# Perform Feature Agglomeration
agglo = FeatureAgglomeration(n_clusters=2)
X_reduced = agglo.fit_transform(X_synth)

# Show the reduced feature space
print(X_reduced[:5])

Output:

[[0.54832206 0.4309633 ],
 [0.47064537 0.39612613],
 [0.51579926 0.47075605],
 [0.66039718 0.70898438],
 [0.53302917 0.29039325]]

In this example, a synthetic dataset with 10 features is created using numpy, and FeatureAgglomeration() is used to cluster features into two groups. The output shows the new dataset represented by two composite features, created by averaging the features within each cluster.
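
Two attributes are worth inspecting in practice: labels_ shows which original feature was assigned to which cluster, and inverse_transform() maps the reduced data back to the original feature space, which gives a rough sense of what the grouping discards. A short sketch reusing the agglo and X_reduced objects from above:

# Cluster assignment for each of the 10 original features
print(agglo.labels_)

# Approximate reconstruction of the 10 original features from the 2 cluster means
X_restored = agglo.inverse_transform(X_reduced)
print(X_restored.shape)  # (100, 10)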

Bonus One-Liner Method 5: Truncated Singular Value Decomposition (Truncated SVD)

Truncated Singular Value Decomposition (Truncated SVD) is similar to PCA but is designed to work with sparse matrices. It is a linear dimensionality reduction technique that applies a truncated singular value decomposition to the data. The TruncatedSVD class from scikit-learn performs this method efficiently.

Here’s an example:

from sklearn.decomposition import TruncatedSVD
from sklearn.datasets import load_iris
from scipy.sparse import csr_matrix

# Create a sparse matrix from the Iris data (for example purposes only)
iris = load_iris()
X_sparse = csr_matrix(iris.data)

# Perform Truncated SVD
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X_sparse)

# Show the reduced data
print(X_svd[:5])

Output:

[[ 5.91220352  2.30344211],
 [ 5.57207573  1.97383104],
 [ 5.4464847   2.09653267],
 [ 5.43601924  1.87168085],
 [ 5.87506555  2.32934799]]

This one-liner method uses TruncatedSVD() to perform dimensionality reduction on the Iris dataset represented as a sparse matrix. It reduces the dimensionality to two components and shows the truncated representation, suitable for working with data that has many zeros.
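
In practice, Truncated SVD is most often paired with genuinely sparse inputs such as text term matrices, a combination known as latent semantic analysis. A brief sketch, assuming a small list of example documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs make good pets"]

# TfidfVectorizer returns a sparse matrix; TruncatedSVD reduces it without densifying
X_text = TfidfVectorizer().fit_transform(docs)
svd_text = TruncatedSVD(n_components=2)
X_lsa = svd_text.fit_transform(X_text)
print(X_lsa.shape)  # (3, 2)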

Summary/Discussion

  • Method 1: PCA. Strengths: Linear method, widely used, and fast. Weaknesses: Not suitable for nonlinear relationships; assumes the principal components are the directions of maximum variance in the data.
  • Method 2: t-SNE. Strengths: Captures complex non-linear relationships, good for visualization. Weaknesses: Computationally expensive, results not easily interpretable in higher dimensions.
  • Method 3: LDA. Strengths: Linear method, aimed at maximizing class separability. Weaknesses: Suited only for labeled data, can perform poorly if assumptions are violated.
  • Method 4: Feature Agglomeration. Strengths: Good for reducing high-dimensional data when features exhibit a hierarchical structure. Weaknesses: Results depend on the clustering structure and the chosen number of clusters.
  • Bonus Method 5: Truncated SVD. Strengths: Works with sparse data, fast. Weaknesses: Assumes linearity and may miss non-linear relationships.