💡 Problem Formulation: When working with the sklearn digits dataset in machine learning, researchers and practitioners often face the challenge of reducing dimensionality. For visualization or to improve computational efficiency, one may need to reduce the dataset from its original 64 features to just 2 or 3 features while preserving as much information as possible. This article discusses how to perform this transformation in Python. For example, the goal is to convert a 64-dimensional digit image into a 2D or 3D point.
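Before applying any of the methods below, it helps to confirm the shape of the input. The following minimal sketch, assuming a standard scikit-learn installation, loads the digits dataset and prints its dimensions: 1,797 samples with 64 features each, one per pixel of the 8x8 digit images.
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (1797, 64): 1797 samples, 64 pixel features
print(digits.images.shape)  # (1797, 8, 8): the same data as 8x8 images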
Method 1: Principal Component Analysis (PCA)
PCA is a statistical method that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. This method is widely used for dimensionality reduction in tasks like face recognition and image compression, where data visualization in a low-dimensional space is required.
Here’s an example:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)  # For 2 features
pca_digits = pca.fit_transform(digits.data)
print(pca_digits)
The output contains the transformed dataset with 2 features for each sample.
This code snippet first loads the digits dataset and applies PCA to reduce its dimensionality to 2 features. The PCA class from sklearn.decomposition is initialized with the desired number of components and then fitted and transformed on the digits data.
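To gauge how much information the 2D projection retains, you can inspect the explained variance of the fitted components. This is a minimal sketch building on the example above; the exact ratios depend on the data, but together the first two components capture only part of the total variance of the digits.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
pca = PCA(n_components=2)
pca_digits = pca.fit_transform(digits.data)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
# Total variance preserved by the 2D projection
print(pca.explained_variance_ratio_.sum())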
Method 2: t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear technique for dimensionality reduction that is particularly well suited for visualizing high-dimensional datasets. It works by converting similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
Here’s an example:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
tsne = TSNE(n_components=3)  # For 3 features
tsne_digits = tsne.fit_transform(digits.data)
print(tsne_digits)
The output contains the transformed dataset with 3 features for each sample.
By leveraging the TSNE class from sklearn.manifold, this snippet reduces the original digits data to a new dataset with 3 features. The n_components parameter is set to 3, specifying the desired output dimensions.
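t-SNE results vary with its hyperparameters, most notably perplexity, and with the random initialization. The sketch below shows one way to make a run reproducible; the specific values are illustrative rather than prescriptive.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
# perplexity balances local versus global structure; random_state fixes the seed
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
tsne_digits = tsne.fit_transform(digits.data)
print(tsne_digits.shape)  # (1797, 2)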
Method 3: Linear Discriminant Analysis (LDA)
LDA is a method used in statistics, pattern recognition, and machine learning to find the linear combination of features that best separates two or more classes of events or objects. The resulting combination may be used as a linear classifier or for dimensionality reduction before further classification.
Here’s an example:
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

digits = load_digits()
lda = LDA(n_components=2)  # For 2 features
lda_digits = lda.fit_transform(digits.data, digits.target)
print(lda_digits)
The output contains the transformed dataset with 2 features for each sample.
This snippet uses LDA to find a projection of the dataset that maximizes class separability. The transformed dataset, lda_digits, now has 2 features. This method requires knowledge of class labels, provided by digits.target.
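Because LDA is supervised, its output dimensionality is capped at min(n_classes - 1, n_features); with the ten digit classes that means at most 9 components. A minimal sketch illustrating the upper limit:
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

digits = load_digits()
# 10 classes allow at most 9 discriminant components
lda = LDA(n_components=9)
lda_digits = lda.fit_transform(digits.data, digits.target)
print(lda_digits.shape)  # (1797, 9)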
Method 4: Isomap Embedding
Isomap is a non-linear dimensionality reduction method based on spectral theory that tries to preserve geodesic distances between points in the low-dimensional embedding. It is typically used for datasets that lie on an underlying geometric manifold.
Here’s an example:
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
isomap = Isomap(n_components=2)
isomap_digits = isomap.fit_transform(digits.data)
print(isomap_digits)
The output contains the transformed dataset with 2 features for each sample.
In this example, Isomap is used to reduce the original 64-dimensional data to a 2-dimensional dataset, preserving the relative distances as much as possible. Isomap is initialized with 2 components and then applied to the digits data.
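Isomap approximates geodesic distances along a nearest-neighbor graph, so the n_neighbors parameter controls how that graph is built. A brief sketch with an illustrative value (the scikit-learn default is 5):
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
# A larger neighborhood smooths the manifold estimate but can blur fine structure
isomap = Isomap(n_components=2, n_neighbors=10)
isomap_digits = isomap.fit_transform(digits.data)
print(isomap_digits.shape)  # (1797, 2)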
Bonus One-Liner Method 5: Random Projection
Random Projection is a simple and computationally efficient method for reducing dimensionality by projecting the original data onto a randomly selected subspace of lower dimensionality.
Here’s an example:
from sklearn.datasets import load_digits
from sklearn.random_projection import GaussianRandomProjection

digits = load_digits()
grp = GaussianRandomProjection(n_components=3)
grp_digits = grp.fit_transform(digits.data)
print(grp_digits)
The output contains the transformed dataset with 3 features for each sample.
This concise example uses GaussianRandomProjection from sklearn's random_projection module to reduce the dimensionality of the digits dataset to 3. It projects the data onto a lower-dimensional subspace using a random matrix with Gaussian-distributed entries.
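The distance-preservation guarantee behind random projection comes from the Johnson-Lindenstrauss lemma, and scikit-learn provides a helper to estimate how many dimensions are needed for a given distortion. A minimal sketch (eps=0.5 is an illustrative, fairly loose tolerance); the result is far larger than 3, which shows that a 3-component projection carries no formal guarantee for this dataset.
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum number of components needed to keep pairwise distances
# within a factor of (1 +/- eps) for 1797 samples
print(johnson_lindenstrauss_min_dim(n_samples=1797, eps=0.5))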
Summary/Discussion
- Method 1: PCA. Widely used for linear dimensionality reduction. Great for preserving global structure. May not capture non-linear patterns.
- Method 2: t-SNE. Excellent for visualizing high-dimensional data in lower dimensions. Captures non-linear relationships. Can be computationally expensive and sensitive to hyperparameters.
- Method 3: LDA. Maximizes class separability. Good for data with known labels. Assumes data is normally distributed and classes have identical covariance matrices.
- Method 4: Isomap Embedding. Maintains geodesic distances. Good for manifold learning. Can be slower than other methods for large datasets.
- Bonus Method 5: Random Projection. Fast and simple. Theoretical guarantees on distance preservation. Randomness can lead to variability in results.