5 Best Ways to Create a Sample Dataset Using Python Scikit-Learn

💡 Problem Formulation: When developing machine learning models, having a versatile sample dataset is crucial for testing and training purposes. In this article, we’ll learn how to quickly generate such datasets using Python’s Scikit-Learn library. For instance, we may require a dataset with features following a normal distribution and a categorical target for classification problems.

Method 1: make_classification

The make_classification function is a versatile tool for creating synthetic classification datasets. It can mimic complex data structures by letting users specify the number of features, classes, clusters per class, and much more.

Here’s an example:

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=10, n_classes=2)
print('Sample features:\n', X[:5])
print('Sample labels:\n', y[:5])

Output:

Sample features:
 [[ 2.63063428 -0.42724472 ... -1.20604538]
 ...
 [ 0.18275291  1.743515  ...  0.67128767]]
Sample labels:
 [1 0 1 1 0]

This code generates a dataset with 100 samples, each having 10 features and labeled with one of 2 classes. Inspecting the first five rows, we see that the synthetic dataset indeed has the structure of a typical classification dataset: continuous-valued features and binary labels.
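Beyond the basic call, make_classification exposes parameters for class balance and feature informativeness. The following sketch (the parameter values here are illustrative choices, not defaults) creates an imbalanced three-class problem with a fixed random seed for reproducibility:

```python
from sklearn.datasets import make_classification
import numpy as np

# Imbalanced 3-class problem: 5 informative and 1 redundant feature
# out of 8, with roughly 70/20/10 class proportions.
X, y = make_classification(
    n_samples=1000,
    n_features=8,
    n_informative=5,
    n_redundant=1,
    n_classes=3,
    n_clusters_per_class=1,
    weights=[0.7, 0.2, 0.1],  # approximate class proportions
    class_sep=1.5,            # larger values make classes easier to separate
    random_state=42,
)
print('Shape:', X.shape)
print('Class counts:', np.bincount(y))
```

Note that the class counts only approximately match the requested weights, because a small fraction of labels is randomly flipped by default (flip_y=0.01).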

Method 2: make_regression

For regression models, the make_regression function creates a dataset with a linear relationship between features and target. It can be tailored with a specified number of features, a noise level, and, optionally, the underlying coefficients.

Here’s an example:

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
print('Sample features:\n', X[:3])
print('Sample target:\n', y[:3])

Output:

Sample features:
 [[ 0.00620588 -1.21263903]
 [ 0.08827347 -1.76464177]
 [ 0.58608018  0.4621307 ]]
Sample target:
 [ 10.25630865 -17.41977562  21.94721037]

The output here is a simple regression problem with 100 examples and 2 features, with slight noise added to make the data more realistic.
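Passing coef=True makes make_regression also return the ground-truth coefficients of the underlying linear model, which is handy for sanity-checking an estimator. A short sketch (sample size, noise level, and feature counts are illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import numpy as np

# Ask for the ground-truth coefficients with coef=True; only 2 of the
# 3 features are informative, so one true coefficient is exactly 0.
X, y, coef = make_regression(
    n_samples=200, n_features=3, n_informative=2,
    noise=5.0, coef=True, random_state=0,
)

# A plain linear model should approximately recover the true coefficients.
model = LinearRegression().fit(X, y)
print('True coefficients:     ', coef)
print('Estimated coefficients:', model.coef_)
```

Comparing the two printed vectors shows how closely the fit recovers the generating process despite the added noise.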

Method 3: make_blobs

The make_blobs function is excellent for testing clustering algorithms: it produces isotropic Gaussian blobs. You can control the centers, the cluster standard deviation, and the number of features.

Here’s an example:

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=2, centers=3)
print('Sample features:\n', X[:3])
print('Sample labels (cluster assignment):\n', y[:3])

Output:

Sample features:
 [[-7.640995    4.33320628]
 [ 0.37444344  9.62738234]
 [ 1.76594649 -2.56155317]]
Sample labels (cluster assignment):
 [2 1 0]

The generated dataset contains 100 samples distributed across three blobs, suitable for testing clustering algorithms. The labels indicate the blob or cluster each sample belongs to.
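The centers and spreads don't have to be random: make_blobs accepts explicit center coordinates, per-cluster standard deviations, and (since scikit-learn 0.20) per-cluster sample counts. A sketch with illustrative values:

```python
from sklearn.datasets import make_blobs
import numpy as np

# Three clusters at fixed centers with different sizes and spreads;
# cluster_std accepts one value per center.
X, y = make_blobs(
    n_samples=[50, 100, 150],        # samples per cluster
    centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=[0.5, 1.0, 2.0],
    random_state=7,
)
print('Shape:', X.shape)
print('Samples per cluster:', np.bincount(y))
```

This kind of controlled setup is useful for probing how a clustering algorithm handles unequal cluster sizes and variances.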

Method 4: make_circles and make_moons

For datasets that require more complex decision boundaries, make_circles and make_moons generate circular and crescent-shaped data structures, respectively. They are ideal for testing non-linear models.

Here’s an example:

from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.1)
print('Sample features:\n', X[:3])
print('Sample labels:\n', y[:3])

Output:

Sample features:
 [[ 1.99138934 -0.16531331]
 [ 0.16672816  0.71874373]
 [ 1.60101714 -0.4103524 ]]
Sample labels:
 [1 1 1]

This code produces 100 samples where points form two interlocking crescent shapes. It is a good challenge for models that need to capture complex patterns.
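The companion function make_circles works the same way but produces two concentric circles; its factor parameter sets the inner circle's radius relative to the outer one. A sketch with illustrative noise and factor values:

```python
from sklearn.datasets import make_circles

# Two concentric circles; factor=0.5 puts the inner circle at half
# the radius of the outer one.
X, y = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)
print('Sample features:\n', X[:3])
print('Sample labels:\n', y[:3])
```

Like the moons dataset, this one cannot be separated by a straight line, making it a classic test case for kernel methods and neural networks.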

Bonus One-Liner Method 5: make_swiss_roll

Lastly, make_swiss_roll is a quick way to generate a 3D Swiss roll dataset for manifold learning or visualization.

Here’s an example:

from sklearn.datasets import make_swiss_roll

X, _ = make_swiss_roll(n_samples=100)
print('Sample features (first three rows):\n', X[:3])

Output:

Sample features (first three rows):
 [[  9.43615765   4.79308055 -11.43300628]
 ...
 [ 12.47722557 -11.03295446  10.28803112]]

Executing this snippet generates a Swiss-roll-shaped dataset comprising 100 three-dimensional points, demonstrating the flexibility of Scikit-Learn’s dataset generation functions.
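The second return value, discarded as _ above, is each point's intrinsic position along the roll; manifold-learning plots typically use it as the point color. A sketch with an illustrative sample size and noise level:

```python
from sklearn.datasets import make_swiss_roll

# t holds each point's univariate position along the roll, i.e. the
# 1D coordinate a manifold-learning method should ideally recover.
X, t = make_swiss_roll(n_samples=500, noise=0.1, random_state=0)
print('Points:', X.shape)
print('Unrolled coordinate:', t.shape)
```

Keeping t around lets you check visually whether an embedding (e.g. from Isomap or LLE) preserves the ordering along the manifold.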

Summary/Discussion

  • Method 1: make_classification. Perfect for simulating complex datasets with multiple classes. It may be overly complicated for simpler binary classification tasks.
  • Method 2: make_regression. Ideal for creating linear relationships with controllable noise. Simpler than real-world datasets, so may not fully test model robustness.
  • Method 3: make_blobs. Suitable for clustering algorithm evaluation. The blobs may not capture non-convex cluster shapes common in real data.
  • Method 4: make_circles and make_moons. Great for non-linear data; however, this is an artificial scenario unlikely to be a perfect representation of real-world data complexity.
  • Method 5: make_swiss_roll. Provides a multi-dimensional structure ideal for manifold learning but may not be practical for standard supervised learning tasks.