Implementation:Rapidsai Cuml Digits Dataset
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Datasets |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
A hardcoded C++ header containing the Digits handwritten digit recognition dataset (1797 samples, 64 features) as static constant vectors for use in cuML unit tests and benchmarks.
Description
digits.h provides a copy of the scikit-learn Digits dataset (based on the UCI ML hand-written digits dataset) hardcoded directly in the header as constant data. Because the values are held in std::vector objects, they are initialized at program startup rather than at compile time. The dataset contains 1797 samples of 8x8 pixel images of handwritten digits (0-9), where each pixel intensity is stored as a float. Each sample is flattened into a 64-dimensional feature vector.
The data is stored in two std::vector<float> constants within the MLCommon::Datasets::Digits namespace:
- digits: A flattened vector of shape 1797 x 64 containing the feature matrix in row-major order. Pixel values range from 0.0 to 16.0.
- digits_y: A vector of 1797 integer target values (0 through 9) stored as floats.
Two additional constants provide the dataset dimensions:
- n_samples = 1797
- n_features = 64
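The row-major layout described above can be illustrated with a small indexing sketch. The helper `pixel_at` is hypothetical and not part of digits.h. It also assumes that the pixels of each 8x8 image are flattened row by row within a sample, which matches scikit-learn's convention but is not stated in the header itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of digits.h): returns the intensity of pixel
// (row, col) of one sample from a row-major, flattened feature matrix.
// Assumes each image is flattened row by row inside its sample.
inline float pixel_at(const std::vector<float>& data,
                      int n_features,   // 64 for Digits
                      int image_width,  // 8 for Digits (8x8 images)
                      int sample,
                      int row,
                      int col)
{
  assert(row >= 0 && row < n_features / image_width);
  assert(col >= 0 && col < image_width);
  // Row-major: all features of sample 0 come first, then sample 1, and so on.
  const std::size_t idx = static_cast<std::size_t>(sample) * n_features +
                          static_cast<std::size_t>(row) * image_width + col;
  assert(idx < data.size());
  return data[idx];
}
```

With the real header included, a call such as `pixel_at(MLCommon::Datasets::Digits::digits, 64, 8, i, r, c)` would read pixel (r, c) of sample i under the same assumptions.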
Usage
Use this dataset for unit testing multi-class classification algorithms (e.g., SVM, k-NN, random forest), dimensionality reduction methods (e.g., PCA, t-SNE, UMAP), and clustering algorithms. The 64-dimensional feature space makes it suitable for testing algorithms that operate on moderate-dimensional data.
Code Reference
Source Location
- Repository: Rapidsai_Cuml
- File: cpp/src_prims/datasets/digits.h
Signature
namespace MLCommon {
namespace Datasets {
namespace Digits {
const std::vector<float> digits = { /* 1797 * 64 = 115008 float values */ };
const std::vector<float> digits_y = { /* 1797 float values */ };
static const int n_samples = 1797;
static const int n_features = 64;
} // namespace Digits
} // namespace Datasets
} // namespace MLCommon
Import
#include <datasets/digits.h>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | -- | -- | This is a static data header with no runtime inputs. |
Outputs
| Name | Type | Description |
|---|---|---|
| digits | const std::vector<float> | Flattened feature matrix of shape (1797, 64), pixel values 0-16 |
| digits_y | const std::vector<float> | Multi-class target vector of length 1797 (values 0-9) |
| n_samples | static const int | Number of samples (1797) |
| n_features | static const int | Number of features (64) |
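A test that consumes this data may want to confirm the contract in the table above before using it. The following is a minimal sketch under that assumption. Both `matches_contract` and `labels_as_int` are hypothetical helpers, not functions provided by digits.h or cuML.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if X and y have the documented sizes, every pixel
// lies in [0, max_pixel], and every label is a whole number in [0, n_classes).
inline bool matches_contract(const std::vector<float>& X,
                             const std::vector<float>& y,
                             int n_samples,
                             int n_features,
                             float max_pixel,
                             int n_classes)
{
  if (X.size() != static_cast<std::size_t>(n_samples) * n_features) return false;
  if (y.size() != static_cast<std::size_t>(n_samples)) return false;
  for (float v : X) {
    if (!(v >= 0.0f && v <= max_pixel)) return false;
  }
  for (float v : y) {
    // Range check first so the cast below is always well defined.
    if (!(v >= 0.0f && v < static_cast<float>(n_classes))) return false;
    const int label = static_cast<int>(v);
    if (static_cast<float>(label) != v) return false;  // reject fractional labels
  }
  return true;
}

// Hypothetical helper: converts the float-encoded labels to int class ids.
inline std::vector<int> labels_as_int(const std::vector<float>& y)
{
  std::vector<int> out;
  out.reserve(y.size());
  for (float v : y) {
    assert(v >= 0.0f);
    out.push_back(static_cast<int>(v));
  }
  return out;
}
```

For the Digits data the expected arguments would be `n_samples = 1797`, `n_features = 64`, `max_pixel = 16.0f`, and `n_classes = 10`.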
Usage Examples
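#include <datasets/digits.h>
#include <raft/util/cudart_utils.hpp>  // raft::update_device (header path may differ between RAFT versions)
#include <rmm/device_uvector.hpp>      // rmm::device_uvector
// Access the dataset
const auto& X = MLCommon::Datasets::Digits::digits;
const auto& y = MLCommon::Datasets::Digits::digits_y;
int n = MLCommon::Datasets::Digits::n_samples; // 1797
int p = MLCommon::Datasets::Digits::n_features; // 64
// `stream` is assumed to be an existing CUDA stream owned by the caller (e.g., a test fixture)
// Copy to device memory for classification or dimensionality reduction
rmm::device_uvector<float> d_X(n * p, stream);
rmm::device_uvector<float> d_y(n, stream);
// The copies are asynchronous; synchronize `stream` before reading results back on the host
raft::update_device(d_X.data(), X.data(), n * p, stream);
raft::update_device(d_y.data(), y.data(), n, stream);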
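The feature matrix is stored in row-major order, while some consumers expect column-major input. Whether a particular cuML primitive does so should be checked against its own documentation. If a conversion is needed, it can be done on the host before the copy to the device. The sketch below uses a hypothetical helper, `to_col_major`, that is not part of digits.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts an n_rows x n_cols matrix from row-major to
// column-major storage on the host.
inline std::vector<float> to_col_major(const std::vector<float>& row_major,
                                       int n_rows,
                                       int n_cols)
{
  const std::size_t rows = static_cast<std::size_t>(n_rows);
  const std::size_t cols = static_cast<std::size_t>(n_cols);
  assert(row_major.size() == rows * cols);
  std::vector<float> col_major(row_major.size());
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t j = 0; j < cols; ++j) {
      // Element (i, j): offset i * cols + j in row-major,
      // offset j * rows + i in column-major.
      col_major[j * rows + i] = row_major[i * cols + j];
    }
  }
  return col_major;
}
```

For the Digits data this would be called as `to_col_major(X, n, p)`, and the result copied to the device in place of `X`.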