Implementation:Rapidsai Cuml Make Blobs
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Synthetic_Data_Generation |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Generates synthetic clustered datasets on the GPU, as a GPU counterpart to scikit-learn's sklearn.datasets.make_blobs.
Description
The ML::Datasets::make_blobs function creates isotropic Gaussian blobs for clustering benchmarks and testing. It generates a feature matrix and corresponding label array on-device. Callers can control the number of samples, features, and clusters, optionally providing pre-defined cluster centers and per-cluster standard deviations. The function supports both row-major and column-major output layouts, shuffling of data and labels, and configurable bounding boxes for randomly generated cluster centers.
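To make "isotropic Gaussian blobs" concrete, the following host-side C++ sketch draws each cluster center uniformly from the bounding box and then offsets every sample from its center by independent Gaussian noise with the same standard deviation in every dimension. It is an illustration under stated assumptions, not cuML's implementation: the names `Blobs` and `make_blobs_reference` are hypothetical, the round-robin assignment of samples to clusters is an assumption, and shuffling and per-cluster standard deviations are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Host-side illustration only; this is not cuML's implementation.
struct Blobs {
  std::vector<float> data;      // row-major, n_rows x n_cols
  std::vector<int64_t> labels;  // n_rows
  std::vector<float> centers;   // row-major, n_clusters x n_cols
};

Blobs make_blobs_reference(int64_t n_rows,
                           int64_t n_cols,
                           int64_t n_clusters,
                           float cluster_std    = 1.f,
                           float center_box_min = -10.f,
                           float center_box_max = 10.f,
                           uint64_t seed        = 0ULL)
{
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<float> center_dist(center_box_min, center_box_max);
  std::normal_distribution<float> noise(0.f, 1.f);

  Blobs b;
  // Draw every coordinate of every center uniformly from the bounding box.
  b.centers.resize(static_cast<std::size_t>(n_clusters * n_cols));
  for (auto& c : b.centers) { c = center_dist(rng); }

  b.data.resize(static_cast<std::size_t>(n_rows * n_cols));
  b.labels.resize(static_cast<std::size_t>(n_rows));
  for (int64_t i = 0; i < n_rows; ++i) {
    const int64_t k = i % n_clusters;  // assumption: round-robin cluster assignment
    b.labels[i]     = k;
    for (int64_t j = 0; j < n_cols; ++j) {
      // Isotropic: the same standard deviation is applied in every dimension.
      b.data[i * n_cols + j] = b.centers[k * n_cols + j] + cluster_std * noise(rng);
    }
  }
  return b;
}
```

Calling `make_blobs_reference(1000, 2, 5)` yields 2000 feature values and 1000 labels on the host. Setting `cluster_std` to zero collapses every sample onto its cluster center, which is a quick way to sanity-check the indexing.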
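Four overloads cover every combination of single- or double-precision data (float, double) and 32-bit or 64-bit index types (int, int64_t). The index type applies to the labels array as well as to the n_rows, n_cols, and n_clusters arguments, so callers can match whichever index type their downstream code uses.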
Usage
Use this function to generate synthetic clustered data on the GPU for testing clustering algorithms (K-Means, DBSCAN, etc.), benchmarking, or prototyping. It provides a fast GPU-native alternative to scikit-learn's make_blobs for CUDA-based workflows.
Code Reference
Source Location
- Repository: Rapidsai_Cuml
- File: cpp/include/cuml/datasets/make_blobs.hpp
Signature
namespace ML {
namespace Datasets {
void make_blobs(const raft::handle_t& handle,
float* out,
int64_t* labels,
int64_t n_rows,
int64_t n_cols,
int64_t n_clusters,
bool row_major = true,
const float* centers = nullptr,
const float* cluster_std = nullptr,
const float cluster_std_scalar = 1.f,
bool shuffle = true,
float center_box_min = -10.f,
float center_box_max = 10.f,
uint64_t seed = 0ULL);
void make_blobs(const raft::handle_t& handle,
double* out,
int64_t* labels,
int64_t n_rows,
int64_t n_cols,
int64_t n_clusters,
bool row_major = true,
const double* centers = nullptr,
const double* cluster_std = nullptr,
const double cluster_std_scalar = 1.0,
bool shuffle = true,
double center_box_min = -10.0,
double center_box_max = 10.0,
uint64_t seed = 0ULL);
void make_blobs(const raft::handle_t& handle,
float* out,
int* labels,
int n_rows,
int n_cols,
int n_clusters,
bool row_major = true,
const float* centers = nullptr,
const float* cluster_std = nullptr,
const float cluster_std_scalar = 1.f,
bool shuffle = true,
float center_box_min = -10.f,
float center_box_max = 10.0,
uint64_t seed = 0ULL);
void make_blobs(const raft::handle_t& handle,
double* out,
int* labels,
int n_rows,
int n_cols,
int n_clusters,
bool row_major = true,
const double* centers = nullptr,
const double* cluster_std = nullptr,
const double cluster_std_scalar = 1.0,
bool shuffle = true,
double center_box_min = -10.0,
double center_box_max = 10.0,
uint64_t seed = 0ULL);
} // namespace Datasets
} // namespace ML
Import
#include <cuml/datasets/make_blobs.hpp>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| handle | const raft::handle_t& | Yes | RAFT handle that supplies the CUDA stream and other GPU resources used by cuML |
| n_rows | int64_t / int | Yes | Number of data samples to generate |
| n_cols | int64_t / int | Yes | Number of features per sample |
| n_clusters | int64_t / int | Yes | Number of clusters (classes) to generate |
| row_major | bool | No (default true) | Whether output is stored in row-major layout |
| centers | const float* / const double* | No (default nullptr) | Pre-defined cluster centers on device [n_clusters x n_cols]; nullptr for random generation |
| cluster_std | const float* / const double* | No (default nullptr) | Per-cluster standard deviations on device [n_clusters]; nullptr to use cluster_std_scalar |
| cluster_std_scalar | float/double | No (default 1.0) | Uniform standard deviation for all clusters (used when cluster_std is nullptr) |
| shuffle | bool | No (default true) | Whether to shuffle the generated data and labels |
| center_box_min | float/double | No (default -10.0) | Minimum value for randomly generated cluster centers |
| center_box_max | float/double | No (default 10.0) | Maximum value for randomly generated cluster centers |
| seed | uint64_t | No (default 0) | Seed for the random number generator |
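The row_major flag determines where element (row, col) of the output matrix sits in the flat buffer. The helper below is a small sketch of the standard row-major and column-major offset formulas for a dense n_rows x n_cols matrix; `element_offset` is a hypothetical name used for illustration and is not part of cuML.

```cpp
#include <cstddef>
#include <cstdint>

// Flat offset of element (row, col) in a dense n_rows x n_cols matrix.
// Row-major keeps each sample's features contiguous; column-major keeps
// each feature's values contiguous across samples.
inline std::size_t element_offset(
  int64_t row, int64_t col, int64_t n_rows, int64_t n_cols, bool row_major)
{
  return row_major ? static_cast<std::size_t>(row * n_cols + col)
                   : static_cast<std::size_t>(col * n_rows + row);
}
```

For a 1000 x 2 matrix, element (2, 1) is at offset 5 in row-major order and at offset 1002 in column-major order.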
Outputs
| Name | Type | Description |
|---|---|---|
| out | float* / double* | Caller-allocated device pointer that receives the generated feature matrix [n_rows x n_cols] |
| labels | int64_t* / int* | Caller-allocated device pointer that receives the generated label vector [n_rows] |
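Both output buffers have to be sized before the call. The sketch below computes the byte counts from the shapes listed above; `BlobBufferSizes` and `blob_buffer_sizes` are hypothetical helpers, not cuML functions. Widening to std::size_t before multiplying matters mostly for the int overloads, where n_rows * n_cols evaluated in 32-bit arithmetic can overflow for large datasets.

```cpp
#include <cstddef>
#include <cstdint>

// Byte counts for the two output buffers of a make_blobs-style call.
struct BlobBufferSizes {
  std::size_t data_bytes;   // feature matrix, n_rows x n_cols elements of DataT
  std::size_t label_bytes;  // label vector, n_rows elements of IdxT
};

template <typename DataT, typename IdxT>
BlobBufferSizes blob_buffer_sizes(IdxT n_rows, IdxT n_cols)
{
  // Widen first so the product is computed in std::size_t, not in IdxT.
  const std::size_t rows = static_cast<std::size_t>(n_rows);
  const std::size_t cols = static_cast<std::size_t>(n_cols);
  return {rows * cols * sizeof(DataT), rows * sizeof(IdxT)};
}
```

The resulting sizes can be passed to whichever device allocator the surrounding code uses, for example cudaMalloc.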
Usage Examples
#include <cuml/datasets/make_blobs.hpp>
#include <raft/core/handle.hpp>

#include <cuda_runtime.h>

#include <cstdint>

void generate_clustering_data()
{
  raft::handle_t handle;

  int64_t n_rows     = 1000;
  int64_t n_cols     = 2;
  int64_t n_clusters = 5;

  // Allocate device memory for the feature matrix and the label vector
  float* data     = nullptr;
  int64_t* labels = nullptr;
  cudaMalloc(&data, n_rows * n_cols * sizeof(float));
  cudaMalloc(&labels, n_rows * sizeof(int64_t));

  // Generate a 5-cluster blob dataset
  ML::Datasets::make_blobs(handle, data, labels,
                           n_rows, n_cols, n_clusters,
                           true,     // row_major
                           nullptr,  // centers (random)
                           nullptr,  // cluster_std (use scalar)
                           1.0f,     // cluster_std_scalar
                           true,     // shuffle
                           -10.0f,   // center_box_min
                           10.0f,    // center_box_max
                           42ULL);   // seed

  // Wait for the work queued on the handle's stream to finish
  handle.sync_stream();

  // Use data and labels for clustering experiments...

  cudaFree(data);
  cudaFree(labels);
}
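After generation, a quick host-side sanity check is to count how many samples carry each label. The sketch below assumes the labels have already been copied into a host vector (for example with cudaMemcpy) and that valid labels lie in the range [0, n_clusters), which this page does not state explicitly. `label_histogram` is a hypothetical helper, not part of cuML.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how many samples carry each label. Returns an empty vector if any
// label falls outside [0, n_clusters), which would point to a mismatch
// between the buffer contents and the requested cluster count.
std::vector<std::size_t> label_histogram(const std::vector<int64_t>& labels,
                                         int64_t n_clusters)
{
  std::vector<std::size_t> counts(static_cast<std::size_t>(n_clusters), 0);
  for (int64_t label : labels) {
    if (label < 0 || label >= n_clusters) { return {}; }
    ++counts[static_cast<std::size_t>(label)];
  }
  return counts;
}
```

A histogram with an empty bin or a badly skewed distribution is a hint that the shape arguments or the label buffer type do not match the overload that was actually called.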