Implementation:Haifengl Smile SparseDataset
| Knowledge Sources | |
|---|---|
| Domains | Data Structures, Sparse Data, Machine Learning |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
SparseDataset is a generic dataset class that stores sparse data in List of Lists (LIL) format, where each row is represented as a sparse array of column index and value pairs.
Description
SparseDataset extends SimpleDataset<SparseArray, T> and implements the List of Lists (LIL) sparse matrix storage format. LIL keeps one list per row, and each entry in that list holds a column index and a value. Entries are kept sorted by column index for faster lookup. The format suits incremental matrix construction; once built, the dataset can be converted to the Harwell-Boeing column-compressed sparse matrix format (SparseMatrix) for efficient matrix operations.
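To make the LIL layout concrete, here is a minimal plain-Java sketch of one sorted list per row with binary-search lookup. The class `LilSketch` and its `Entry` type are hypothetical names chosen for illustration; this is not Smile's SparseArray or SparseDataset code, only an approximation of the storage idea described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative List of Lists storage; hypothetical, not Smile's implementation. */
class LilSketch {
    /** One stored entry: a column index and its nonzero value. */
    static final class Entry {
        final int col;
        final double value;
        Entry(int col, double value) { this.col = col; this.value = value; }
    }

    private final List<List<Entry>> rows = new ArrayList<>();

    /** Appends an empty row and returns its index. */
    int addRow() {
        rows.add(new ArrayList<>());
        return rows.size() - 1;
    }

    /** Sets (i, j) while keeping the row sorted by column index. */
    void set(int i, int j, double value) {
        List<Entry> row = rows.get(i);
        int pos = find(row, j);
        if (pos >= 0) {
            if (value == 0.0) row.remove(pos);        // drop entries that become zero
            else row.set(pos, new Entry(j, value));   // overwrite an existing entry
        } else if (value != 0.0) {
            row.add(-pos - 1, new Entry(j, value));   // insert at the sorted position
        }
    }

    /** Returns the value at (i, j), or 0.0 when no entry is stored. */
    double get(int i, int j) {
        List<Entry> row = rows.get(i);
        int pos = find(row, j);
        return pos >= 0 ? row.get(pos).value : 0.0;
    }

    /** Total number of stored (nonzero) entries. */
    int nz() {
        int n = 0;
        for (List<Entry> row : rows) n += row.size();
        return n;
    }

    /** Binary search by column; returns the index, or -(insertionPoint) - 1 if absent. */
    private static int find(List<Entry> row, int col) {
        int lo = 0, hi = row.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = row.get(mid).col;
            if (c < col) lo = mid + 1;
            else if (c > col) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1);
    }
}
```

Because each row is sorted, a lookup costs a binary search over that row's entries only, while an insertion shifts entries within a single row. That is the trade-off that makes LIL convenient for building a matrix incrementally.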
The class tracks the number of nonzero entries, both in total and per column, supports L1 and L2 row normalization, and provides static factory methods for constructing datasets from arrays or from files in coordinate triple tuple list format.
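The following plain-Java sketch shows what L1 and L2 row normalization and per-column nonzero counting amount to when a row is held as parallel arrays of column indices and values. The class `RowNormSketch` and its array-based representation are assumptions made for illustration; Smile's own methods operate on SparseArray rows and may differ in detail, for example in how they treat all-zero rows.

```java
/** Illustrative row normalization and column counting; not Smile's code. */
class RowNormSketch {
    /** L1: scales the stored values in place so their absolute values sum to 1. */
    static void unitize1(double[] values) {
        double norm = 0.0;
        for (double v : values) norm += Math.abs(v);
        if (norm == 0.0) return; // assumption: an all-zero row is left unchanged
        for (int k = 0; k < values.length; k++) values[k] /= norm;
    }

    /** L2: scales the stored values in place so the Euclidean norm is 1. */
    static void unitize(double[] values) {
        double sumSq = 0.0;
        for (double v : values) sumSq += v * v;
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) return; // assumption: an all-zero row is left unchanged
        for (int k = 0; k < values.length; k++) values[k] /= norm;
    }

    /** Counts stored entries per column, given each row's column indices. */
    static int[] columnCounts(int[][] rowCols, int ncol) {
        int[] counts = new int[ncol];
        for (int[] cols : rowCols) {
            for (int j : cols) counts[j]++;
        }
        return counts;
    }
}
```

Only the stored values need to be touched, since the implicit zeros contribute nothing to either norm.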
Usage
Use SparseDataset when working with high-dimensional data where most feature values are zero, such as text mining (bag-of-words), recommender systems, or any application with sparse feature vectors. It is typically used during data loading and preprocessing before converting to a compressed sparse matrix for computation.
Code Reference
Source Location
- Repository: Haifengl_Smile
- File: base/src/main/java/smile/data/SparseDataset.java
- Lines: 1-313
Signature
public class SparseDataset<T> extends SimpleDataset<SparseArray, T> {
// Constructors
public SparseDataset(Collection<SampleInstance<SparseArray, T>> data);
public SparseDataset(Collection<SampleInstance<SparseArray, T>> data, int ncol);
// Query methods
public int nz();
public int nz(int j);
public int nrow();
public int ncol();
public double get(int i, int j);
// Normalization
public void unitize(); // L2 normalization
public void unitize1(); // L1 normalization
// Conversion
public SparseMatrix toMatrix();
// Static factory methods
public static SparseDataset<Void> of(SparseArray[] data);
public static SparseDataset<Void> of(SparseArray[] data, int ncol);
public static SparseDataset<Void> from(Path path) throws IOException, ParseException;
public static SparseDataset<Void> from(Path path, int arrayIndexOrigin) throws IOException, ParseException;
}
Import
import smile.data.SparseDataset;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | Collection<SampleInstance<SparseArray, T>> | Yes | Collection of sample instances with sparse array features and optional targets. |
| ncol | int | No | The number of columns. If omitted, inferred from the data as the largest column index plus one. |
| path | Path | Yes (for from()) | File path to a coordinate triple tuple list file (instanceID, attributeID, value). |
| arrayIndexOrigin | int | No | Starting index of arrays (0 for C/Java, 1 for Fortran). Defaults to 0. |
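The role of arrayIndexOrigin can be sketched in plain Java as follows: each line holds an instanceID, an attributeID, and a value, and the origin is subtracted from both IDs to obtain zero-based indices. The class `CoordinateParseSketch` is hypothetical, and the sketch assumes whitespace-separated tokens and no header line; it is not the parser used by `SparseDataset.from`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for coordinate triples; not Smile's file reader. */
class CoordinateParseSketch {
    /**
     * Parses "instanceID attributeID value" lines into zero-based
     * {row, column, value} triples.
     * Assumptions: whitespace-separated tokens, no header line.
     */
    static List<double[]> parse(List<String> lines, int arrayIndexOrigin) {
        List<double[]> triples = new ArrayList<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length != 3) continue; // this sketch skips blank or malformed lines
            int row = Integer.parseInt(tokens[0]) - arrayIndexOrigin;
            int col = Integer.parseInt(tokens[1]) - arrayIndexOrigin;
            double value = Double.parseDouble(tokens[2]);
            triples.add(new double[] {row, col, value});
        }
        return triples;
    }
}
```

With an origin of 1, the line `2 6 4.0` maps to row 1, column 5; with an origin of 0, the same line maps to row 2, column 6.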
Outputs
| Name | Type | Description |
|---|---|---|
| SparseDataset<T> | SparseDataset<T> | A sparse dataset with LIL storage format. |
| toMatrix() | SparseMatrix | Harwell-Boeing column-compressed sparse matrix representation. |
| nz() | int | Total number of nonzero entries across the dataset. |
| get(i, j) | double | The value at row i, column j (0.0 if not set). |
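For readers unfamiliar with the column-compressed layout that toMatrix() targets, the sketch below converts row-wise (LIL-style) data into three arrays: column start offsets, row indices, and values. It is a generic two-pass counting conversion written in plain Java; the class `LilToCscSketch`, the `Csc` holder, and the array-based input are assumptions for illustration and are not taken from Smile's source.

```java
import java.util.Arrays;

/** Illustrative LIL to column-compressed conversion; not Smile's toMatrix(). */
class LilToCscSketch {
    /** Column-compressed arrays: column j occupies positions colIndex[j] .. colIndex[j+1]-1. */
    static final class Csc {
        final int[] colIndex;
        final int[] rowIndex;
        final double[] values;
        Csc(int[] colIndex, int[] rowIndex, double[] values) {
            this.colIndex = colIndex;
            this.rowIndex = rowIndex;
            this.values = values;
        }
    }

    /** rowCols[i] and rowVals[i] hold the column indices and values stored for row i. */
    static Csc convert(int[][] rowCols, double[][] rowVals, int ncol) {
        // Pass 1: count the entries in each column.
        int[] colIndex = new int[ncol + 1];
        int nz = 0;
        for (int[] cols : rowCols) {
            for (int j : cols) {
                colIndex[j + 1]++;
                nz++;
            }
        }
        // Prefix sum turns the counts into column start offsets.
        for (int j = 0; j < ncol; j++) colIndex[j + 1] += colIndex[j];

        // Pass 2: scatter each entry into the next free slot of its column.
        int[] next = Arrays.copyOf(colIndex, ncol);
        int[] rowIndex = new int[nz];
        double[] values = new double[nz];
        for (int i = 0; i < rowCols.length; i++) {
            for (int k = 0; k < rowCols[i].length; k++) {
                int pos = next[rowCols[i][k]]++;
                rowIndex[pos] = i;
                values[pos] = rowVals[i][k];
            }
        }
        return new Csc(colIndex, rowIndex, values);
    }
}
```

Because rows are visited in increasing order, the row indices within each column come out sorted without an extra sorting step.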
Usage Examples
Basic Usage
import smile.data.SparseDataset;
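import smile.util.SparseArray;
// Assumed package for SparseMatrix (Smile 3.x/4.x); adjust if your version places it elsewhere
import smile.math.matrix.SparseMatrix;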
// Create sparse arrays
SparseArray[] data = new SparseArray[3];
data[0] = new SparseArray();
data[0].set(0, 1.0);
data[0].set(5, 2.5);
data[1] = new SparseArray();
data[1].set(2, 3.0);
data[1].set(5, 1.0);
data[2] = new SparseArray();
data[2].set(0, 0.5);
data[2].set(3, 4.0);
// Create the sparse dataset
SparseDataset<Void> dataset = SparseDataset.of(data);
// Query dimensions
int rows = dataset.nrow(); // 3
int cols = dataset.ncol(); // 6
int nonzeros = dataset.nz(); // 6
double val = dataset.get(0, 5); // 2.5
// L2 normalize rows
dataset.unitize();
// Convert to compressed sparse matrix
SparseMatrix matrix = dataset.toMatrix();
Loading from File
import smile.data.SparseDataset;
import java.nio.file.Path;
// Load from coordinate triple tuple file
// File format: an optional header line (nrow ncol nz), then one line per entry: instanceID attributeID value
SparseDataset<Void> dataset = SparseDataset.from(Path.of("data/sparse.txt"));
// Load with Fortran 1-based indexing
SparseDataset<Void> dataset2 = SparseDataset.from(Path.of("data/sparse_fortran.txt"), 1);