Implementation:Haifengl Smile SparseDataset

From Leeroopedia


Knowledge Sources
Domains: Data Structures, Sparse Data, Machine Learning
Last Updated: 2026-02-08 22:00 GMT

Overview

SparseDataset is a generic dataset class that stores sparse data in List of Lists (LIL) format, where each row is represented as a sparse array of column index and value pairs.

Description

SparseDataset extends SimpleDataset<SparseArray, T> and implements the List of Lists (LIL) sparse matrix storage format. LIL stores one list per row, where each entry stores a column index and value. Entries are kept sorted by column index for faster lookup. This format is optimized for incremental matrix construction and can be converted to Harwell-Boeing column-compressed sparse matrix format (SparseMatrix) for efficient matrix operations.
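To make the LIL-to-column-compressed conversion concrete, here is a minimal, library-free sketch. It is not Smile's toMatrix() implementation: the class LilToCscSketch, the holder Csc, and the method toCsc are hypothetical names, and the LIL rows are modeled as parallel arrays of column indices and values rather than SparseArray objects.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of the Smile API.
class LilToCscSketch {
    /** Column-compressed storage: colStart holds ncol + 1 offsets into rowIndex/values. */
    static final class Csc {
        final int[] colStart;
        final int[] rowIndex;
        final double[] values;

        Csc(int[] colStart, int[] rowIndex, double[] values) {
            this.colStart = colStart;
            this.rowIndex = rowIndex;
            this.values = values;
        }
    }

    /**
     * Converts LIL rows to column-compressed form.
     * cols[i] and vals[i] hold the column indices and values of row i.
     */
    static Csc toCsc(int[][] cols, double[][] vals, int ncol) {
        // Pass 1: count the nonzeros in each column (shifted by one slot).
        int[] colStart = new int[ncol + 1];
        int nz = 0;
        for (int[] row : cols) {
            for (int j : row) {
                colStart[j + 1]++;
                nz++;
            }
        }
        // Prefix sum turns the counts into starting offsets per column.
        for (int j = 0; j < ncol; j++) {
            colStart[j + 1] += colStart[j];
        }

        // Pass 2: scatter each entry into the next free slot of its column.
        int[] next = Arrays.copyOf(colStart, ncol);
        int[] rowIndex = new int[nz];
        double[] values = new double[nz];
        for (int i = 0; i < cols.length; i++) {
            for (int k = 0; k < cols[i].length; k++) {
                int slot = next[cols[i][k]]++;
                rowIndex[slot] = i;
                values[slot] = vals[i][k];
            }
        }
        return new Csc(colStart, rowIndex, values);
    }
}
```

Because rows are visited in increasing order, the row indices inside each column come out sorted without an extra sort. The first pass is also where a per-column nonzero count naturally falls out.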

The class tracks the number of nonzero entries both globally and per column, supports L1 and L2 row normalization, and provides static factory methods that build a dataset from an array of sparse rows or from a coordinate-triple file (one instanceID, attributeID, value tuple per line).
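The two row normalizations can be illustrated on the stored values of a single row. This is a sketch under assumptions, not Smile's code: RowNormalizationSketch, unitizeL2, and unitizeL1 are hypothetical names, and leaving an all-zero row unchanged is a choice made here to avoid division by zero, not documented library behavior.

```java
// Hypothetical illustration only; not part of the Smile API.
class RowNormalizationSketch {
    /** Scales the row in place so that its Euclidean (L2) norm is 1. */
    static void unitizeL2(double[] values) {
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) {
            return; // assumption: an all-zero row is left untouched
        }
        for (int k = 0; k < values.length; k++) {
            values[k] /= norm;
        }
    }

    /** Scales the row in place so that the sum of absolute values (L1 norm) is 1. */
    static void unitizeL1(double[] values) {
        double norm = 0.0;
        for (double v : values) {
            norm += Math.abs(v);
        }
        if (norm == 0.0) {
            return; // assumption: an all-zero row is left untouched
        }
        for (int k = 0; k < values.length; k++) {
            values[k] /= norm;
        }
    }
}
```

Only the stored nonzero values need to be visited, since zeros contribute nothing to either norm. That is why normalization is cheap on a sparse row regardless of the number of columns.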

Usage

Use SparseDataset when working with high-dimensional data where most feature values are zero, such as text mining (bag-of-words), recommender systems, or any application with sparse feature vectors. It is typically used during data loading and preprocessing before converting to a compressed sparse matrix for computation.
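As an illustration of the bag-of-words case, the sketch below turns tokenized documents into sparse rows of (column index, count) pairs kept sorted by column. It uses only the Java standard library. BagOfWordsSketch and vectorize are hypothetical names, and the TreeMap rows stand in for SparseArray purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Smile API.
class BagOfWordsSketch {
    /**
     * Builds one sparse row per document. The vocabulary maps each token to a
     * column index and grows as unseen tokens appear.
     */
    static List<TreeMap<Integer, Double>> vectorize(List<String[]> docs,
                                                    Map<String, Integer> vocabulary) {
        List<TreeMap<Integer, Double>> rows = new ArrayList<>();
        for (String[] doc : docs) {
            // TreeMap keeps entries ordered by column index, as LIL rows are.
            TreeMap<Integer, Double> row = new TreeMap<>();
            for (String token : doc) {
                Integer col = vocabulary.get(token);
                if (col == null) {
                    col = vocabulary.size(); // next unused column
                    vocabulary.put(token, col);
                }
                row.merge(col, 1.0, Double::sum); // increment the term count
            }
            rows.add(row);
        }
        return rows;
    }
}
```

Each row stores only the terms that occur in its document, so memory grows with the number of distinct terms per document rather than with the vocabulary size.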

Code Reference

Source Location

Signature

public class SparseDataset<T> extends SimpleDataset<SparseArray, T> {
    // Constructors
    public SparseDataset(Collection<SampleInstance<SparseArray, T>> data);
    public SparseDataset(Collection<SampleInstance<SparseArray, T>> data, int ncol);

    // Query methods
    public int nz();
    public int nz(int j);
    public int nrow();
    public int ncol();
    public double get(int i, int j);

    // Normalization
    public void unitize();   // L2 normalization
    public void unitize1();  // L1 normalization

    // Conversion
    public SparseMatrix toMatrix();

    // Static factory methods
    public static SparseDataset<Void> of(SparseArray[] data);
    public static SparseDataset<Void> of(SparseArray[] data, int ncol);
    public static SparseDataset<Void> from(Path path) throws IOException, ParseException;
    public static SparseDataset<Void> from(Path path, int arrayIndexOrigin) throws IOException, ParseException;
}

Import

import smile.data.SparseDataset;

I/O Contract

Inputs

Name | Type | Required | Description
data | Collection<SampleInstance<SparseArray, T>> | Yes | Collection of sample instances with sparse array features and optional targets.
ncol | int | No | The number of columns. If omitted, inferred from the maximum column index in data.
path | Path | Yes (for from()) | File path to a coordinate triple tuple list file (instanceID, attributeID, value).
arrayIndexOrigin | int | No | Starting index of arrays (0 for C/Java, 1 for Fortran). Defaults to 0.

Outputs

Name | Type | Description
SparseDataset<T> | SparseDataset<T> | A sparse dataset with LIL storage format.
toMatrix() | SparseMatrix | Harwell-Boeing column-compressed sparse matrix representation.
nz() | int | Total number of nonzero entries across the dataset.
get(i, j) | double | The value at row i, column j (0.0 if not set).

Usage Examples

Basic Usage

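import smile.data.SparseDataset;
import smile.math.matrix.SparseMatrix; // needed for toMatrix(); assumed package, check your Smile version
import smile.util.SparseArray;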

// Create sparse arrays
SparseArray[] data = new SparseArray[3];
data[0] = new SparseArray();
data[0].set(0, 1.0);
data[0].set(5, 2.5);

data[1] = new SparseArray();
data[1].set(2, 3.0);
data[1].set(5, 1.0);

data[2] = new SparseArray();
data[2].set(0, 0.5);
data[2].set(3, 4.0);

// Create the sparse dataset
SparseDataset<Void> dataset = SparseDataset.of(data);

// Query dimensions
int rows = dataset.nrow();       // 3
int cols = dataset.ncol();       // 6
int nonzeros = dataset.nz();     // 6
double val = dataset.get(0, 5);  // 2.5

// L2 normalize rows
dataset.unitize();

// Convert to compressed sparse matrix
SparseMatrix matrix = dataset.toMatrix();

Loading from File

import smile.data.SparseDataset;
import java.nio.file.Path;

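// Load from a coordinate triple tuple file.
// File format: an optional header line (nrow ncol nz), then one
// "instanceID attributeID value" triple per line.
// Per the signature above, from() throws IOException and ParseException.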
SparseDataset<Void> dataset = SparseDataset.from(Path.of("data/sparse.txt"));

// Load with Fortran 1-based indexing
SparseDataset<Void> dataset2 = SparseDataset.from(Path.of("data/sparse_fortran.txt"), 1);
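The effect of arrayIndexOrigin can be sketched with a small standalone parser. This is not Smile's reader: CoordinateTripleSketch and parse are hypothetical names, the input is a list of already-read lines rather than a Path, and any header line is assumed to have been removed beforehand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Smile API.
class CoordinateTripleSketch {
    /**
     * Parses "instanceID attributeID value" lines into sparse rows.
     * Both indices are shifted by arrayIndexOrigin so that the in-memory
     * rows and columns are always 0-based.
     */
    static List<TreeMap<Integer, Double>> parse(List<String> lines, int arrayIndexOrigin) {
        List<TreeMap<Integer, Double>> rows = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            int i = Integer.parseInt(tokens[0]) - arrayIndexOrigin;
            int j = Integer.parseInt(tokens[1]) - arrayIndexOrigin;
            double x = Double.parseDouble(tokens[2]);
            if (i < 0 || j < 0) {
                throw new IllegalArgumentException("Index below origin: " + line);
            }
            // Grow the row list on demand; rows with no entries stay empty.
            while (rows.size() <= i) {
                rows.add(new TreeMap<>());
            }
            rows.get(i).put(j, x);
        }
        return rows;
    }
}
```

With an origin of 1, the line "1 6 2.5" lands at row 0, column 5, which is the same cell that dataset.get(0, 5) reads in the basic example.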
