Implementation:Haifengl Smile SparseDataset

Knowledge Sources	Haifengl_Smile
Domains	Data Structures, Sparse Data, Machine Learning
Last Updated	2026-02-08 22:00 GMT

Overview

SparseDataset is a generic dataset class that stores sparse data in List of Lists (LIL) format, where each row is represented as a sparse array of column index and value pairs.

Description

SparseDataset extends SimpleDataset<SparseArray, T> and implements the List of Lists (LIL) sparse matrix storage format. LIL keeps one list per row, and each entry in a list holds a column index and its value. Entries are kept sorted by column index so that lookups are faster. The format suits incremental matrix construction; for efficient matrix operations, the dataset can be converted to a SparseMatrix in Harwell-Boeing column-compressed format.
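To make the LIL row layout concrete, the following is a minimal sketch of a single row that keeps its entries sorted by column index and looks them up by binary search. The class LilRowSketch and its methods are hypothetical names used only for illustration. It is not Smile's SparseArray, and it assumes that storing 0.0 removes an entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative LIL row: parallel lists of column indices (sorted ascending) and values. */
final class LilRowSketch {
    private final List<Integer> index = new ArrayList<>();
    private final List<Double> value = new ArrayList<>();

    /** Binary search for column j; returns its position, or -(insertionPoint) - 1 if absent. */
    private int find(int j) {
        int lo = 0, hi = index.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = index.get(mid);
            if (c < j) lo = mid + 1;
            else if (c > j) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1);
    }

    /** Sets column j to x while keeping entries sorted; storing 0.0 removes the entry (assumption). */
    void set(int j, double x) {
        int pos = find(j);
        if (pos >= 0) {
            if (x == 0.0) {
                index.remove(pos);
                value.remove(pos);
            } else {
                value.set(pos, x);
            }
        } else if (x != 0.0) {
            int ins = -(pos + 1);
            index.add(ins, j);
            value.add(ins, x);
        }
    }

    /** Returns the stored value at column j, or 0.0 when no entry exists. */
    double get(int j) {
        int pos = find(j);
        return pos >= 0 ? value.get(pos) : 0.0;
    }

    /** Number of stored (nonzero) entries in this row. */
    int size() {
        return index.size();
    }
}
```

Because the indices stay sorted, a lookup costs O(log k) for a row with k stored entries, while an insert may shift later entries. That trade-off is why the format favors incremental construction over heavy arithmetic.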

The class tracks the number of nonzero entries both in total and per column, supports L1 and L2 row normalization, and provides static factory methods for constructing datasets from arrays or from files in coordinate triple tuple (instanceID, attributeID, value) format.
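The two normalizations and the per-column nonzero counts can be sketched on plain arrays as follows. The class UnitizeSketch is a hypothetical helper, not part of Smile. It operates on the stored values of one row and assumes an all-zero row is left unchanged.

```java
final class UnitizeSketch {
    /** L2 normalization: scales the values so their Euclidean norm is 1. */
    static void unitize(double[] values) {
        double sumSq = 0.0;
        for (double v : values) sumSq += v * v;
        double norm = Math.sqrt(sumSq);
        if (norm == 0.0) return; // assumption: leave an all-zero row untouched
        for (int k = 0; k < values.length; k++) values[k] /= norm;
    }

    /** L1 normalization: scales the values so their absolute values sum to 1. */
    static void unitize1(double[] values) {
        double norm = 0.0;
        for (double v : values) norm += Math.abs(v);
        if (norm == 0.0) return; // assumption: leave an all-zero row untouched
        for (int k = 0; k < values.length; k++) values[k] /= norm;
    }

    /** Counts stored entries per column, given the column indices of each row. */
    static int[] columnCounts(int[][] rowIndices, int ncol) {
        int[] counts = new int[ncol];
        for (int[] row : rowIndices) {
            for (int j : row) counts[j]++;
        }
        return counts;
    }
}
```

Only the stored values need to be touched, since zeros contribute nothing to either norm. The total nonzero count is the sum of the per-column counts.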

Usage

Use SparseDataset when working with high-dimensional data where most feature values are zero, such as text mining (bag-of-words), recommender systems, or any application with sparse feature vectors. It is typically used during data loading and preprocessing before converting to a compressed sparse matrix for computation.
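As an illustration of the bag-of-words case, the sketch below turns a tokenized document into a sparse row of (column index, count) pairs sorted by column. BagOfWordsSketch, toSparseRow, and the vocabulary map are hypothetical names, and skipping out-of-vocabulary tokens is an assumption. The resulting entries could then be copied into a SparseArray one by one.

```java
import java.util.Map;
import java.util.TreeMap;

final class BagOfWordsSketch {
    /**
     * Maps tokens to (column index -> count). The TreeMap keeps entries sorted
     * by column index; tokens missing from the vocabulary are skipped (assumption).
     */
    static TreeMap<Integer, Double> toSparseRow(String[] tokens, Map<String, Integer> vocabulary) {
        TreeMap<Integer, Double> row = new TreeMap<>();
        for (String token : tokens) {
            Integer column = vocabulary.get(token);
            if (column != null) {
                row.merge(column, 1.0, Double::sum);
            }
        }
        return row;
    }
}
```

With a large vocabulary, each document stores only the handful of columns whose words actually occur, which is the situation the LIL format is meant for.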

Code Reference

Source Location

Repository: Haifengl_Smile
File: base/src/main/java/smile/data/SparseDataset.java
Lines: 1-313

Signature

public class SparseDataset<T> extends SimpleDataset<SparseArray, T> {
    // Constructors
    public SparseDataset(Collection<SampleInstance<SparseArray, T>> data);
    public SparseDataset(Collection<SampleInstance<SparseArray, T>> data, int ncol);

    // Query methods
    public int nz();
    public int nz(int j);
    public int nrow();
    public int ncol();
    public double get(int i, int j);

    // Normalization
    public void unitize();   // L2 normalization
    public void unitize1();  // L1 normalization

    // Conversion
    public SparseMatrix toMatrix();

    // Static factory methods
    public static SparseDataset<Void> of(SparseArray[] data);
    public static SparseDataset<Void> of(SparseArray[] data, int ncol);
    public static SparseDataset<Void> from(Path path) throws IOException, ParseException;
    public static SparseDataset<Void> from(Path path, int arrayIndexOrigin) throws IOException, ParseException;
}

Import

import smile.data.SparseDataset;

I/O Contract

Inputs

Name	Type	Required	Description
data	Collection<SampleInstance<SparseArray, T>>	Yes	Collection of sample instances with sparse array features and optional targets.
ncol	int	No	The number of columns. If omitted, inferred from the maximum column index in data.
path	Path	Yes (for from())	File path to a coordinate triple tuple list file (instanceID, attributeID, value).
arrayIndexOrigin	int	No	Starting index of arrays (0 for C/Java, 1 for Fortran). Defaults to 0.
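The effect of arrayIndexOrigin can be sketched with a small stand-alone parser for coordinate triples. CoordinateParseSketch and Entry are hypothetical names and this is not the parser used by from(). The sketch assumes whitespace-separated fields and that any header lines have already been removed.

```java
import java.util.ArrayList;
import java.util.List;

final class CoordinateParseSketch {
    /** One parsed entry with zero-based row and column indices. */
    static final class Entry {
        final int row;
        final int col;
        final double value;

        Entry(int row, int col, double value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    /**
     * Parses "instanceID attributeID value" lines. Subtracting arrayIndexOrigin
     * shifts 1-based (Fortran-style) indices down to the 0-based indices used in Java.
     */
    static List<Entry> parse(List<String> lines, int arrayIndexOrigin) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue; // skip blank lines
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length != 3) {
                throw new IllegalArgumentException("Expected 3 fields: " + line);
            }
            int row = Integer.parseInt(tokens[0]) - arrayIndexOrigin;
            int col = Integer.parseInt(tokens[1]) - arrayIndexOrigin;
            double value = Double.parseDouble(tokens[2]);
            entries.add(new Entry(row, col, value));
        }
        return entries;
    }
}
```

With an origin of 1, the line "1 1 0.5" lands at row 0, column 0. With an origin of 0, the same line lands at row 1, column 1.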

Outputs

Name	Type	Description
SparseDataset<T>	SparseDataset<T>	A sparse dataset with LIL storage format.
toMatrix()	SparseMatrix	Harwell-Boeing column-compressed sparse matrix representation.
nz()	int	Total number of nonzero entries across the dataset.
get(i, j)	double	The value at row i, column j (0.0 if not set).
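The conversion from row lists to column-compressed storage can be sketched as a counting pass followed by a fill pass. CscSketch is a hypothetical class written for illustration and does not reproduce the internals of toMatrix(). It assumes each row is given as parallel arrays of column indices and values.

```java
import java.util.Arrays;

final class CscSketch {
    final int[] colPointer; // length ncol + 1; column j occupies [colPointer[j], colPointer[j + 1])
    final int[] rowIndex;   // row index of each stored entry
    final double[] values;  // value of each stored entry

    private CscSketch(int[] colPointer, int[] rowIndex, double[] values) {
        this.colPointer = colPointer;
        this.rowIndex = rowIndex;
        this.values = values;
    }

    /** Builds column-compressed arrays from per-row column indices and values. */
    static CscSketch fromRows(int[][] index, double[][] value, int ncol) {
        // Pass 1: count the entries in each column.
        int[] colPointer = new int[ncol + 1];
        int nz = 0;
        for (int[] row : index) {
            for (int j : row) colPointer[j + 1]++;
            nz += row.length;
        }
        // A prefix sum turns the counts into column start offsets.
        for (int j = 0; j < ncol; j++) colPointer[j + 1] += colPointer[j];

        // Pass 2: scatter entries; next[j] is the next free slot in column j.
        int[] next = Arrays.copyOf(colPointer, ncol);
        int[] rowIndex = new int[nz];
        double[] values = new double[nz];
        for (int i = 0; i < index.length; i++) {
            for (int k = 0; k < index[i].length; k++) {
                int pos = next[index[i][k]]++;
                rowIndex[pos] = i;
                values[pos] = value[i][k];
            }
        }
        return new CscSketch(colPointer, rowIndex, values);
    }
}
```

Since rows are visited in increasing order, the row indices within each column come out sorted without an extra sorting step.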

Usage Examples

Basic Usage

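import smile.data.SparseDataset;
import smile.util.SparseArray;
// Assumption: SparseMatrix lives in smile.math.matrix (as in Smile 3.x); adjust if your version relocated it.
import smile.math.matrix.SparseMatrix;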

// Create sparse arrays
SparseArray[] data = new SparseArray[3];
data[0] = new SparseArray();
data[0].set(0, 1.0);
data[0].set(5, 2.5);

data[1] = new SparseArray();
data[1].set(2, 3.0);
data[1].set(5, 1.0);

data[2] = new SparseArray();
data[2].set(0, 0.5);
data[2].set(3, 4.0);

// Create the sparse dataset
SparseDataset<Void> dataset = SparseDataset.of(data);

// Query dimensions
int rows = dataset.nrow();       // 3
int cols = dataset.ncol();       // 6
int nonzeros = dataset.nz();     // 6
double val = dataset.get(0, 5);  // 2.5

// L2 normalize rows
dataset.unitize();

// Convert to compressed sparse matrix
SparseMatrix matrix = dataset.toMatrix();

Loading from File

import smile.data.SparseDataset;
import java.nio.file.Path;

// Load from coordinate triple tuple file
// File format: header (nrow ncol nz), then lines of (instanceID attributeID value)
// Note: from() throws IOException and ParseException, so the caller must handle or declare them
SparseDataset<Void> dataset = SparseDataset.from(Path.of("data/sparse.txt"));

// Load with Fortran 1-based indexing
SparseDataset<Void> dataset2 = SparseDataset.from(Path.of("data/sparse_fortran.txt"), 1);
