
Implementation:Haifengl Smile Point Data Formatting

From Leeroopedia


Overview

Point Data Formatting documents the expected data format that Smile's spatial index structures require. This is a Pattern Doc -- it describes how user application code should prepare and format data before passing it to the smile.neighbor package constructors, rather than documenting a single API method.

All Euclidean-space index structures (KDTree, LSH, MPLSH, RandomProjectionForest) require input data as double[][] where each double[] is a point with consistent dimensionality. General metric-space structures (CoverTree, BKTree, LinearSearch) accept arbitrary typed arrays with user-supplied distance functions.

Interface Specification

Primary Format: double[][]

The fundamental data format for Euclidean-space nearest neighbor search is a two-dimensional array of doubles:

double[][] data = new double[n][d];
// n = number of points
// d = number of dimensions (features)
// data[i] = the i-th point as a d-dimensional vector
// data[i][j] = the j-th coordinate of the i-th point

Invariants:

  • Every data[i] must be non-null.
  • Every data[i].length must equal data[0].length (uniform dimensionality).
  • The array must contain at least one point.
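The invariants above can be checked up front with a small validation helper. The sketch below uses only the JDK; `PointDataValidator` is a hypothetical name, not a Smile API, and the checks mirror the three invariants listed above.

```java
// Hypothetical helper (not part of Smile): validates the format invariants
// before the array is handed to a spatial index constructor.
public final class PointDataValidator {

    /** Throws IllegalArgumentException if data violates the format invariants. */
    public static void validate(double[][] data) {
        // Invariant: the array must contain at least one point
        if (data == null || data.length == 0) {
            throw new IllegalArgumentException("Data must contain at least one point");
        }
        if (data[0] == null) {
            throw new IllegalArgumentException("data[0] is null");
        }
        int d = data[0].length;
        for (int i = 0; i < data.length; i++) {
            // Invariant: every data[i] must be non-null
            if (data[i] == null) {
                throw new IllegalArgumentException("data[" + i + "] is null");
            }
            // Invariant: uniform dimensionality
            if (data[i].length != d) {
                throw new IllegalArgumentException(
                    "data[" + i + "] has " + data[i].length + " dimensions, expected " + d);
            }
        }
    }
}
```

Calling `validate(data)` immediately before index construction turns the late `NullPointerException` or `ArrayIndexOutOfBoundsException` failures described under Common Mistakes into a single, descriptive error at the point of entry.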

Data Sources

Data for nearest neighbor search can come from several sources within Smile and standard Java:

Source              | Method                      | Output Type | Notes
Smile DataFrame     | DataFrame.toArray()         | double[][]  | Converts numeric columns to matrix form
Manual construction | Direct array initialization | double[][]  | For programmatic or synthetic data
CSV/file parsing    | Various I/O utilities       | double[][]  | Parse into arrays after reading
Random generation   | MathEx.randn(n, d)          | double[][]  | Generates n points in d dimensions from a standard normal
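If Smile's `MathEx` is not on the classpath, the same `double[][]` shape can be produced with the JDK alone. The sketch below is a plain-Java stand-in for `MathEx.randn(n, d)` (the class and method names here are illustrative, not Smile's):

```java
import java.util.Random;

// Sketch: generate n points in d dimensions from a standard normal
// distribution, matching the double[n][d] shape expected by the indexes.
public final class RandnSketch {

    public static double[][] randn(int n, int d, long seed) {
        Random rng = new Random(seed);       // seeded for reproducibility
        double[][] data = new double[n][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) {
                data[i][j] = rng.nextGaussian(); // mean 0, variance 1
            }
        }
        return data;
    }
}
```

The resulting array satisfies all three invariants by construction: no null rows, uniform dimensionality, and at least one point whenever `n >= 1`.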

Usage Patterns

Pattern 1: Manual Data Construction

import smile.neighbor.KDTree;

// Create 2D point data manually
double[][] points = {
    {1.0, 2.0},
    {3.0, 4.0},
    {5.0, 6.0},
    {7.0, 8.0}
};

// Points are ready for indexing -- each row is a 2D point
KDTree<double[]> tree = KDTree.of(points);

Pattern 2: Generating Synthetic Data

import smile.math.MathEx;
import smile.neighbor.KDTree;

// Generate 1000 random points in 10-dimensional space
double[][] data = MathEx.randn(1000, 10);

// Data is immediately usable as spatial index input
KDTree<double[]> tree = new KDTree<>(data, data);

Pattern 3: Separating Keys from Values

When the indexed values differ from the search keys (e.g., labeled data where the key is a feature vector and the value is a label or other object):

import smile.neighbor.KDTree;

double[][] keys = new double[n][d]; // feature vectors
String[] labels = new String[n];    // associated labels

// Populate keys and labels from your data source
for (int i = 0; i < n; i++) {
    keys[i] = extractFeatures(rawData[i]);
    labels[i] = extractLabel(rawData[i]);
}

// KDTree stores keys for spatial indexing, labels as associated data
KDTree<String> tree = new KDTree<>(keys, labels);

Pattern 4: General Metric Space Data

For non-Euclidean metric spaces, data is formatted as typed arrays with an explicit distance function:

import smile.neighbor.CoverTree;
import smile.math.distance.EuclideanDistance;

double[][] points = loadData();

// CoverTree requires an explicit Metric; here the points serve as both keys and values
CoverTree<double[], double[]> tree = new CoverTree<>(
    points, points, new EuclideanDistance()
);

Pattern 5: String Data with Discrete Metrics

For discrete metric spaces such as edit distance on strings:

import smile.neighbor.BKTree;
import smile.math.distance.EditDistance;

String[] dictionary = {"hello", "help", "held", "hero", "helm"};

// BKTree for approximate string matching
BKTree<String, String> tree = BKTree.of(dictionary, new EditDistance(true));

Pattern 6: Feature Normalization Before Indexing

import smile.math.MathEx;
import smile.neighbor.KDTree;

double[][] raw = loadRawFeatures(); // features at different scales

// Normalize each feature to zero mean, unit variance
int n = raw.length;
int d = raw[0].length;
double[][] normalized = new double[n][d];

for (int j = 0; j < d; j++) {
    double mean = 0, std = 0;
    for (int i = 0; i < n; i++) mean += raw[i][j];
    mean /= n;
    for (int i = 0; i < n; i++) std += (raw[i][j] - mean) * (raw[i][j] - mean);
    std = Math.sqrt(std / n);
    for (int i = 0; i < n; i++) {
        normalized[i][j] = (std > 0) ? (raw[i][j] - mean) / std : 0.0;
    }
}

// Build index on normalized data
KDTree<double[]> tree = KDTree.of(normalized);

Common Mistakes

Mistake                                     | Symptom                                                               | Fix
Ragged arrays (inconsistent dimensions)     | ArrayIndexOutOfBoundsException during tree construction               | Ensure all data[i].length are equal
Null entries in data array                  | NullPointerException during construction                              | Filter out null entries before indexing
Mismatched key/data array lengths           | IllegalArgumentException: "The array size of keys and data are different" | Ensure keys.length == data.length
Unnormalized features with different scales | Poor neighbor quality; distance dominated by high-magnitude features  | Apply min-max or z-score normalization
Very high dimensionality with KDTree        | Performance degrades to brute-force scan                              | Use LSH, MPLSH, or RandomProjectionForest for d > 20
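Pattern 6 shows the z-score option; the min-max alternative mentioned in the table can be sketched the same way. `minMaxNormalize` below is a hypothetical helper, not a Smile API, rescaling each feature column to [0, 1]:

```java
// Sketch of min-max normalization: rescale each feature column to [0, 1]
// so that no single high-magnitude feature dominates the distance.
public final class MinMaxScaler {

    public static double[][] minMaxNormalize(double[][] raw) {
        int n = raw.length;
        int d = raw[0].length;
        double[][] out = new double[n][d];
        for (int j = 0; j < d; j++) {
            // Find the per-column range
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                min = Math.min(min, raw[i][j]);
                max = Math.max(max, raw[i][j]);
            }
            double range = max - min;
            for (int i = 0; i < n; i++) {
                // Constant columns map to 0.0 to avoid division by zero
                out[i][j] = (range > 0) ? (raw[i][j] - min) / range : 0.0;
            }
        }
        return out;
    }
}
```

As with Pattern 6, the normalized array is then passed to the index constructor in place of the raw features; remember to apply the same transformation (with the training minima and maxima) to query points.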

Source Context

The data format requirements are enforced across all constructors in the smile.neighbor package:

  • KDTree.java (L119-147): Constructor validates key.length == data.length, infers dimensionality from keys[0].length.
  • LSH.java (L91-119): Constructor validates array sizes, enforces d >= 2 for dimensionality.
  • CoverTree.java (L227-271): Constructor validates keys.size() == data.size().
  • LinearSearch.java (L76-116): Constructor validates matching sizes of key and data lists.

Source: base/src/main/java/smile/neighbor/
