Implementation:Haifengl Smile Point Data Formatting

Overview

Point Data Formatting documents the expected data format that Smile's spatial index structures require. This is a Pattern Doc -- it describes how user application code should prepare and format data before passing it to the smile.neighbor package constructors, rather than documenting a single API method.

All Euclidean-space index structures (KDTree, LSH, MPLSH, RandomProjectionForest) require input data as double[][] where each double[] is a point with consistent dimensionality. General metric-space structures (CoverTree, BKTree, LinearSearch) accept arbitrary typed arrays with user-supplied distance functions.

Interface Specification

Primary Format: `double[][]`

The fundamental data format for Euclidean-space nearest neighbor search is a two-dimensional array of doubles:

double[][] data = new double[n][d];
// n = number of points
// d = number of dimensions (features)
// data[i] = the i-th point as a d-dimensional vector
// data[i][j] = the j-th coordinate of the i-th point

Invariants:

Every data[i] must be non-null.
Every data[i].length must equal data[0].length (uniform dimensionality).
The array must contain at least one point.

Data Sources

Data for nearest neighbor search can come from several sources within Smile and standard Java:

Source	Method	Output Type	Notes
Smile DataFrame	`DataFrame.toArray()`	`double[][]`	Converts numeric columns to matrix form
Manual construction	Direct array initialization	`double[][]`	For programmatic or synthetic data
CSV/file parsing	Various I/O utilities	`double[][]`	Parse into arrays after reading
Random generation	`MathEx.randn(n, d)`	`double[][]`	Generates n points in d dimensions from standard normal

Usage Patterns

Pattern 1: Manual Data Construction

import smile.neighbor.KDTree;

// Create 2D point data manually
double[][] points = {
    {1.0, 2.0},
    {3.0, 4.0},
    {5.0, 6.0},
    {7.0, 8.0}
};

// Points are ready for indexing -- each row is a 2D point
KDTree<double[]> tree = KDTree.of(points);

Pattern 2: Generating Synthetic Data

import smile.math.MathEx;
import smile.neighbor.KDTree;

// Generate 1000 random points in 10-dimensional space
double[][] data = MathEx.randn(1000, 10);

// Data is immediately usable as spatial index input
KDTree<double[]> tree = new KDTree<>(data, data);

Pattern 3: Separating Keys from Values

When the data objects are not the same as the keys (e.g., you have labeled data where the key is a feature vector and the value is a label or object):

import smile.neighbor.KDTree;

double[][] keys = new double[n][d]; // feature vectors
String[] labels = new String[n];    // associated labels

// Populate keys and labels from your data source
for (int i = 0; i < n; i++) {
    keys[i] = extractFeatures(rawData[i]);
    labels[i] = extractLabel(rawData[i]);
}

// KDTree stores keys for spatial indexing, labels as associated data
KDTree<String> tree = new KDTree<>(keys, labels);

Pattern 4: General Metric Space Data

For non-Euclidean metric spaces, data is formatted as typed arrays with an explicit distance function:

import smile.neighbor.CoverTree;
import smile.math.distance.EuclideanDistance;

double[][] points = loadData();

// CoverTree requires explicit Metric -- accepts double[][] via array wrapping
CoverTree<double[], double[]> tree = new CoverTree<>(
    points, points, new EuclideanDistance()
);

Pattern 5: String Data with Discrete Metrics

For discrete metric spaces such as edit distance on strings:

import smile.neighbor.BKTree;
import smile.math.distance.EditDistance;

String[] dictionary = {"hello", "help", "held", "hero", "helm"};

// BKTree for approximate string matching
BKTree<String, String> tree = BKTree.of(dictionary, new EditDistance(true));

Pattern 6: Feature Normalization Before Indexing

import smile.math.MathEx;
import smile.neighbor.KDTree;

double[][] raw = loadRawFeatures(); // features at different scales

// Normalize each feature to zero mean, unit variance
int n = raw.length;
int d = raw[0].length;
double[][] normalized = new double[n][d];

for (int j = 0; j < d; j++) {
    double mean = 0, std = 0;
    for (int i = 0; i < n; i++) mean += raw[i][j];
    mean /= n;
    for (int i = 0; i < n; i++) std += (raw[i][j] - mean) * (raw[i][j] - mean);
    std = Math.sqrt(std / n);
    for (int i = 0; i < n; i++) {
        normalized[i][j] = (std > 0) ? (raw[i][j] - mean) / std : 0.0;
    }
}

// Build index on normalized data
KDTree<double[]> tree = KDTree.of(normalized);

Common Mistakes

Mistake	Symptom	Fix
Ragged arrays (inconsistent dimensions)	`ArrayIndexOutOfBoundsException` during tree construction	Ensure all `data[i].length` are equal
Null entries in data array	`NullPointerException` during construction	Filter out null entries before indexing
Mismatched key/data array lengths	`IllegalArgumentException: "The array size of keys and data are different"`	Ensure `keys.length == data.length`
Unnormalized features with different scales	Poor neighbor quality; distance dominated by high-magnitude features	Apply min-max or z-score normalization
Very high dimensionality with KDTree	Performance degrades to brute-force scan	Use LSH, MPLSH, or RandomProjectionForest for d > 20

Source Context

The data format requirements are enforced across all constructors in the smile.neighbor package:

KDTree.java (L119-147): Constructor validates key.length == data.length, infers dimensionality from keys[0].length.
LSH.java (L91-119): Constructor validates array sizes, enforces d >= 2 for dimensionality.
CoverTree.java (L227-271): Constructor validates keys.size() == data.size().
LinearSearch.java (L76-116): Constructor validates matching sizes of key and data lists.

Source: base/src/main/java/smile/neighbor/

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment

Implementation:Haifengl Smile Point Data Formatting

Overview

Interface Specification

Primary Format: `double[][]`

Data Sources

Usage Patterns

Pattern 1: Manual Data Construction

Pattern 2: Generating Synthetic Data

Pattern 3: Separating Keys from Values

Pattern 4: General Metric Space Data

Pattern 5: String Data with Discrete Metrics

Pattern 6: Feature Normalization Before Indexing

Common Mistakes

Source Context

Related

Categories

Page Connections

Overview

Interface Specification

Primary Format: double[][]

Data Sources

Usage Patterns

Pattern 1: Manual Data Construction

Pattern 2: Generating Synthetic Data

Pattern 3: Separating Keys from Values

Pattern 4: General Metric Space Data

Pattern 5: String Data with Discrete Metrics

Pattern 6: Feature Normalization Before Indexing

Common Mistakes

Source Context

Related

Categories

Page Connections

Primary Format: `double[][]`