Implementation:Haifengl Smile Point Data Formatting
Overview
Point Data Formatting documents the expected data format that Smile's spatial index structures require. This is a Pattern Doc -- it describes how user application code should prepare and format data before passing it to the smile.neighbor package constructors, rather than documenting a single API method.
All Euclidean-space index structures (KDTree, LSH, MPLSH, RandomProjectionForest) require input data as double[][] where each double[] is a point with consistent dimensionality. General metric-space structures (CoverTree, BKTree, LinearSearch) accept arbitrary typed arrays with user-supplied distance functions.
Interface Specification
Primary Format: double[][]
The fundamental data format for Euclidean-space nearest neighbor search is a two-dimensional array of doubles:
double[][] data = new double[n][d];
// n = number of points
// d = number of dimensions (features)
// data[i] = the i-th point as a d-dimensional vector
// data[i][j] = the j-th coordinate of the i-th point
Invariants:
- Every
data[i]must be non-null. - Every
data[i].lengthmust equaldata[0].length(uniform dimensionality). - The array must contain at least one point.
Data Sources
Data for nearest neighbor search can come from several sources within Smile and standard Java:
| Source | Method | Output Type | Notes |
|---|---|---|---|
| Smile DataFrame | DataFrame.toArray() |
double[][] |
Converts numeric columns to matrix form |
| Manual construction | Direct array initialization | double[][] |
For programmatic or synthetic data |
| CSV/file parsing | Various I/O utilities | double[][] |
Parse into arrays after reading |
| Random generation | MathEx.randn(n, d) |
double[][] |
Generates n points in d dimensions from standard normal |
Usage Patterns
Pattern 1: Manual Data Construction
import smile.neighbor.KDTree;
// Create 2D point data manually
double[][] points = {
{1.0, 2.0},
{3.0, 4.0},
{5.0, 6.0},
{7.0, 8.0}
};
// Points are ready for indexing -- each row is a 2D point
KDTree<double[]> tree = KDTree.of(points);
Pattern 2: Generating Synthetic Data
import smile.math.MathEx;
import smile.neighbor.KDTree;
// Generate 1000 random points in 10-dimensional space
double[][] data = MathEx.randn(1000, 10);
// Data is immediately usable as spatial index input
KDTree<double[]> tree = new KDTree<>(data, data);
Pattern 3: Separating Keys from Values
When the data objects are not the same as the keys (e.g., you have labeled data where the key is a feature vector and the value is a label or object):
import smile.neighbor.KDTree;
double[][] keys = new double[n][d]; // feature vectors
String[] labels = new String[n]; // associated labels
// Populate keys and labels from your data source
for (int i = 0; i < n; i++) {
keys[i] = extractFeatures(rawData[i]);
labels[i] = extractLabel(rawData[i]);
}
// KDTree stores keys for spatial indexing, labels as associated data
KDTree<String> tree = new KDTree<>(keys, labels);
Pattern 4: General Metric Space Data
For non-Euclidean metric spaces, data is formatted as typed arrays with an explicit distance function:
import smile.neighbor.CoverTree;
import smile.math.distance.EuclideanDistance;
double[][] points = loadData();
// CoverTree requires explicit Metric -- accepts double[][] via array wrapping
CoverTree<double[], double[]> tree = new CoverTree<>(
points, points, new EuclideanDistance()
);
Pattern 5: String Data with Discrete Metrics
For discrete metric spaces such as edit distance on strings:
import smile.neighbor.BKTree;
import smile.math.distance.EditDistance;
String[] dictionary = {"hello", "help", "held", "hero", "helm"};
// BKTree for approximate string matching
BKTree<String, String> tree = BKTree.of(dictionary, new EditDistance(true));
Pattern 6: Feature Normalization Before Indexing
import smile.math.MathEx;
import smile.neighbor.KDTree;
double[][] raw = loadRawFeatures(); // features at different scales
// Normalize each feature to zero mean, unit variance
int n = raw.length;
int d = raw[0].length;
double[][] normalized = new double[n][d];
for (int j = 0; j < d; j++) {
double mean = 0, std = 0;
for (int i = 0; i < n; i++) mean += raw[i][j];
mean /= n;
for (int i = 0; i < n; i++) std += (raw[i][j] - mean) * (raw[i][j] - mean);
std = Math.sqrt(std / n);
for (int i = 0; i < n; i++) {
normalized[i][j] = (std > 0) ? (raw[i][j] - mean) / std : 0.0;
}
}
// Build index on normalized data
KDTree<double[]> tree = KDTree.of(normalized);
Common Mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| Ragged arrays (inconsistent dimensions) | ArrayIndexOutOfBoundsException during tree construction |
Ensure all data[i].length are equal
|
| Null entries in data array | NullPointerException during construction |
Filter out null entries before indexing |
| Mismatched key/data array lengths | IllegalArgumentException: "The array size of keys and data are different" |
Ensure keys.length == data.length
|
| Unnormalized features with different scales | Poor neighbor quality; distance dominated by high-magnitude features | Apply min-max or z-score normalization |
| Very high dimensionality with KDTree | Performance degrades to brute-force scan | Use LSH, MPLSH, or RandomProjectionForest for d > 20 |
Source Context
The data format requirements are enforced across all constructors in the smile.neighbor package:
- KDTree.java (L119-147): Constructor validates
key.length == data.length, infers dimensionality fromkeys[0].length. - LSH.java (L91-119): Constructor validates array sizes, enforces
d >= 2for dimensionality. - CoverTree.java (L227-271): Constructor validates
keys.size() == data.size(). - LinearSearch.java (L76-116): Constructor validates matching sizes of key and data lists.
Source: base/src/main/java/smile/neighbor/