Principle:Lance format Lance Vector Index Building
| Knowledge Sources | |
|---|---|
| Domains | Vector_Search, Indexing |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Vector Index Building is the process of training and materializing a vector index structure over a dataset's embedding column, enabling efficient approximate nearest neighbor (ANN) search.
Description
Once a dataset has vector columns and an index configuration has been chosen, the next step is to build (train and write) the index. This involves:
- Training -- learning the index's internal structures from the data. For IVF, this means running k-means to find centroids. For PQ, it means learning codebooks. For HNSW, it means building the proximity graph.
- Encoding -- transforming all vectors in the dataset through the trained stages (assigning to partitions, quantizing, inserting into graphs).
- Writing -- persisting the index structures to storage alongside the dataset.
- Committing -- atomically updating the dataset manifest to reference the new index.
Lance uses a builder pattern for index creation, which separates configuration from execution. The builder accepts optional settings such as a custom index name, replacement behavior, selective fragment indexing (for distributed builds), and injection of pre-trained data.
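The separation of configuration from execution can be illustrated with a small sketch. This is not Lance's builder: the class names, method names, and the default-name rule are hypothetical and chosen only to mirror the optional settings listed above.

```python
from dataclasses import dataclass, replace as copy_with
from typing import List, Optional


@dataclass(frozen=True)
class IndexBuildRequest:
    """Plain description of what to build (hypothetical type, not Lance's)."""
    column: str
    index_type: str
    name: Optional[str] = None                 # custom index name
    replace: bool = False                      # overwrite an existing index
    fragment_ids: Optional[List[int]] = None   # None means "all fragments"
    train: bool = True                         # False requests an empty index


class IndexBuilder:
    """Collects settings; nothing is trained or written until execution."""

    def __init__(self, column: str, index_type: str):
        self._req = IndexBuildRequest(column, index_type)

    def name(self, name: str) -> "IndexBuilder":
        self._req = copy_with(self._req, name=name)
        return self

    def replace(self, flag: bool) -> "IndexBuilder":
        self._req = copy_with(self._req, replace=flag)
        return self

    def fragments(self, ids) -> "IndexBuilder":
        self._req = copy_with(self._req, fragment_ids=list(ids))
        return self

    def train(self, flag: bool) -> "IndexBuilder":
        self._req = copy_with(self._req, train=flag)
        return self

    def build_request(self) -> IndexBuildRequest:
        # Assumed default: derive the name from the column when none was given.
        name = self._req.name or f"{self._req.column}_idx"
        return copy_with(self._req, name=name)


# Configure first; an executor would consume the request later.
req = IndexBuilder("vector", "IVF_PQ").replace(True).fragments([0, 2]).build_request()
```

Because the request is an immutable value, it can be validated, logged, or shipped to a worker before any expensive training starts.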
The index build process is versioned -- each index creation produces a new dataset version. This ensures that concurrent readers continue to see the previous version while the index is being built.
Usage
Build a vector index when:
- You have added vector columns to a dataset and want to enable fast ANN search.
- You need to rebuild an index after significant data changes (inserts, updates, or deletes).
- You want to replace an existing index with different parameters (e.g., more partitions for a larger dataset).
- You are performing distributed indexing across multiple workers, each handling a subset of fragments.
Theoretical Basis
Index Build Pipeline
The build process follows the configured stages in order:
Raw Vectors --> [IVF Training] --> [Graph/Quantizer Training] --> [Encoding] --> [Write to Storage]
For a typical IVF_HNSW_PQ index:
- IVF Training: Sample sample_rate * num_partitions vectors from the dataset. Run k-means for up to max_iters iterations to learn num_partitions centroids.
- HNSW Construction: Within each IVF partition, build a multi-layer proximity graph. Each vector is inserted by finding its approximate nearest neighbors at each level and establishing bidirectional edges (up to m per node).
- PQ Training: Sample vectors to learn num_sub_vectors codebooks, each with 2^num_bits entries.
- PQ Encoding: Encode every vector in the dataset by replacing each sub-vector with its nearest codebook entry index.
- Write: Persist centroids, graph adjacency lists, codebooks, and encoded vectors.
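The IVF and PQ stages above can be sketched in pure Python. This is a toy illustration, not Lance's implementation: the function names are hypothetical, HNSW construction and the write step are omitted, PQ is trained on all vectors rather than a sample, and the vector dimension is assumed to be divisible by num_sub_vectors.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


def kmeans(vectors, k, max_iters, rng):
    """Plain Lloyd's k-means; stops early once centroids no longer move."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(centroids, v)].append(v)
        updated = [
            [sum(col) / len(b) for col in zip(*b)] if b else c  # keep empty clusters as-is
            for b, c in zip(buckets, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def train_ivf(vectors, num_partitions, sample_rate, max_iters, rng):
    # Train on sample_rate * num_partitions vectors, capped at the dataset size.
    n = min(len(vectors), sample_rate * num_partitions)
    return kmeans(rng.sample(vectors, n), num_partitions, max_iters, rng)


def train_pq(vectors, num_sub_vectors, num_bits, max_iters, rng):
    # One codebook per sub-vector slice, each with 2^num_bits entries.
    width = len(vectors[0]) // num_sub_vectors
    return [
        kmeans([v[s * width:(s + 1) * width] for v in vectors],
               2 ** num_bits, max_iters, rng)
        for s in range(num_sub_vectors)
    ]


def encode_pq(v, codebooks):
    # Replace each sub-vector with the index of its nearest codebook entry.
    width = len(v) // len(codebooks)
    return [nearest(cb, v[s * width:(s + 1) * width])
            for s, cb in enumerate(codebooks)]


rng = random.Random(0)
# Two well-separated synthetic clusters of 4-dimensional vectors.
vectors = [[rng.gauss(c, 0.1) for _ in range(4)]
           for c in (0.0, 5.0) for _ in range(50)]

centroids = train_ivf(vectors, num_partitions=2, sample_rate=20, max_iters=50, rng=rng)
partition_ids = [nearest(centroids, v) for v in vectors]

codebooks = train_pq(vectors, num_sub_vectors=2, num_bits=1, max_iters=50, rng=rng)
codes = [encode_pq(v, codebooks) for v in vectors]
```

With these toy settings, IVF training uses 20 * 2 = 40 sampled vectors, and each vector is stored as 2 codes of 1 bit each instead of 4 floats.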
Atomicity and Versioning
The build has two phases:
- execute_uncommitted -- performs all training and writing, producing an IndexMetadata with a UUID, field references, and fragment bitmap.
- commit -- wraps the metadata in a CreateIndex transaction and atomically applies it to the dataset.
This two-phase design enables distributed indexing: multiple workers can each call execute_uncommitted on different fragment subsets, then a coordinator merges and commits the results.
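The two-phase flow can be modeled with immutable values. The sketch below borrows the names execute_uncommitted, commit, and IndexMetadata from the text, but the Manifest type, the merge function, and all signatures are assumptions for illustration. How Lance actually merges partial results is not described here, so the merge simply unions the fragment sets.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    uuid: str
    name: str
    fields: tuple               # referenced field ids
    fragment_bitmap: frozenset  # fragments covered by this index


@dataclass(frozen=True)
class Manifest:
    version: int
    indices: tuple = ()


def execute_uncommitted(name, field_id, fragment_ids):
    """Phase 1: a real build would train and write index files here.
    The dataset manifest is not touched."""
    return IndexMetadata(str(uuid.uuid4()), name, (field_id,), frozenset(fragment_ids))


def merge(parts):
    """Coordinator step (assumed): combine per-worker results into one entry."""
    covered = frozenset().union(*(p.fragment_bitmap for p in parts))
    return IndexMetadata(str(uuid.uuid4()), parts[0].name, parts[0].fields, covered)


def commit(manifest, meta):
    """Phase 2: produce a new manifest version that references the index.
    The old manifest object is left unchanged for concurrent readers."""
    kept = tuple(i for i in manifest.indices if i.name != meta.name)
    return Manifest(manifest.version + 1, kept + (meta,))


v1 = Manifest(version=1)
# Two workers index disjoint fragment subsets independently.
parts = [
    execute_uncommitted("vector_idx", field_id=0, fragment_ids=[0, 1]),
    execute_uncommitted("vector_idx", field_id=0, fragment_ids=[2, 3]),
]
v2 = commit(v1, merge(parts))
```

A reader holding v1 still sees no index, while v2 references a single index covering fragments 0 through 3. This is the same property that lets readers keep using the previous dataset version during a build.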
Empty Index Creation
When train is set to false (or the dataset is empty), Lance creates an empty index skeleton. This is useful for pre-configuring the index before data arrives, as the index can be populated incrementally later.