Principle:Lance format Lance Vector Index Building

Knowledge Sources	Lance, Lance Docs
Domains	Vector_Search, Indexing
Last Updated	2026-02-08 19:00 GMT

Overview

Vector Index Building is the process of training and materializing a vector index structure over a dataset's embedding column, enabling efficient approximate nearest neighbor (ANN) search.

Description

Once a dataset has a vector column and an index configuration has been chosen, the next step is to build the index, that is, to train it and write it to storage. Building involves four steps:

Training -- learning the index's internal structures from the data. For IVF, this means running k-means to find centroids. For PQ, it means learning codebooks. For HNSW, it means building the proximity graph.
Encoding -- transforming all vectors in the dataset through the trained stages (assigning to partitions, quantizing, inserting into graphs).
Writing -- persisting the index structures to storage alongside the dataset.
Committing -- atomically updating the dataset manifest to reference the new index.

Lance uses a builder pattern for index creation, which separates configuration from execution. The builder accepts optional settings such as a custom index name, replacement behavior, selective fragment indexing (for distributed builds), and pre-trained data injection.
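
A minimal Python sketch of this builder pattern follows. ToyIndexBuilder and its with_* methods are hypothetical names invented for illustration and are not the Lance API. The default name of column plus "_idx" is also an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyIndexBuilder:
    """Hypothetical builder (not the Lance API): collect options, then build."""

    column: str
    index_type: str
    name: Optional[str] = None             # custom index name, if any
    replace: bool = False                  # replace an existing index of the same name
    fragments: Optional[List[int]] = None  # None means "index every fragment"

    def with_name(self, name: str) -> "ToyIndexBuilder":
        self.name = name
        return self

    def with_replace(self, replace: bool) -> "ToyIndexBuilder":
        self.replace = replace
        return self

    def with_fragments(self, fragment_ids: List[int]) -> "ToyIndexBuilder":
        self.fragments = list(fragment_ids)
        return self

    def build(self) -> dict:
        # Execution is deferred until here; the setters above only record options.
        return {
            "column": self.column,
            "index_type": self.index_type,
            "name": self.name or f"{self.column}_idx",  # assumed default name
            "replace": self.replace,
            "fragments": self.fragments,
        }


# Configuration is chained; nothing runs until build() is called.
config = (
    ToyIndexBuilder("embedding", "IVF_PQ")
    .with_replace(True)
    .with_fragments([0, 1])
    .build()
)
```

Because each setter returns the builder itself, a caller sets only the options it needs, and every unset option keeps its default.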

Index builds are versioned: committing a new index produces a new dataset version. Readers that opened an earlier version continue to see that version while the index is being built and committed.

Usage

Build a vector index when:

You have added vector columns to a dataset and want to enable fast ANN search.
You need to rebuild an index after significant data changes (inserts, updates, or deletes).
You want to replace an existing index with different parameters (e.g., more partitions for a larger dataset).
You are performing distributed indexing across multiple workers, each handling a subset of fragments.

Theoretical Basis

Index Build Pipeline

The build process follows the configured stages in order:

Raw Vectors --> [IVF Training] --> [Graph/Quantizer Training] --> [Encoding] --> [Write to Storage]

For a typical IVF_HNSW_PQ index:

IVF Training: Sample sample_rate * num_partitions vectors from the dataset. Run k-means for up to max_iters iterations to learn num_partitions centroids.
HNSW Construction: Within each IVF partition, build a multi-layer proximity graph. Each vector is inserted by finding its approximate nearest neighbors at each level and establishing bidirectional edges (up to m per node).
PQ Training: Sample vectors and split each into num_sub_vectors sub-vectors. Learn one codebook per sub-vector position, each with 2^num_bits entries (256 entries when num_bits is 8).
PQ Encoding: Encode every vector in the dataset by replacing each sub-vector with its nearest codebook entry index.
Write: Persist centroids, graph adjacency lists, codebooks, and encoded vectors.
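
The IVF training, PQ training, and PQ encoding steps above can be sketched in plain Python. This is an illustration under simplifying assumptions, not Lance's implementation: all function names are hypothetical, HNSW construction and the write step are omitted, and PQ is trained directly on the sampled vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))


def kmeans(vectors, k, max_iters, rng):
    """Plain Lloyd's k-means; stops early once the centroids stop moving."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def train_pq(vectors, num_sub_vectors, num_bits, max_iters, rng):
    """Learn one codebook of 2**num_bits entries per sub-vector position."""
    sub_dim = len(vectors[0]) // num_sub_vectors
    codebooks = []
    for s in range(num_sub_vectors):
        subs = [v[s * sub_dim:(s + 1) * sub_dim] for v in vectors]
        codebooks.append(kmeans(subs, 2 ** num_bits, max_iters, rng))
    return codebooks


def pq_encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest codebook entry."""
    sub_dim = len(vec) // len(codebooks)
    return [
        nearest(vec[s * sub_dim:(s + 1) * sub_dim], cb)
        for s, cb in enumerate(codebooks)
    ]


rng = random.Random(0)
data = [[rng.random() for _ in range(8)] for _ in range(200)]

# IVF training: sample sample_rate * num_partitions vectors, then run k-means.
num_partitions, sample_rate = 4, 25
sample = rng.sample(data, min(len(data), sample_rate * num_partitions))
ivf_centroids = kmeans(sample, num_partitions, max_iters=20, rng=rng)

# Encoding: assign every vector to a partition and quantize it.
partition_ids = [nearest(v, ivf_centroids) for v in data]
codebooks = train_pq(sample, num_sub_vectors=4, num_bits=2, max_iters=20, rng=rng)
codes = [pq_encode(v, codebooks) for v in data]
```

With these toy settings each 8-dimensional vector is reduced to four 2-bit codes plus a partition id, which is what makes the stored index much smaller than the raw vectors.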

Atomicity and Versioning

The build has two phases:

execute_uncommitted -- performs all training and writing, producing an IndexMetadata with a UUID, field references, and fragment bitmap.
commit -- wraps the metadata in a CreateIndex transaction and atomically applies it to the dataset.

This two-phase design enables distributed indexing: multiple workers can each call execute_uncommitted on different fragment subsets, then a coordinator merges and commits the results.
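
The two-phase flow can be modeled with a toy in-memory dataset, sketched below in Python. Only the names execute_uncommitted and commit come from the description above. ToyDataset, ToyIndexMetadata, and merge are hypothetical, and the merge rule (take the union of fragment bitmaps under the first worker's UUID) is an assumption for illustration rather than Lance's actual merge logic.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass(frozen=True)
class ToyIndexMetadata:
    """Hypothetical stand-in for the metadata an uncommitted build returns."""

    index_uuid: str
    name: str
    fields: tuple
    fragment_bitmap: frozenset  # ids of the fragments this index covers


@dataclass
class ToyDataset:
    """Toy dataset: maps each version number to its committed indices."""

    version: int = 1
    manifests: dict = field(default_factory=lambda: {1: ()})

    def execute_uncommitted(self, name, fields, fragment_ids):
        # Phase 1: training and writing would happen here. The manifest is
        # not touched, so no reader can observe the new index yet.
        return ToyIndexMetadata(
            str(uuid4()), name, tuple(fields), frozenset(fragment_ids)
        )

    def commit(self, metadata):
        # Phase 2: publish a new version whose manifest references the index.
        new_version = self.version + 1
        self.manifests[new_version] = self.manifests[self.version] + (metadata,)
        self.version = new_version
        return new_version

    def indices(self, version):
        return self.manifests[version]


def merge(parts):
    """Coordinator step (assumed rule): union the workers' fragment bitmaps."""
    first = parts[0]
    covered = frozenset().union(*(p.fragment_bitmap for p in parts))
    return ToyIndexMetadata(first.index_uuid, first.name, first.fields, covered)


ds = ToyDataset()
reader_version = ds.version  # a reader pinned to the pre-index version

# Two workers each build over a different fragment subset.
parts = [
    ds.execute_uncommitted("vec_idx", ["vector"], ids)
    for ids in ([0, 1], [2, 3])
]
new_version = ds.commit(merge(parts))
```

In this model the reader pinned to reader_version still sees no index after the commit, while new_version lists a single index covering fragments 0 through 3.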

Empty Index Creation

When train is set to false (or the dataset is empty), Lance creates an empty index skeleton. This is useful for pre-configuring the index before data arrives, as the index can be populated incrementally later.

Related Pages

Implemented By

Implementation:Lance_format_Lance_CreateIndexBuilder

Uses Heuristic

Heuristic:Lance_format_Lance_Vector_Index_Tuning
