Principle:Neuml Txtai Index Update
| Knowledge Sources | |
|---|---|
| Domains | Semantic_Search, NLP, Information_Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Index update (upsert) is the process of incrementally adding new documents to or modifying existing documents within a vector search index without performing a complete rebuild.
Description
In production environments, document collections are rarely static. New content arrives continuously, existing documents are revised, and outdated entries must be replaced. A full rebuild of the vector index for every change would be prohibitively expensive. Incremental index update (also known as upsert) solves this problem by appending new vectors to the existing ANN structure and updating the associated metadata stores in place.
The upsert operation follows insert-or-update semantics. When a document with a given identifier already exists in the index, the old entry is first removed from the ANN index, the document database, the scoring index, and the graph. The new version of the document is then vectorized and appended. When the identifier is new, the document is simply added. The index therefore always reflects the latest state of each document, without the computational cost of rebuilding the entire vector index.
For very large incremental updates, the operation supports checkpointing. A checkpoint saves the progress of the update to disk at regular intervals, allowing the operation to be resumed from the last checkpoint if it is interrupted. This is critical for production systems where reliability is paramount and data volumes may require hours-long update operations.
Usage
Use the index update operation when documents need to be added to or modified in an existing index. This is the preferred method for maintaining live search indexes that receive continuous updates. Use the full index operation instead when building a new index from scratch or when the majority of the corpus has changed.
Theoretical Basis
1. Upsert Semantics: The upsert operation U on an index I with a document set D_new is defined as:
U(I, D_new) = I' such that:
- for each d in D_new: if id(d) in I, then replace(I, d); else insert(I, d)
- for each d in I whose id(d) does not appear in D_new: d remains unchanged in I'
This provides last-write-wins semantics for existing documents and append behavior for new ones.
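The definition above can be illustrated with a minimal sketch that models the index as a plain dictionary keyed by document id. The function name `upsert` and the `(id, text)` tuple shape are assumptions made for illustration; this is not txtai's implementation, which also has to update vectors and other components.

```python
from typing import Dict, Iterable, Tuple


def upsert(index: Dict[str, str], documents: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return a new index with each document inserted or replaced by id."""
    updated = dict(index)  # entries whose ids are not in `documents` carry over unchanged
    for doc_id, text in documents:
        # Replace when the id exists, insert otherwise. A later duplicate id
        # in the same batch overwrites an earlier one (last write wins).
        updated[doc_id] = text
    return updated


index = {"a": "first draft", "b": "unrelated"}
result = upsert(index, [("a", "revised"), ("c", "new"), ("a", "final")])
print(result)  # {'a': 'final', 'b': 'unrelated', 'c': 'new'}
```

Document "b" is untouched, "c" is appended, and the second write to "a" wins over the first.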
2. ANN Append: Unlike index construction, which builds the ANN structure from scratch with global optimization, the append operation adds vectors incrementally:
ANN.append(V_new) such that ANN' = ANN union V_new
This is generally faster than a full rebuild but may gradually degrade search quality because the data structure was not optimized for the new distribution of vectors. Periodic full rebuilds (reindexing) are recommended to restore optimal search quality.
3. Deletion Before Insertion: For documents with existing IDs, the update performs a logical deletion followed by insertion:
update(I, d) = insert(delete(I, id(d)), d)
The deletion marks the old vector as invalid (either by tombstone or physical removal depending on the ANN backend), and the insertion appends the new vector at a fresh offset.
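A toy model may make the append and delete-then-insert steps concrete. The class below is a hypothetical stand-in for an ANN backend: it stores vectors in an append-only list, marks deleted slots with `None` as a tombstone, and searches by brute force. Real backends differ in how they delete and search, so treat this only as a sketch of the offset bookkeeping.

```python
from typing import Dict, List, Optional, Tuple


class AppendOnlyVectors:
    """Toy vector store supporting append and tombstone deletion."""

    def __init__(self) -> None:
        self.vectors: List[Optional[List[float]]] = []  # None marks a tombstone
        self.offsets: Dict[str, int] = {}               # document id -> live offset

    def delete(self, doc_id: str) -> None:
        offset = self.offsets.pop(doc_id, None)
        if offset is not None:
            self.vectors[offset] = None  # logical delete: the slot stays, the vector is invalid

    def append(self, doc_id: str, vector: List[float]) -> int:
        self.vectors.append(vector)
        self.offsets[doc_id] = len(self.vectors) - 1
        return self.offsets[doc_id]

    def update(self, doc_id: str, vector: List[float]) -> int:
        # update(I, d) = insert(delete(I, id(d)), d)
        self.delete(doc_id)
        return self.append(doc_id, vector)

    def search(self, query: List[float], limit: int = 3) -> List[Tuple[str, float]]:
        # Brute-force dot product over live (non-tombstoned) vectors only
        ids = {offset: doc_id for doc_id, offset in self.offsets.items()}
        scores = [
            (ids[i], sum(q * v for q, v in zip(query, vector)))
            for i, vector in enumerate(self.vectors)
            if vector is not None
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]


store = AppendOnlyVectors()
store.update("doc1", [1.0, 0.0])
store.update("doc2", [0.0, 1.0])
new_offset = store.update("doc1", [0.6, 0.8])  # old slot 0 is tombstoned, new vector lands at offset 2
```

The tombstoned slot is never reclaimed in this sketch, which mirrors why append-heavy indexes tend to benefit from an occasional full rebuild.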
4. Checkpoint Recovery: For long-running updates, progress is periodically persisted:
checkpoint(progress, path) at interval t
If the operation fails at step n, it can be resumed from the last checkpoint c <= n, avoiding redundant work:
resume(U, checkpoint_c) = U starting from step c
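The recovery rule can be sketched with a small loop that persists the number of completed steps to a JSON file. The function `run_with_checkpoint`, the file layout, and the `interval` parameter are all hypothetical and chosen for illustration; they do not describe the checkpoint format txtai uses.

```python
import json
import os
import tempfile
from typing import Callable, List


def run_with_checkpoint(items: List[str], process: Callable[[str], None],
                        path: str, interval: int = 2) -> int:
    """Process items in order, saving progress every `interval` steps.

    Returns the step the run started from (0 when no checkpoint exists).
    """
    start = 0
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            start = json.load(handle)["completed"]

    for step in range(start, len(items)):
        process(items[step])
        completed = step + 1
        if completed % interval == 0 or completed == len(items):
            # Write to a temporary file, then rename, so an interruption
            # cannot leave a half-written checkpoint behind.
            temporary = path + ".tmp"
            with open(temporary, "w", encoding="utf-8") as handle:
                json.dump({"completed": completed}, handle)
            os.replace(temporary, path)
    return start


processed: List[str] = []
fail_on = {"d"}


def process(item: str) -> None:
    if item in fail_on:
        raise RuntimeError(f"simulated failure on {item}")
    processed.append(item)


items = ["a", "b", "c", "d", "e"]
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "checkpoint.json")
    try:
        run_with_checkpoint(items, process, path)  # fails on "d"; last checkpoint is step 2
    except RuntimeError:
        pass
    fail_on.clear()
    resumed_from = run_with_checkpoint(items, process, path)  # restarts at step 2
```

In this sketch item "c" is processed twice, because it completed after the last checkpoint but before the failure. Resuming from c <= n therefore gives at-least-once processing, which is safe for upserts because repeating an upsert with the same document leaves the index in the same state.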
5. Consistency Across Components: The upsert must maintain consistency across all index components (ANN, database, scoring, graph, subindexes). For each document, the delete and insert should be applied to every component as a single unit, so that a search sees either the old version or the new version of a document, never an inconsistent mixture. Whether a search that runs concurrently with an upsert is shielded from partial updates depends on how access to the index is synchronized.
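One way to obtain the old-or-new guarantee is to guard all components with a single lock, so readers never observe a half-applied update. The class below is a hypothetical sketch with three dictionary-backed components; the single-lock design is an assumption made for illustration, not a description of how txtai synchronizes access.

```python
import threading
from typing import Dict, List, Optional, Set


class MultiComponentIndex:
    """Toy index whose components are updated together under one lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.vectors: Dict[str, List[float]] = {}  # stands in for the ANN index
        self.database: Dict[str, str] = {}         # stands in for the document database
        self.terms: Dict[str, Set[str]] = {}       # stands in for the scoring index

    def upsert(self, doc_id: str, text: str, vector: List[float]) -> None:
        with self._lock:
            # Remove the old entry from every component first
            for component in (self.vectors, self.database, self.terms):
                component.pop(doc_id, None)
            # Then add the new version to every component
            self.vectors[doc_id] = vector
            self.database[doc_id] = text
            self.terms[doc_id] = set(text.lower().split())

    def get(self, doc_id: str) -> Optional[dict]:
        # Readers take the same lock, so they see all components before or
        # after an upsert, never in between.
        with self._lock:
            if doc_id not in self.database:
                return None
            return {
                "text": self.database[doc_id],
                "vector": self.vectors[doc_id],
                "terms": self.terms[doc_id],
            }


index = MultiComponentIndex()
index.upsert("doc1", "old text", [1.0, 0.0])
index.upsert("doc1", "new text", [0.0, 1.0])
snapshot = index.get("doc1")
```

A single lock is the simplest choice and serializes reads behind writes; systems that need concurrent reads during long updates typically trade this for a more involved scheme.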