Implementation:Neuml Txtai IVFSparse ANN
| Knowledge Sources | |
|---|---|
| Domains | Vector_Search, Sparse_Retrieval |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
IVFSparse is an inverted file index for sparse vectors that uses k-means clustering to partition data into blocks for approximate nearest neighbor search.
Description
The IVFSparse class inherits from ANN and implements an inverted file (IVF) index specifically designed for sparse vector data (scipy CSR matrices). It is modeled after the Faiss IVF index and supports many of the same parameters. During indexing, the class uses MiniBatchKMeans to cluster embeddings into partitions, then stores block-max summary vectors as centroids. At search time, it identifies the top candidate clusters via centroid similarity and scans only those blocks, significantly reducing search time for large collections.
The class supports exact search (single cluster mode for small datasets under 5000 elements) and approximate search with configurable nlist (number of clusters) and nprobe (number of clusters to scan at query time). Serialization uses msgpack for metadata and the SparseArray utility for sparse matrix persistence.
Usage
Use IVFSparse when you have sparse vector representations (such as TF-IDF, BM25, or learned sparse embeddings) and need approximate nearest neighbor search. It is the primary sparse ANN backend in txtai and is well-suited for hybrid retrieval systems that combine sparse and dense signals. For collections under 5000 elements, it automatically falls back to exact search.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/ann/sparse/ivfsparse.py
- Lines: 1-377
Signature
class IVFSparse(ANN):
"""
Inverted file (IVF) index with flat vector file storage and sparse array support.
IVFSparse builds an IVF index and enables approximate nearest neighbor (ANN) search.
This index is modeled after Faiss and supports many of the same parameters.
"""
def __init__(self, config):
super().__init__(config)
if not IVFSPARSE:
raise ImportError('IVFSparse is not available - install "ann" extra to enable')
# Cluster centroids, if computed
self.centroids = None
# Cluster id mapping
self.ids = None
# Cluster data blocks - can be a single block with no computed centroids
self.blocks = None
# Deleted ids
self.deletes = None
Import
from txtai.ann.sparse import IVFSparse
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Index configuration dictionary containing backend settings and optional IVF-specific keys such as nlist (int), nprobe (int), sample (float), minpoints (int, default 39), and nfeatures (int, default 25)
|
Outputs
| Name | Type | Description |
|---|---|---|
| self.centroids | scipy.sparse.csr_matrix or None | Block-max summary vectors used for cluster selection during search; None when exact search is used |
| self.ids | dict | Mapping of cluster id to list of element ids within that cluster |
| self.blocks | dict | Mapping of cluster id to sparse matrix of embeddings for that cluster |
| self.deletes | list | List of deleted element ids |
Key Methods
index(self, embeddings)
Builds the IVF index from sparse embeddings. Optionally samples a subset for training. Computes k-means clusters (unless the dataset is small enough for exact search), aggregates embeddings into cluster blocks, prunes small clusters below minpoints, and computes block-max centroid vectors. Records the cluster count in build metadata.
append(self, embeddings)
Assigns new embeddings to existing clusters via centroid similarity, extends the cluster id lists and data blocks accordingly, and updates the offset.
delete(self, ids)
Appends ids to the deletes list. Deleted ids are filtered out during search and excluded from the count.
search(self, queries, limit)
Performs batched search using a thread pool. For approximate mode, selects the top nprobe clusters via centroid similarity, then scans the selected blocks in parallel using safe_sparse_dot. Returns [(id, score)] per query.
count(self)
Returns the total number of elements minus the number of deleted ids.
load(self, path)
Loads the IVF index from a binary file containing msgpack-serialized headers, cluster ids, deletes, and SparseArray-encoded centroid and block data.
save(self, path)
Writes the IVF index to a binary file using the msgpack/SparseArray format with header, centroids, cluster id mapping, data blocks, and deletes.
build(self, train, clusters)
Runs MiniBatchKMeans clustering on the training data, finds the closest real data points to each cluster center, filters duplicate centroids, and returns the unique centroid vectors.
aggregate(self, data)
Assigns each embedding to its nearest centroid using L2 distance. Returns a dictionary mapping cluster id to the list of element indices assigned to that cluster.
topn(self, queries, data, limit, deletes=None)
Computes dot product similarity between queries and data, zeros out deleted rows, and returns the top-n indices and scores using argpartition.
scan(self, query, limit, blockids)
Scans the specified blocks for a single query. Stacks block data, computes similarity, filters deletes, and maps internal indices back to global element ids.
Usage Examples
Basic Usage
import numpy as np
from scipy.sparse import random as sparse_random
from txtai.ann.sparse import IVFSparse
# Configuration for IVFSparse backend
config = {
"backend": "ivfsparse",
"ivfsparse": {
"nlist": 16,
"nprobe": 4,
"minpoints": 39,
"sample": 0.5
}
}
# Create sparse embeddings (10000 docs, 30000 vocab, 0.1% density)
embeddings = sparse_random(10000, 30000, density=0.001, format="csr", dtype=np.float32)
# Build the index
ann = IVFSparse(config)
ann.index(embeddings)
# Search with sparse query vectors
queries = sparse_random(2, 30000, density=0.005, format="csr", dtype=np.float32)
results = ann.search(queries, limit=10)
# results: [[(id, score), ...], [(id, score), ...]]
# Save and reload
ann.save("/tmp/ivfsparse.idx")
ann2 = IVFSparse(config)
ann2.load("/tmp/ivfsparse.idx")
print(ann2.count()) # 10000
ann2.close()