Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Neuml Txtai IVFSparse ANN

From Leeroopedia
Revision as of 16:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Neuml_Txtai_IVFSparse_ANN.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Vector_Search, Sparse_Retrieval
Last Updated 2026-02-09 17:00 GMT

Overview

IVFSparse is an inverted file index for sparse vectors that uses k-means clustering to partition data into blocks for approximate nearest neighbor search.

Description

The IVFSparse class inherits from ANN and implements an inverted file (IVF) index specifically designed for sparse vector data (scipy CSR matrices). It is modeled after the Faiss IVF index and supports many of the same parameters. During indexing, the class uses MiniBatchKMeans to cluster embeddings into partitions, then stores block-max summary vectors as centroids. At search time, it identifies the top candidate clusters via centroid similarity and scans only those blocks, significantly reducing search time for large collections.

The class supports exact search (single cluster mode for small datasets under 5000 elements) and approximate search with configurable nlist (number of clusters) and nprobe (number of clusters to scan at query time). Serialization uses msgpack for metadata and the SparseArray utility for sparse matrix persistence.

Usage

Use IVFSparse when you have sparse vector representations (such as TF-IDF, BM25, or learned sparse embeddings) and need approximate nearest neighbor search. It is the primary sparse ANN backend in txtai and is well-suited for hybrid retrieval systems that combine sparse and dense signals. For collections under 5000 elements, it automatically falls back to exact search.

Code Reference

Source Location

Signature

class IVFSparse(ANN):
    """
    Inverted file (IVF) index with flat vector file storage and sparse array support.

    IVFSparse builds an IVF index and enables approximate nearest neighbor (ANN) search.

    This index is modeled after Faiss and supports many of the same parameters.
    """

    def __init__(self, config):
        super().__init__(config)

        if not IVFSPARSE:
            raise ImportError('IVFSparse is not available - install "ann" extra to enable')

        # Cluster centroids, if computed
        self.centroids = None

        # Cluster id mapping
        self.ids = None

        # Cluster data blocks - can be a single block with no computed centroids
        self.blocks = None

        # Deleted ids
        self.deletes = None

Import

from txtai.ann.sparse import IVFSparse

I/O Contract

Inputs

Name Type Required Description
config dict Yes Index configuration dictionary containing backend settings and optional IVF-specific keys such as nlist (int), nprobe (int), sample (float), minpoints (int, default 39), and nfeatures (int, default 25)

Outputs

Name Type Description
self.centroids scipy.sparse.csr_matrix or None Block-max summary vectors used for cluster selection during search; None when exact search is used
self.ids dict Mapping of cluster id to list of element ids within that cluster
self.blocks dict Mapping of cluster id to sparse matrix of embeddings for that cluster
self.deletes list List of deleted element ids

Key Methods

index(self, embeddings)

Builds the IVF index from sparse embeddings. Optionally samples a subset for training. Computes k-means clusters (unless the dataset is small enough for exact search), aggregates embeddings into cluster blocks, prunes small clusters below minpoints, and computes block-max centroid vectors. Records the cluster count in build metadata.

append(self, embeddings)

Assigns new embeddings to existing clusters via centroid similarity, extends the cluster id lists and data blocks accordingly, and updates the offset.

delete(self, ids)

Appends ids to the deletes list. Deleted ids are filtered out during search and excluded from the count.

search(self, queries, limit)

Performs batched search using a thread pool. For approximate mode, selects the top nprobe clusters via centroid similarity, then scans the selected blocks in parallel using safe_sparse_dot. Returns [(id, score)] per query.

count(self)

Returns the total number of elements minus the number of deleted ids.

load(self, path)

Loads the IVF index from a binary file containing msgpack-serialized headers, cluster ids, deletes, and SparseArray-encoded centroid and block data.

save(self, path)

Writes the IVF index to a binary file using the msgpack/SparseArray format with header, centroids, cluster id mapping, data blocks, and deletes.

build(self, train, clusters)

Runs MiniBatchKMeans clustering on the training data, finds the closest real data points to each cluster center, filters duplicate centroids, and returns the unique centroid vectors.

aggregate(self, data)

Assigns each embedding to its nearest centroid using L2 distance. Returns a dictionary mapping cluster id to the list of element indices assigned to that cluster.

topn(self, queries, data, limit, deletes=None)

Computes dot product similarity between queries and data, zeros out deleted rows, and returns the top-n indices and scores using argpartition.

scan(self, query, limit, blockids)

Scans the specified blocks for a single query. Stacks block data, computes similarity, filters deletes, and maps internal indices back to global element ids.

Usage Examples

Basic Usage

import numpy as np
from scipy.sparse import random as sparse_random
from txtai.ann.sparse import IVFSparse

# Configuration for IVFSparse backend
config = {
    "backend": "ivfsparse",
    "ivfsparse": {
        "nlist": 16,
        "nprobe": 4,
        "minpoints": 39,
        "sample": 0.5
    }
}

# Create sparse embeddings (10000 docs, 30000 vocab, 0.1% density)
embeddings = sparse_random(10000, 30000, density=0.001, format="csr", dtype=np.float32)

# Build the index
ann = IVFSparse(config)
ann.index(embeddings)

# Search with sparse query vectors
queries = sparse_random(2, 30000, density=0.005, format="csr", dtype=np.float32)
results = ann.search(queries, limit=10)
# results: [[(id, score), ...], [(id, score), ...]]

# Save and reload
ann.save("/tmp/ivfsparse.idx")
ann2 = IVFSparse(config)
ann2.load("/tmp/ivfsparse.idx")

print(ann2.count())  # 10000

ann2.close()

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment