Implementation:Neuml Txtai SparseVectors Base

Knowledge Sources	Neuml_Txtai
Domains	Embeddings, Vectors, Sparse Vectors, Neural Retrieval
Last Updated	2026-02-10 01:00 GMT

Overview

Base class provided by txtai for sparse vector models, which transform input content into sparse arrays.

Description

The SparseVectors class extends the base Vectors class to provide foundational support for sparse embedding models (e.g., SPLADE). It uses SciPy sparse matrices (CSR format) and scikit-learn utilities for efficient storage and computation.

Key features:

Sparse encoding: The encode method calls the parent class encode method to generate sparse PyTorch tensors, then converts them to SciPy CSR matrices by extracting coalesced indices and values from the sparse tensor.
Sparse vector aggregation: The vectors method runs indexing, then rebuilds the full sparse embedding matrix by loading batches from a stream file and stacking them with scipy.sparse.vstack.
Dot product computation: The dot method uses safe_sparse_dot from scikit-learn for efficient sparse matrix multiplication, returning dense output.
Normalization: Optional L2 normalization via sklearn.preprocessing.normalize. The default normalization setting is False (sparse vectors typically perform better unnormalized).
SparseArray serialization: Uses txtai's SparseArray utility for loading and saving sparse embeddings.
Unsupported operations: truncate and quantize raise ValueError since these operations are not meaningful for sparse vectors.
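
The encoding and aggregation steps listed above can be sketched with NumPy and SciPy alone. This is a minimal illustration, not txtai's code: the NumPy arrays stand in for the indices (shape 2 x nnz) and values that a coalesced sparse PyTorch tensor exposes, and the helper name coo_parts_to_csr, the 6-term vocabulary, and the sample values are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def coo_parts_to_csr(indices, values, shape):
    """Build a CSR matrix from COO-style (row, column) indices and values.

    Hypothetical helper: `indices` mimics the 2 x nnz index array of a
    coalesced sparse tensor, `values` mimics its nnz values.
    """
    rows, cols = indices
    return csr_matrix((values, (rows, cols)), shape=shape)


# Two illustrative "batches" of sparse embeddings over a 6-term vocabulary
batch1 = coo_parts_to_csr(
    np.array([[0, 0, 1], [1, 4, 2]]), np.array([0.5, 1.5, 2.0]), (2, 6)
)
batch2 = coo_parts_to_csr(np.array([[0], [4]]), np.array([3.0]), (1, 6))

# Stack the batches row-wise into one embeddings matrix, as the vectors
# method is described as doing with scipy.sparse.vstack
embeddings = vstack([batch1, batch2]).tocsr()

print(embeddings.shape)  # (3, 6)
print(embeddings.nnz)    # 4 stored non-zero values
```

Only the non-zero entries are stored, which is why CSR suits vocabulary-sized vectors where most weights are zero.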

Usage

Use SparseVectors as the base class for any sparse embedding model backend. It is extended by specific sparse model implementations and is used by the Sparse scoring class. This is appropriate when working with learned sparse representations like SPLADE, where the output is a high-dimensional sparse vector rather than a dense embedding.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/vectors/sparse/base.py

Signature

class SparseVectors(Vectors):
    def __init__(self, config, scoring, models)
    def encode(self, data, category=None) -> csr_matrix
    def vectors(self, documents, batchsize=500, checkpoint=None, buffer=None, dtype=None) -> tuple
    def dot(self, queries, data) -> list
    def loadembeddings(self, f) -> csr_matrix
    def saveembeddings(self, f, embeddings)
    def truncate(self, embeddings)  # raises ValueError
    def normalize(self, embeddings) -> csr_matrix
    def quantize(self, embeddings)  # raises ValueError
    def defaultnormalize(self) -> bool
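
The behavior of the unsupported operations and the normalization default can be illustrated with a stand-in class. This is a hedged sketch only: SparseOpsSketch is a hypothetical class that does not import or subclass anything from txtai, and the error messages are illustrative rather than the library's actual text.

```python
class SparseOpsSketch:
    """Hypothetical stand-in mirroring part of the signature above."""

    def defaultnormalize(self):
        # Sparse vectors default to unnormalized
        return False

    def truncate(self, embeddings):
        # Illustrative message, not txtai's wording
        raise ValueError("Truncate is not supported for sparse vectors")

    def quantize(self, embeddings):
        # Illustrative message, not txtai's wording
        raise ValueError("Quantize is not supported for sparse vectors")


sketch = SparseOpsSketch()
print(sketch.defaultnormalize())  # False

try:
    sketch.truncate(None)
except ValueError as error:
    print(f"Rejected: {error}")
```

Callers that share code paths between dense and sparse models may want to catch ValueError, or skip these steps, when the configured model is sparse.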

Import

from txtai.vectors.sparse.base import SparseVectors

I/O Contract

Inputs

Name	Type	Required	Description
config	dict	Yes	Configuration dictionary. Optional key: normalize (bool, default False for sparse vectors).
scoring	Scoring	No	Optional scoring instance.
models	object	No	Shared models cache instance.
data	list[str]	Yes (encode)	List of text strings to encode into sparse vectors.
category	str	No	Optional category hint (e.g., "query" or "data").
documents	iterable	Yes (vectors)	Iterable of (id, data, tags) tuples for building the full sparse index.
batchsize	int	No	Number of embeddings per batch (default 500).
queries	csr_matrix	Yes (dot)	Sparse query matrix for dot product computation.
data (dot)	csr_matrix	Yes (dot)	Sparse data matrix for dot product computation.

Outputs

Name	Type	Description
embeddings (encode)	csr_matrix	SciPy CSR matrix of sparse embedding vectors.
vectors result	tuple(list, int, csr_matrix)	Tuple of (ids, dimensions, stacked_embeddings) from the full indexing pipeline.
dot product	list	Dense dot product results as a nested list.
defaultnormalize	bool	Always returns False (sparse vectors default to unnormalized).
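
To make the dot and normalize contracts concrete, the sketch below calls the same scikit-learn utilities named in the description on small hand-built matrices. It is an approximation under stated assumptions: the data matrix is transposed so that each query row is scored against each data row, and the sample values are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot

# Two data rows and one query row over a 4-term vocabulary (made-up values)
data = csr_matrix(np.array([[0.0, 3.0, 4.0, 0.0], [1.0, 0.0, 0.0, 0.0]]))
queries = csr_matrix(np.array([[0.0, 1.0, 0.0, 0.0]]))

# Sparse x sparse product with dense output, converted to a nested list:
# one inner list per query, one score per data row
scores = safe_sparse_dot(queries, data.T, dense_output=True).tolist()
print(scores)  # [[3.0, 0.0]]

# Optional L2 normalization: each row is scaled to unit length
unit = normalize(data, norm="l2", axis=1)
print(unit.toarray()[0])  # row 0 becomes [0, 0.6, 0.8, 0]
```

The normalized result stays sparse, so it can be passed back into the same dot product.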

Usage Examples

from txtai.vectors import VectorsFactory

# Create sparse vectors model (typically via factory)
config = {
    "method": "splade",
    "path": "naver/splade-cocondenser-ensembledistil",
    "normalize": False
}
model = VectorsFactory.create(config, None)

# Encode text to sparse vectors
data = ["neural information retrieval", "sparse vector representations"]
embeddings = model.encode(data)
# Returns: SciPy CSR matrix

# Compute dot product similarity
queries = model.encode(["retrieval models"])
scores = model.dot(queries, embeddings)
# Returns: nested list of dense scores (one inner list per query)

# Normalize sparse embeddings (if needed)
normalized = model.normalize(embeddings)

# Build full index from documents
documents = [
    (0, "sparse retrieval with SPLADE", None),
    (1, "dense vs sparse embeddings", None),
]
ids, dimensions, sparse_matrix = model.vectors(iter(documents))
