Implementation:Neuml Txtai SparseVectors Base

From Leeroopedia


Knowledge Sources
Domains Embeddings, Vectors, Sparse Vectors, Neural Retrieval
Last Updated 2026-02-10 01:00 GMT

Overview

Base class for sparse vector models, provided by txtai. Sparse vector models transform input content into sparse arrays.

Description

The SparseVectors class extends the base Vectors class to provide foundational support for sparse embedding models (e.g., SPLADE). It uses SciPy sparse matrices (CSR format) and scikit-learn utilities for efficient storage and computation.

Key features:

  • Sparse encoding: The encode method calls the parent class encode method to generate sparse PyTorch tensors, then converts them to SciPy CSR matrices by extracting coalesced indices and values from the sparse tensor.
  • Sparse vector aggregation: The vectors method runs indexing, then rebuilds the full sparse embedding matrix by loading batches from a stream file and stacking them with scipy.sparse.vstack.
  • Dot product computation: The dot method uses safe_sparse_dot from scikit-learn for efficient sparse matrix multiplication, returning dense output.
  • Normalization: Optional L2 normalization via sklearn.preprocessing.normalize, applied only when the normalize configuration setting is enabled. The default setting is False, so sparse vectors are left unnormalized unless configured otherwise.
  • SparseArray serialization: Uses txtai's SparseArray utility for loading and saving sparse embeddings.
  • Unsupported operations: truncate and quantize raise ValueError since these operations are not meaningful for sparse vectors.
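The sparse encoding step described above can be sketched without PyTorch. This is a minimal illustration, not txtai's code: the NumPy arrays below are hypothetical stand-ins for the coalesced indices and values of a sparse tensor, and the shape is an assumed (inputs, vocabulary size) pair.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-ins for a coalesced sparse tensor:
# indices has shape (2, nnz) with row positions first, column positions second
indices = np.array([[0, 0, 1], [2, 5, 1]])
values = np.array([0.7, 1.2, 0.4], dtype=np.float32)
shape = (2, 8)  # assumed (number of inputs, vocabulary size)

# csr_matrix accepts (data, (row, col)) and builds the compressed row layout
embeddings = csr_matrix((values, (indices[0], indices[1])), shape=shape)

print(embeddings.shape)  # (2, 8)
print(embeddings.nnz)    # 3
```

Only the three non-zero weights are stored, which is what keeps vocabulary-sized vectors affordable.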

Usage

Use SparseVectors as the base class for any sparse embedding model backend. It is extended by specific sparse model implementations and is used by the Sparse scoring class. This is appropriate when working with learned sparse representations like SPLADE, where the output is a high-dimensional sparse vector rather than a dense embedding.

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/vectors/sparse/base.py

Signature

class SparseVectors(Vectors):
    def __init__(self, config, scoring, models)
    def encode(self, data, category=None) -> csr_matrix
    def vectors(self, documents, batchsize=500, checkpoint=None, buffer=None, dtype=None) -> tuple
    def dot(self, queries, data) -> list
    def loadembeddings(self, f) -> csr_matrix
    def saveembeddings(self, f, embeddings)
    def truncate(self, embeddings)  # raises ValueError
    def normalize(self, embeddings) -> csr_matrix
    def quantize(self, embeddings)  # raises ValueError
    def defaultnormalize(self) -> bool

Import

from txtai.vectors.sparse.base import SparseVectors

I/O Contract

Inputs

Name | Type | Required | Description
config | dict | Yes | Configuration dictionary. Optional key: normalize (bool, default False for sparse vectors).
scoring | Scoring | No | Optional scoring instance.
models | object | No | Shared models cache instance.
data | list[str] | Yes (encode) | List of text strings to encode into sparse vectors.
category | str | No | Optional category hint (e.g., "query" or "data").
documents | iterable | Yes (vectors) | Iterable of (id, data, tags) tuples for building the full sparse index.
batchsize | int | No | Number of embeddings per batch (default 500).
queries | csr_matrix | Yes (dot) | Sparse query matrix for dot product computation.
data (dot) | csr_matrix | Yes (dot) | Sparse data matrix for dot product computation.
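To make the dot inputs concrete, here is a small sketch that multiplies a sparse query matrix by a sparse data matrix with scikit-learn's safe_sparse_dot. The toy matrices are made up, and transposing the data matrix so that rows are compared with rows is an assumption about how the operands are arranged, not a statement of the library's exact code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.extmath import safe_sparse_dot

# Toy matrices: one query row, two data rows, four dimensions each
queries = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0]]))
data = csr_matrix(np.array([[0.5, 0.0, 1.0, 0.0],
                            [0.0, 3.0, 0.0, 0.0]]))

# Assumed operand layout: compare each query row against each data row
scores = safe_sparse_dot(queries, data.T, dense_output=True).tolist()

print(scores)  # [[2.5, 0.0]]
```

The result has one inner list per query, with one score per data row.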

Outputs

Name | Type | Description
embeddings (encode) | csr_matrix | SciPy CSR matrix of sparse embedding vectors.
vectors result | tuple(list, int, csr_matrix) | Tuple of (ids, dimensions, stacked_embeddings) from the full indexing pipeline.
dot product | list | Dense dot product results as a nested list.
defaultnormalize | bool | Always returns False (sparse vectors default to unnormalized).
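The effect of the normalize setting can be illustrated as follows. This is a sketch under assumptions: maybe_normalize is a hypothetical helper that mirrors the "optional, off by default" behavior described on this page, and it is not part of txtai.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize

def maybe_normalize(embeddings, enabled=False):
    """Hypothetical helper: L2-normalize each row only when enabled."""
    return normalize(embeddings, norm="l2") if enabled else embeddings

embeddings = csr_matrix(np.array([[3.0, 0.0, 4.0],
                                  [0.0, 2.0, 0.0]]))

unchanged = maybe_normalize(embeddings)           # default: returned as-is
unit = maybe_normalize(embeddings, enabled=True)  # rows scaled to unit length

print(unit.toarray())
```

Normalization rescales the stored values only, so the number of non-zero entries stays the same.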

Usage Examples

from txtai.vectors import SparseVectorsFactory

# Create a sparse vectors model via the sparse factory.
# Assumes the default sentence-transformers sparse backend can load this SPLADE checkpoint.
config = {
    "path": "naver/splade-cocondenser-ensembledistil",
    "normalize": False
}
model = SparseVectorsFactory.create(config, None)

# Encode text to sparse vectors
data = ["neural information retrieval", "sparse vector representations"]
embeddings = model.encode(data)
# Returns: SciPy CSR matrix

# Compute dot product similarity
queries = model.encode(["retrieval models"])
scores = model.dot(queries, embeddings)

# Apply optional L2 normalization (returns the input unchanged here because normalize is False)
normalized = model.normalize(embeddings)

# Build full index from documents
documents = [
    (0, "sparse retrieval with SPLADE", None),
    (1, "dense vs sparse embeddings", None),
]
ids, dimensions, sparse_matrix = model.vectors(iter(documents))
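The batch aggregation performed by the vectors method can be sketched with scipy.sparse.vstack. In this illustration an in-memory list of made-up batches stands in for the stream file, so the loop shows only the stacking pattern, not txtai's serialization.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Hypothetical batches, standing in for arrays read back from the stream file
batches = [
    csr_matrix(np.array([[1.0, 0.0, 0.0],
                         [0.0, 2.0, 0.0]])),
    csr_matrix(np.array([[0.0, 0.0, 3.0]])),
]

# Stack batches row-wise into one sparse matrix
stacked = None
for batch in batches:
    stacked = batch if stacked is None else vstack((stacked, batch), format="csr")

print(stacked.shape)  # (3, 3)
```

Every batch must have the same number of columns, since stacking only adds rows.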
