Implementation:Neuml Txtai SparseVectors Base
| Knowledge Sources | |
|---|---|
| Domains | Embeddings, Vectors, Sparse Vectors, Neural Retrieval |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete base class for sparse vector models that transform input content into sparse arrays, provided by txtai.
Description
The SparseVectors class extends the base Vectors class to provide foundational support for sparse embedding models (e.g., SPLADE). It uses SciPy sparse matrices (CSR format) and scikit-learn utilities for efficient storage and computation.
Key features:
- Sparse encoding: The encode method calls the parent class encode method to generate sparse PyTorch tensors, then converts them to SciPy CSR matrices by extracting coalesced indices and values from the sparse tensor.
- Sparse vector aggregation: The vectors method runs indexing, then rebuilds the full sparse embedding matrix by loading batches from a stream file and stacking them with scipy.sparse.vstack.
- Dot product computation: The dot method uses safe_sparse_dot from scikit-learn for efficient sparse matrix multiplication, returning dense output.
- Normalization: Optional L2 normalization via sklearn.preprocessing.normalize. The default normalization setting is False (sparse vectors typically perform better unnormalized).
- SparseArray serialization: Uses txtai's SparseArray utility for loading and saving sparse embeddings.
- Unsupported operations: truncate and quantize raise ValueError since these operations are not meaningful for sparse vectors.
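The encoding and aggregation steps in the list above can be sketched with SciPy alone. This is a minimal illustration, not txtai's implementation: the helper to_csr, the sample indices and values, and the 6-term vocabulary size are all assumptions made for the example. It shows coordinate-style (row, column) indices and values, as a coalesced sparse tensor would expose them, being turned into CSR matrices and then stacked.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def to_csr(indices, values, shape):
    """Build a CSR matrix from coordinate (row, column) indices and values.

    Hypothetical helper for illustration only.
    """
    rows, cols = indices
    return csr_matrix((values, (rows, cols)), shape=shape)


# Assumed sample data: two batches over a 6-term vocabulary
batch1 = to_csr(np.array([[0, 0, 1], [1, 4, 2]]), np.array([0.5, 1.2, 0.8]), (2, 6))
batch2 = to_csr(np.array([[0], [4]]), np.array([2.0]), (1, 6))

# Stack the per-batch matrices row-wise into one embeddings matrix
embeddings = vstack([batch1, batch2]).tocsr()

print(embeddings.shape, embeddings.nnz)
```

Stacking keeps every batch sparse, so memory use grows with the number of non-zero weights rather than with rows times vocabulary size.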
Usage
Use SparseVectors as the base class for any sparse embedding model backend. It is extended by specific sparse model implementations and is used by the Sparse scoring class. This is appropriate when working with learned sparse representations like SPLADE, where the output is a high-dimensional sparse vector rather than a dense embedding.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/vectors/sparse/base.py
Signature
class SparseVectors(Vectors):
    def __init__(self, config, scoring, models)
    def encode(self, data, category=None) -> csr_matrix
    def vectors(self, documents, batchsize=500, checkpoint=None, buffer=None, dtype=None) -> tuple
    def dot(self, queries, data) -> list
    def loadembeddings(self, f) -> csr_matrix
    def saveembeddings(self, f, embeddings)
    def truncate(self, embeddings)  # raises ValueError
    def normalize(self, embeddings) -> csr_matrix
    def quantize(self, embeddings)  # raises ValueError
    def defaultnormalize(self) -> bool
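Because truncate and quantize raise ValueError, calling code that handles both dense and sparse backends may want to guard those calls. The sketch below uses a hypothetical stand-in class, ToySparseBackend, that only mirrors the documented behavior; the class, the supports helper, and the error messages are assumptions for illustration and are not taken from txtai.

```python
class ToySparseBackend:
    """Hypothetical stand-in mirroring the documented behavior; not txtai code."""

    def defaultnormalize(self):
        # Sparse vectors default to unnormalized
        return False

    def truncate(self, embeddings):
        # Assumed message text
        raise ValueError("Truncate is not supported for sparse vectors")

    def quantize(self, embeddings):
        # Assumed message text
        raise ValueError("Quantize is not supported for sparse vectors")


def supports(model, operation, embeddings):
    """Return True if the named operation runs without raising ValueError."""
    try:
        getattr(model, operation)(embeddings)
        return True
    except ValueError:
        return False


backend = ToySparseBackend()
print(supports(backend, "truncate", []), supports(backend, "quantize", []))
```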
Import
from txtai.vectors.sparse.base import SparseVectors
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Configuration dictionary. Optional key: normalize (bool, default False for sparse vectors). |
| scoring | Scoring | No | Optional scoring instance. |
| models | object | No | Shared models cache instance. |
| data | list[str] | Yes (encode) | List of text strings to encode into sparse vectors. |
| category | str | No | Optional category hint (e.g., "query" or "data"). |
| documents | iterable | Yes (vectors) | Iterable of (id, data, tags) tuples for building the full sparse index. |
| batchsize | int | No | Number of embeddings per batch (default 500). |
| queries | csr_matrix | Yes (dot) | Sparse query matrix for dot product computation. |
| data (dot) | csr_matrix | Yes (dot) | Sparse data matrix for dot product computation. |
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings (encode) | csr_matrix | SciPy CSR matrix of sparse embedding vectors. |
| vectors result | tuple(list, int, csr_matrix) | Tuple of (ids, dimensions, stacked_embeddings) from the full indexing pipeline. |
| dot product | list | Dense dot product results as a nested list. |
| defaultnormalize | bool | Always returns False (sparse vectors default to unnormalized). |
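To make the shape of the dot product output concrete, here is a small sketch that calls scikit-learn's safe_sparse_dot and normalize directly. It assumes the data matrix is transposed so that each query row is scored against each data row, giving one inner list per query; the matrices are made-up sample values, and this is not a copy of txtai's dot or normalize methods.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from sklearn.utils.extmath import safe_sparse_dot

# Assumed sample data: 2 documents and 1 query over a 4-term vocabulary
data = csr_matrix(np.array([[0.0, 3.0, 0.0, 4.0], [1.0, 0.0, 0.0, 0.0]]))
queries = csr_matrix(np.array([[0.0, 1.0, 0.0, 1.0]]))

# Sparse x sparse product, returned as a dense nested list (queries x documents)
scores = safe_sparse_dot(queries, data.T, dense_output=True).tolist()

# Optional L2 normalization turns the same product into cosine similarity
unit_data = normalize(data)
unit_queries = normalize(queries)
cosine = safe_sparse_dot(unit_queries, unit_data.T, dense_output=True).tolist()

print(scores)
print(cosine)
```

With unnormalized vectors the score reflects raw term weights, which is the default described above; normalizing first bounds each score by 1.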
Usage Examples
from txtai.vectors import VectorsFactory
# Create sparse vectors model (typically via factory)
config = {
    "method": "splade",
    "path": "naver/splade-cocondenser-ensembledistil",
    "normalize": False
}
model = VectorsFactory.create(config, None)
# Encode text to sparse vectors
data = ["neural information retrieval", "sparse vector representations"]
embeddings = model.encode(data)
# Returns: SciPy CSR matrix
# Compute dot product similarity
queries = model.encode(["retrieval models"])
scores = model.dot(queries, embeddings)
# Normalize sparse embeddings (if needed)
normalized = model.normalize(embeddings)
# Build full index from documents
documents = [
    (0, "sparse retrieval with SPLADE", None),
    (1, "dense vs sparse embeddings", None),
]
ids, dimensions, sparse_matrix = model.vectors(iter(documents))
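Since encode and vectors return SciPy CSR matrices, the results can be inspected with ordinary SciPy attributes. The following sketch works on a hand-built matrix rather than real model output; the helper top_terms and the sample weights are assumptions for illustration. It reports the density of the matrix and the highest-weighted column indices for one row.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms(matrix, row, k=3):
    """Return (column, weight) pairs for the k largest weights in one row.

    Hypothetical helper for illustration only.
    """
    start, end = matrix.indptr[row], matrix.indptr[row + 1]
    cols, vals = matrix.indices[start:end], matrix.data[start:end]
    order = np.argsort(-vals)[:k]
    return [(int(cols[i]), float(vals[i])) for i in order]


# Assumed sample embeddings: 2 rows over a 5-term vocabulary
embeddings = csr_matrix(np.array([[0.0, 0.2, 0.0, 1.5, 0.7], [0.9, 0.0, 0.0, 0.0, 0.0]]))

# Fraction of stored (non-zero) entries
density = embeddings.nnz / (embeddings.shape[0] * embeddings.shape[1])

print(density, top_terms(embeddings, 0, k=2))
```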