Implementation:Neuml Txtai Late Pooling

Knowledge Sources	Neuml_Txtai
Domains	Late Interaction, ColBERT, Multi-Vector Embeddings
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for late interaction pooling with ColBERT-style multi-vector embeddings.

Description

The LatePooling class extends Pooling to implement late interaction pooling, which produces per-token embeddings (a multi-vector representation) rather than a single pooled vector per document. ColBERT and similar models use this approach, computing relevance from token-level interactions between queries and documents. The class loads a linear projection layer from safetensors weights (supporting both PyLate and Stanford ColBERT formats), applies it to the transformer's hidden-state output, and handles query/document prefixes and per-category maximum lengths. During post-encoding, output vectors are L2-normalized and zero-padded to a uniform length across the batch. An optional MUVERA encoder can be configured to reduce the multi-vector output to a single fixed-dimensional vector for efficient retrieval. Model settings (prefixes and max lengths) are read from either PyLate's config_sentence_transformers.json or Stanford's artifact.metadata format.
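
The post-encoding step described above can be sketched in NumPy. This is an illustrative sketch, not txtai's code: the helper name `normalize_and_pad` and the small epsilon guard are assumptions made for the example.

```python
import numpy as np

def normalize_and_pad(batch):
    """L2-normalize each token vector, then zero-pad every item to the longest one.

    batch: list of arrays, each of shape (num_tokens_i, dim).
    Hypothetical helper for illustration only.
    """
    longest = max(vectors.shape[0] for vectors in batch)
    dim = batch[0].shape[1]
    output = np.zeros((len(batch), longest, dim), dtype=np.float32)

    for i, vectors in enumerate(batch):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Clip the norm to avoid dividing by zero on an all-zero row
        output[i, : vectors.shape[0]] = vectors / np.clip(norms, 1e-12, None)

    return output

rng = np.random.default_rng(0)
batch = [rng.random((3, 4)), rng.random((5, 4))]
padded = normalize_and_pad(batch)
print(padded.shape)
```

Rows beyond an item's real token count stay all-zero, so they can be told apart from real (unit-length) token vectors downstream.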

Usage

Use LatePooling for ColBERT-style late interaction retrieval where token-level matching provides higher accuracy than single-vector approaches. It is automatically selected by PoolingFactory when the model has a 1_Dense/config.json file or an "HF_ColBERT" architecture. Enable the MUVERA encoder via modelargs to convert multi-vector outputs to single fixed-dimensional vectors for use with standard ANN indexes.
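
For readers unfamiliar with token-level matching, the following is a minimal sketch of ColBERT-style MaxSim scoring over multi-vector outputs. It is not part of LatePooling; the function name `maxsim` and the toy vectors are assumptions for the example. Zero padding rows score 0 against every token, so they only win the max if every real token has negative similarity.

```python
import numpy as np

def maxsim(query, document):
    """Late interaction score: for each query token, take its best-matching
    document token, then sum those maxima.

    query: (query_tokens, dim), document: (document_tokens, dim).
    Rows are assumed L2-normalized; padding rows are all-zero.
    """
    similarity = query @ document.T  # (query_tokens, document_tokens)
    return float(similarity.max(axis=1).sum())

# Toy 2-dimensional vectors, chosen only to make the arithmetic easy to follow
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # last row is padding
doc_b = np.array([[0.6, 0.8], [0.0, 0.0]])              # last row is padding

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a > score_b)
```

With MUVERA enabled this step is replaced by an ordinary dot product between fixed vectors, which is what allows standard ANN indexes to be used.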

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/models/pooling/late.py

Signature

class LatePooling(Pooling):
    def __init__(self, path, device, tokenizer=None, maxlength=None, loadprompts=None, modelargs=None)
    def forward(self, **inputs)
    def preencode(self, documents, category)
    def postencode(self, results, category)
    def settings(self, path, config)

Import

from txtai.models.pooling.late import LatePooling

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to a late interaction model on Hugging Face Hub or local filesystem. Must contain safetensors weights with a `linear.weight` tensor.
device	int or str	Yes	Tensor device id for model placement.
tokenizer	str	No	Optional path to a custom tokenizer.
maxlength	int	No	Default max sequence length (may be overridden per-category by model settings).
loadprompts	bool	No	Whether to load instruction prompts.
modelargs	dict	No	Additional model arguments. Supports a `muvera` key with MUVERA configuration (set to `{}` for defaults, `None` to disable).
documents	list of str	Yes (for encode)	Input documents to encode into multi-vector or fixed-vector embeddings.
category	str	No	"query" or "data" - controls prefix, max length, and MUVERA aggregation behavior.

Outputs

Name	Type	Description
forward()	torch.Tensor	Token-level embeddings after linear projection, shape (batch_size, seq_length, projection_dim).
encode()	numpy.ndarray	3D array of shape (num_documents, max_seq_length, projection_dim) with L2-normalized, zero-padded multi-vector embeddings. When MUVERA is enabled, returns a 2D array of shape (num_documents, muvera_output_dim).
preencode()	list	Documents with query/document prefixes applied and maxlength adjusted per category.
postencode()	list or numpy.ndarray	L2-normalized and padded results; optionally transformed to fixed vectors via MUVERA.
settings()	list	A 4-element list: [query_prefix, query_length, document_prefix, document_length].
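
The relationship between settings() and preencode() can be sketched as follows. This is a hedged illustration, not txtai's implementation: `apply_settings` is a hypothetical helper, the prefix and length values are made-up placeholders, and treating any non-"query" category as document input is an assumption.

```python
def apply_settings(documents, category, settings, maxlength=None):
    """Apply the per-category prefix and pick the per-category max length.

    settings: [query_prefix, query_length, document_prefix, document_length],
    matching the 4-element list described for settings().
    """
    query_prefix, query_length, document_prefix, document_length = settings

    if category == "query":
        prefix, length = query_prefix, query_length
    else:
        # Assumption: anything other than "query" is treated as document input
        prefix, length = document_prefix, document_length

    prefixed = [f"{prefix}{text}" if prefix else text for text in documents]

    # Fall back to the default maxlength when the model settings give none
    return prefixed, (length if length else maxlength)

# Placeholder values for illustration only
example_settings = ["[Q] ", 32, "[D] ", 180]
queries, query_length = apply_settings(["What is deep learning?"], "query", example_settings)
docs, doc_length = apply_settings(["Neural network architectures"], "data", example_settings)
print(queries, query_length)
print(docs, doc_length)
```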

Usage Examples

from txtai.models.pooling.late import LatePooling

# Create a late interaction model (ColBERT)
model = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": None}  # Disable MUVERA, use raw multi-vectors
)

# Encode documents (multi-vector output)
docs = ["Machine learning fundamentals", "Neural network architectures"]
doc_embeddings = model.encode(docs, category="data")
# doc_embeddings.shape: (2, max_tokens, projection_dim)

# Encode query
query_embedding = model.encode(["What is deep learning?"], category="query")

# With MUVERA enabled for single-vector output
model_muvera = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": {"repetitions": 20, "hashes": 5, "projection": 16}}
)

fixed_embeddings = model_muvera.encode(docs, category="data")
# fixed_embeddings.shape: (2, 10240)
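
The 10240 in the shape comment above is consistent with the usual MUVERA fixed-dimensional encoding size: one projection-sized block per hash bucket, per repetition. The helper below is a sketch of that arithmetic, assuming no additional final projection is applied; `muvera_dimension` is a hypothetical name, not a txtai function.

```python
def muvera_dimension(repetitions, hashes, projection):
    """Output size of a fixed-dimensional encoding.

    Each repetition partitions tokens into 2**hashes buckets, and each bucket
    holds one vector of size `projection`. Assumes no final projection step.
    """
    return repetitions * (2 ** hashes) * projection

dimension = muvera_dimension(repetitions=20, hashes=5, projection=16)
print(dimension)  # 10240
```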
