

Implementation:Neuml Txtai Late Pooling

From Leeroopedia


Knowledge Sources
Domains: Late Interaction, ColBERT, Multi-Vector Embeddings
Last Updated: 2026-02-10 01:00 GMT

Overview

LatePooling is txtai's concrete implementation of late interaction pooling for ColBERT-style multi-vector embeddings.

Description

The LatePooling class extends Pooling to implement late interaction pooling, which produces per-token embeddings (a multi-vector representation) rather than a single pooled vector per document. ColBERT and similar models use this approach, computing relevance from token-level interactions between queries and documents. The class loads a linear projection layer from safetensors weights (both PyLate and Stanford ColBERT formats are supported) and applies it to the transformer's hidden state output. It also handles query/document prefixes and per-category maximum lengths. During post-encoding, output vectors are L2-normalized and zero-padded to a uniform length across the batch. An optional MUVERA encoder can be configured to reduce the multi-vector output to a single fixed-dimensional vector for efficient retrieval. Model settings (prefixes and maximum lengths) are read from either PyLate's config_sentence_transformers.json or Stanford ColBERT's artifact.metadata format.
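
The token-level interaction described above is commonly scored with a "MaxSim" operator: each query token is matched to its most similar document token, and the per-token maxima are summed. The following is a minimal NumPy sketch of that idea, not code taken from txtai; the function name maxsim_score and the toy vectors are illustrative assumptions.

```python
import numpy as np

def maxsim_score(query_vectors, document_vectors):
    """Late interaction score: for each query token, take the similarity of
    its best-matching document token, then sum over query tokens."""
    # (query_tokens, document_tokens) matrix of dot products; with
    # L2-normalized inputs these are cosine similarities
    similarity = query_vectors @ document_vectors.T
    return float(similarity.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document token slots, 2 dimensions
query = np.array([[1.0, 0.0], [0.0, 1.0]])
document = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 0.0]])  # last row is zero padding

score = maxsim_score(query, document)
```

Zero-padded rows contribute a similarity of 0, so they only affect the score when every real document token has negative similarity to a query token.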

Usage

Use LatePooling for ColBERT-style late interaction retrieval where token-level matching provides higher accuracy than single-vector approaches. It is automatically selected by PoolingFactory when the model has a 1_Dense/config.json file or an "HF_ColBERT" architecture. Enable the MUVERA encoder via modelargs to convert multi-vector outputs to single fixed vectors for use with standard ANN indexes.
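
The two selection signals named above can be illustrated with a small local check. This is a hypothetical helper, not the PoolingFactory implementation: the function name is invented, it only inspects a local directory (not the Hugging Face Hub), and it assumes the architecture name is listed under the "architectures" key of config.json, as is conventional for Hugging Face model configs.

```python
import json
import os
import tempfile

def looks_like_late_interaction(path):
    """Hypothetical check mirroring the two signals described in the text."""
    # Signal 1: a dense projection config exists at 1_Dense/config.json
    if os.path.exists(os.path.join(path, "1_Dense", "config.json")):
        return True

    # Signal 2 (assumed location): config.json lists the HF_ColBERT architecture
    config = os.path.join(path, "config.json")
    if os.path.exists(config):
        with open(config, encoding="utf-8") as handle:
            architectures = json.load(handle).get("architectures") or []
        return "HF_ColBERT" in architectures

    return False

# Exercise the check against a throwaway local model directory
with tempfile.TemporaryDirectory() as directory:
    before = looks_like_late_interaction(directory)
    os.makedirs(os.path.join(directory, "1_Dense"))
    with open(os.path.join(directory, "1_Dense", "config.json"), "w", encoding="utf-8") as handle:
        json.dump({}, handle)
    after = looks_like_late_interaction(directory)
```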

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/models/pooling/late.py

Signature

class LatePooling(Pooling):
    def __init__(self, path, device, tokenizer=None, maxlength=None, loadprompts=None, modelargs=None)
    def forward(self, **inputs)
    def preencode(self, documents, category)
    def postencode(self, results, category)
    def settings(self, path, config)

Import

from txtai.models.pooling.late import LatePooling

I/O Contract

Inputs

Name | Type | Required | Description
path | str | Yes | Path to a late interaction model on Hugging Face Hub or local filesystem. Must contain safetensors weights with a linear.weight tensor.
device | int or str | Yes | Tensor device id for model placement.
tokenizer | str | No | Optional path to a custom tokenizer.
maxlength | int | No | Default max sequence length (may be overridden per-category by model settings).
loadprompts | bool | No | Whether to load instruction prompts.
modelargs | dict | No | Additional model arguments. Supports a muvera key with MUVERA configuration (set to {} for defaults, None to disable).
documents | list of str | Yes (for encode) | Input documents to encode into multi-vector or fixed-vector embeddings.
category | str | No | "query" or "data"; controls prefix, max length, and MUVERA aggregation behavior.
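
To make the effect of category concrete, here is a rough sketch of per-category prefixing. It is not the preencode implementation: apply_prefix is a hypothetical helper, the settings list follows the 4-element layout documented for settings(), and the prefix strings and lengths are placeholder values rather than values read from any model.

```python
def apply_prefix(documents, category, settings):
    """Hypothetical helper: prepend the category's prefix and return its max length."""
    query_prefix, query_length, document_prefix, document_length = settings
    if category == "query":
        prefix, length = query_prefix, query_length
    else:
        prefix, length = document_prefix, document_length
    return [prefix + text for text in documents], length

# Placeholder settings: [query_prefix, query_length, document_prefix, document_length]
settings = ["[Q] ", 32, "[D] ", 180]

queries, query_length = apply_prefix(["What is deep learning?"], "query", settings)
data, data_length = apply_prefix(["Machine learning fundamentals"], "data", settings)
```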

Outputs

Name | Type | Description
forward() | torch.Tensor | Token-level embeddings after linear projection, shape (batch_size, seq_length, projection_dim).
encode() | numpy.ndarray | 3D array of shape (num_documents, max_seq_length, projection_dim) with L2-normalized, zero-padded multi-vector embeddings. When MUVERA is enabled, returns a 2D array of shape (num_documents, muvera_output_dim).
preencode() | list | Documents with query/document prefixes applied and maxlength adjusted per category.
postencode() | list or numpy.ndarray | L2-normalized and padded results; optionally transformed to fixed vectors via MUVERA.
settings() | list | A 4-element list: [query_prefix, query_length, document_prefix, document_length].
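
The 3D output shape follows from normalizing each token vector and padding every document to the longest sequence in the batch. The sketch below reproduces that behavior with NumPy under simple assumptions; normalize_and_pad is an illustrative name, not a txtai function, and the real postencode may differ in details such as dtype or batching.

```python
import numpy as np

def normalize_and_pad(batch):
    """L2-normalize each token vector, then zero-pad every document to the
    longest sequence in the batch."""
    longest = max(len(vectors) for vectors in batch)
    dimensions = batch[0].shape[1]
    output = np.zeros((len(batch), longest, dimensions), dtype=np.float32)
    for index, vectors in enumerate(batch):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Clip the norms to avoid dividing by zero for all-zero vectors
        output[index, : len(vectors)] = vectors / np.clip(norms, 1e-12, None)
    return output

batch = [
    np.array([[3.0, 4.0], [0.0, 2.0]]),              # 2 tokens
    np.array([[1.0, 0.0], [0.0, 5.0], [6.0, 8.0]]),  # 3 tokens
]
embeddings = normalize_and_pad(batch)
```

Here embeddings has shape (2, 3, 2): the first document gains one all-zero row so that both documents share the same sequence length.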

Usage Examples

from txtai.models.pooling.late import LatePooling

# Create a late interaction model (ColBERT)
model = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": None}  # Disable MUVERA, use raw multi-vectors
)

# Encode documents (multi-vector output)
docs = ["Machine learning fundamentals", "Neural network architectures"]
doc_embeddings = model.encode(docs, category="data")
# doc_embeddings.shape: (2, max_tokens, projection_dim)

# Encode query
query_embedding = model.encode(["What is deep learning?"], category="query")

# With MUVERA enabled for single-vector output
model_muvera = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": {"repetitions": 20, "hashes": 5, "projection": 16}}
)

fixed_embeddings = model_muvera.encode(docs, category="data")
# fixed_embeddings.shape: (2, 10240)
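
The 10240 in the shape comment above is consistent with the usual MUVERA fixed dimensional encoding size, where each repetition contributes 2**hashes buckets of projection values. The helper below is a sketch of that arithmetic only; it assumes the three modelargs keys play those roles and is not part of txtai.

```python
def muvera_output_dim(repetitions, hashes, projection):
    # Assumed layout: each repetition holds 2**hashes buckets,
    # and each bucket stores `projection` values
    return repetitions * (2 ** hashes) * projection

dimensions = muvera_output_dim(repetitions=20, hashes=5, projection=16)
```

Under this assumption, lowering repetitions or hashes shrinks the fixed vector, trading approximation quality for index size.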
