Implementation:Neuml Txtai Late Pooling
| Knowledge Sources | |
|---|---|
| Domains | Late Interaction, ColBERT, Multi-Vector Embeddings |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool, provided by txtai, for late interaction pooling with ColBERT-style multi-vector embeddings.
Description
The LatePooling class extends Pooling to implement late interaction pooling, which produces per-token embeddings (a multi-vector representation) rather than a single pooled vector per document. ColBERT and similar models use this approach, computing relevance from token-level interactions between queries and documents. The class loads a linear projection layer from safetensors weights (both PyLate and Stanford ColBERT formats are supported), applies it to the transformer's hidden state output, and handles query/document prefixes and per-category maximum lengths. During post-encoding, output vectors are L2-normalized and zero-padded to a uniform length across the batch. An optional MUVERA encoder can be configured to reduce the multi-vector output to a single fixed-dimensional vector for efficient retrieval. Model settings (prefixes and maximum lengths) are read from either PyLate's config_sentence_transformers.json or Stanford's artifact.metadata format.
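The post-encoding step described above (L2 normalization followed by zero-padding to a uniform length) can be sketched in NumPy. This is a minimal illustration, not txtai's actual code: the function name `normalize_and_pad` and the assumption that each result arrives as a `(seq_len, dim)` array are hypothetical.

```python
import numpy as np

def normalize_and_pad(results):
    """Hypothetical helper: results is a list of (seq_len_i, dim) arrays."""
    longest = max(r.shape[0] for r in results)
    dim = results[0].shape[1]

    # Rows beyond each document's own length stay zero (the padding)
    output = np.zeros((len(results), longest, dim), dtype=np.float32)
    for i, r in enumerate(results):
        # L2-normalize every token vector; the clip guards against division by zero
        norms = np.linalg.norm(r, axis=1, keepdims=True)
        output[i, : r.shape[0]] = r / np.clip(norms, 1e-12, None)

    return output

batch = [np.array([[3.0, 4.0]]), np.array([[1.0, 0.0], [0.0, 2.0]])]
padded = normalize_and_pad(batch)
print(padded.shape)
```

The result is a single 3D array, which is what makes batched downstream scoring straightforward even though documents have different token counts.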
Usage
Use LatePooling for ColBERT-style late interaction retrieval, where token-level matching can provide higher accuracy than single-vector approaches. PoolingFactory selects it automatically when the model has a 1_Dense/config.json file or an "HF_ColBERT" architecture. Enable the MUVERA encoder via modelargs to convert multi-vector outputs to single fixed-dimensional vectors for use with standard ANN indexes.
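The selection rule above can be illustrated with a small check against a local model directory. This is a sketch only: `looks_like_late_interaction` is a hypothetical helper, not part of txtai, and it assumes the architecture is listed under the `architectures` key of a Hugging Face style `config.json`. It does not handle Hub paths.

```python
import json
import os

def looks_like_late_interaction(path):
    """Hypothetical check mirroring the rule described above (local directories only)."""
    # PyLate-style models ship a dense projection module under 1_Dense/
    if os.path.exists(os.path.join(path, "1_Dense", "config.json")):
        return True

    # Stanford ColBERT checkpoints are assumed to declare an HF_ColBERT architecture
    config = os.path.join(path, "config.json")
    if os.path.exists(config):
        with open(config, encoding="utf-8") as f:
            architectures = json.load(f).get("architectures") or []
        return "HF_ColBERT" in architectures

    return False
```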
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/models/pooling/late.py
Signature
class LatePooling(Pooling):
    def __init__(self, path, device, tokenizer=None, maxlength=None, loadprompts=None, modelargs=None)
    def forward(self, **inputs)
    def preencode(self, documents, category)
    def postencode(self, results, category)
    def settings(self, path, config)
Import
from txtai.models.pooling.late import LatePooling
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to a late interaction model on Hugging Face Hub or local filesystem. Must contain safetensors weights with a linear.weight tensor. |
| device | int or str | Yes | Tensor device id for model placement. |
| tokenizer | str | No | Optional path to a custom tokenizer. |
| maxlength | int | No | Default max sequence length (may be overridden per-category by model settings). |
| loadprompts | bool | No | Whether to load instruction prompts. |
| modelargs | dict | No | Additional model arguments. Supports a muvera key with MUVERA configuration (set to {} for defaults, None to disable). |
| documents | list of str | Yes (for encode) | Input documents to encode into multi-vector or fixed-vector embeddings. |
| category | str | No | "query" or "data" - controls prefix, max length, and MUVERA aggregation behavior. |
Outputs
| Name | Type | Description |
|---|---|---|
| forward() | torch.Tensor | Token-level embeddings after linear projection, shape (batch_size, seq_length, projection_dim). |
| encode() | numpy.ndarray | 3D array of shape (num_documents, max_seq_length, projection_dim) with L2-normalized, zero-padded multi-vector embeddings. When MUVERA is enabled, returns a 2D array of shape (num_documents, muvera_output_dim). |
| preencode() | list | Documents with query/document prefixes applied and maxlength adjusted per category. |
| postencode() | list or numpy.ndarray | L2-normalized and padded results; optionally transformed to fixed vectors via MUVERA. |
| settings() | list | A 4-element list: [query_prefix, query_length, document_prefix, document_length]. |
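Because the multi-vector output is L2-normalized and zero-padded, a ColBERT-style MaxSim score can be computed directly on slices of the 3D array. The sketch below is an illustration under stated assumptions, not txtai's scoring code: `maxsim` is a hypothetical helper, and it treats all-zero rows as padding.

```python
import numpy as np

def maxsim(query, document):
    """Hypothetical MaxSim between a (q_len, dim) query and a (d_len, dim) document."""
    # Rows that are entirely zero are assumed to be padding
    valid = np.any(document != 0, axis=1)

    # Token-to-token similarities, shape (q_len, d_len)
    sims = (query @ document.T).astype(np.float64)

    # Keep padded document tokens from winning the max
    sims[:, ~valid] = -np.inf

    # Best-matching document token per query token, summed over query tokens
    return float(sims.max(axis=1).sum())

query = np.array([[1.0, 0.0], [0.0, 1.0]])
document = np.array([[1.0, 0.0], [0.0, 0.0]])  # second row is padding
score = maxsim(query, document)
```

Padded query rows produce zero similarity against every real document token, so they add nothing to the sum and need no separate mask.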
Usage Examples
from txtai.models.pooling.late import LatePooling
# Create a late interaction model (ColBERT)
model = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": None}  # Disable MUVERA, use raw multi-vectors
)
# Encode documents (multi-vector output)
docs = ["Machine learning fundamentals", "Neural network architectures"]
doc_embeddings = model.encode(docs, category="data")
# doc_embeddings.shape: (2, max_tokens, projection_dim)
# Encode query
query_embedding = model.encode(["What is deep learning?"], category="query")
# With MUVERA enabled for single-vector output
model_muvera = LatePooling(
    path="colbert-ir/colbertv2.0",
    device="cpu",
    modelargs={"muvera": {"repetitions": 20, "hashes": 5, "projection": 16}}
)
fixed_embeddings = model_muvera.encode(docs, category="data")
# fixed_embeddings.shape: (2, 10240)
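The 10240 figure in the last comment is consistent with the fixed-dimensional encoding layout described for MUVERA, where each repetition contributes 2^hashes buckets of projection-sized vectors. The helper below is a hypothetical sanity check that assumes txtai's repetitions, hashes and projection parameters map onto that layout. It is not a txtai function.

```python
def muvera_dimension(repetitions, hashes, projection):
    """Hypothetical helper: expected width of the fixed vector under the assumed layout."""
    # Each repetition has 2**hashes buckets, each holding a projection-sized vector
    return repetitions * (2 ** hashes) * projection

dimension = muvera_dimension(repetitions=20, hashes=5, projection=16)
```

Under this assumption, 20 * 32 * 16 gives 10240, matching the shape shown in the example.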