Implementation:Neuml Txtai Similarity Pipeline

Knowledge Sources	Neuml_Txtai
Domains	Machine Learning, NLP, Semantic Similarity, Transformers
Last Updated	2026-02-10 01:00 GMT

Overview

Similarity is a txtai pipeline that computes semantic similarity between a query and a list of candidate texts using one of three backend strategies.

Description

Similarity extends Labels and provides a unified similarity interface over three backend strategies: zero-shot classification (the default, inherited from Labels), cross-encoder scoring (via CrossEncoder), and late interaction scoring (via LateEncoder). The backend is selected at initialization by the crossencode and lateencode flags. In zero-shot mode, each query is used as a candidate label and the texts are classified against it; the resulting per-text scores are then transposed so that each query gets a list of scores, one per text. All modes return (id, score) tuples sorted by descending score, where id is the index of the text in the input list.
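The transpose-and-rank step of the zero-shot mode can be sketched without loading a model. The snippet below is a minimal illustration, not txtai's actual implementation: the helper name rank_by_query and the hard-coded score matrix are hypothetical, and it assumes the classifier yields one row per text with one score per query.

```python
def rank_by_query(scores_per_text):
    """
    Hypothetical helper: takes one row of scores per text (one column per
    query) and returns, for each query, (text id, score) tuples sorted by
    descending score.
    """
    # Transpose so that each row holds the scores of all texts for one query
    scores_per_query = [list(column) for column in zip(*scores_per_text)]

    # Pair each score with the index of its text, then sort by score
    return [
        sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        for row in scores_per_query
    ]


# Made-up scores: 3 texts (rows) classified against 2 queries (columns)
scores_per_text = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.3],
]

rankings = rank_by_query(scores_per_text)
print(rankings[0])  # ranking of texts for the first query
print(rankings[1])  # ranking of texts for the second query
```

Each inner list corresponds to one query, which mirrors the 2D result shape returned when a list of queries is passed.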

Usage

Use Similarity when you need a flexible similarity pipeline that can switch between zero-shot, cross-encoder, and late interaction backends. It is the primary similarity interface used by other txtai components such as the Reranker pipeline.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/text/similarity.py

Signature

class Similarity(Labels):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, lateencode=False, **kwargs)
    def __call__(self, query, texts, multilabel=True, **kwargs)
    def encode(self, data, category)

Import

from txtai.pipeline.text.similarity import Similarity

I/O Contract

Inputs

Name	Type	Required	Description
path	str	No	Model path; accepts Hugging Face model hub id or local path.
quantize	bool	No	If True, quantizes the model to int8 (CPU only). Defaults to False.
gpu	bool or int	No	True/False to enable GPU, or a specific GPU device id. Defaults to True.
model	Pipeline	No	Optional existing pipeline model to wrap.
dynamic	bool	No	If True (default), uses zero-shot classification. If False, uses standard text classification.
crossencode	bool	No	If True, uses a cross-encoder backend. Defaults to False.
lateencode	bool	No	If True, uses a late interaction encoder backend (e.g. ColBERT). Defaults to False.
query	str or list	Yes (call)	Query text or list of query texts.
texts	list	Yes (call)	List of candidate text strings to compare against the query.
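multilabel	bool or None	No (call)	Score normalization mode. True scores each candidate independently (sigmoid), False normalizes scores to sum to 1, None returns raw scores. Defaults to True.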
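To make the multilabel modes concrete, the following sketch applies each normalization to a list of raw scores. It assumes that True applies a sigmoid to each score independently, that False applies a softmax so the scores sum to 1, and that None leaves the raw scores untouched. The False and None behaviour follows txtai's documented convention for its classification pipelines and should be verified against the installed version. The normalize function and the sample logits are hypothetical and only illustrate the arithmetic.

```python
import math


def normalize(logits, multilabel=True):
    """
    Hypothetical helper illustrating the three score normalization modes.
    """
    if multilabel is None:
        # Raw scores, no normalization
        return list(logits)

    if multilabel:
        # Sigmoid: each score is mapped to (0, 1) independently
        return [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Softmax: scores are normalized to sum to 1 across all candidates
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


logits = [2.0, 0.0, -2.0]  # made-up raw model outputs

independent = normalize(logits, multilabel=True)
exclusive = normalize(logits, multilabel=False)
raw = normalize(logits, multilabel=None)
```

With the sigmoid, a strong match does not lower the scores of the other candidates, which is usually what a ranking task wants. The softmax forces the candidates to compete for a fixed total.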

Outputs

Name	Type	Description
result	list of (int, float)	List of (id, score) tuples sorted by descending score, where id is the index of the text in texts. If query is a string, returns a 1D list. If query is a list, returns a 2D list with one row per query.
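Because the result shape depends on the query type, calling code often normalizes it before use. The sketch below shows one way to map ids back to their texts for both shapes. The attach_texts helper is hypothetical, and the result lists are made-up values standing in for real pipeline output.

```python
def attach_texts(query, texts, results):
    """
    Hypothetical helper: replaces each id with the text it refers to.
    Returns one ranked list of (text, score) per query.
    """
    # A string query yields a flat list; wrap it so both shapes are handled alike
    rows = [results] if isinstance(query, str) else results
    return [[(texts[uid], score) for uid, score in row] for row in rows]


texts = ["AI is intelligence", "Python is a language"]

# Made-up result for a single string query (1D list)
single = attach_texts("What is AI?", texts, [(0, 0.9), (1, 0.1)])

# Made-up result for a list of two queries (2D list)
multi = attach_texts(
    ["What is AI?", "What is Python?"],
    texts,
    [[(0, 0.9), (1, 0.1)], [(1, 0.8), (0, 0.2)]],
)
```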

Usage Examples

from txtai.pipeline.text.similarity import Similarity

# Zero-shot similarity (default)
similarity = Similarity()
results = similarity("What is machine learning?", [
    "Machine learning is a type of AI",
    "Python is a programming language",
    "Neural networks process data"
])
# Returns (id, score) tuples sorted by descending score,
# e.g. [(0, 0.92), (2, 0.45), (1, 0.08)] (scores shown are illustrative)

# Cross-encoder similarity
similarity = Similarity("cross-encoder/ms-marco-MiniLM-L-6-v2", crossencode=True)
results = similarity("What is AI?", ["AI is intelligence", "Python is a language"])

# Late interaction similarity (ColBERT)
similarity = Similarity("colbert-ir/colbertv2.0", lateencode=True)
results = similarity("What is AI?", ["AI is intelligence", "Python is a language"])

# Multiple queries: returns a 2D list with one ranked list of (id, score) per query
results = similarity(["What is AI?", "What is Python?"], ["AI is intelligence", "Python is a language"])

Related Pages

Environment:Neuml_Txtai_Python_Core_Dependencies
