Implementation:Neuml Txtai Similarity Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, NLP, Semantic Similarity, Transformers |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool for computing semantic similarity between a query and a list of candidate texts, using one of three backend strategies provided by txtai.
Description
Similarity extends Labels and provides a unified similarity interface with three backend strategies: zero-shot classification (the default, inherited from Labels), cross-encoder scoring (via CrossEncoder), and late interaction scoring (via LateEncoder). The backend is selected at initialization from the crossencode and lateencode flags. In zero-shot mode, each query is used as a candidate label and the texts are classified against it; the resulting scores are then transposed so that each query gets its own ranking of texts. All modes return (id, score) tuples sorted by descending score, where id is the index of the text in the input list.
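The transposition step in zero-shot mode can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name `rank_by_query`, its `single_query` parameter, and the score values are all hypothetical, and the input is assumed to be one row per text holding (query_id, score) pairs in arbitrary order.

```python
def rank_by_query(label_scores, single_query):
    """
    Hypothetical helper: turns per-text label scores into per-query rankings.

    label_scores: one row per text, each row a list of (query_id, score) pairs.
    """
    # Order each row by query id so that columns line up across texts
    rows = [[score for _, score in sorted(row)] for row in label_scores]

    # Transpose: rows become queries, columns become texts
    per_query = [list(column) for column in zip(*rows)]

    # Rank text ids for each query by descending score
    ranked = [sorted(enumerate(row), key=lambda x: x[1], reverse=True) for row in per_query]

    # A single string query yields a 1D list, a list of queries yields a 2D list
    return ranked[0] if single_query else ranked


# Made-up scores for 3 texts classified against 2 queries
label_scores = [
    [(0, 0.9), (1, 0.1)],
    [(1, 0.8), (0, 0.2)],
    [(0, 0.5), (1, 0.4)],
]

ranked = rank_by_query(label_scores, single_query=False)
# ranked[0] ranks the texts for query 0, ranked[1] for query 1
```

Each inner list holds (text id, score) tuples, which matches the output shape described for all three backends.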
Usage
Use Similarity when you need a flexible similarity pipeline that can switch between zero-shot, cross-encoder, and late interaction backends. It is the primary similarity interface used by other txtai components such as the Reranker pipeline.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/text/similarity.py
Signature
class Similarity(Labels):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, lateencode=False, **kwargs)
    def __call__(self, query, texts, multilabel=True, **kwargs)
    def encode(self, data, category)
Import
from txtai.pipeline.text.similarity import Similarity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path; accepts Hugging Face model hub id or local path. |
| quantize | bool | No | If True, quantizes the model to int8 (CPU only). Defaults to False. |
| gpu | bool or int | No | True/False to enable GPU, or a specific GPU device id. Defaults to True. |
| model | Pipeline | No | Optional existing pipeline model to wrap. |
| dynamic | bool | No | If True (default), uses zero-shot classification. If False, uses standard text classification. |
| crossencode | bool | No | If True, uses a cross-encoder backend. Defaults to False. |
| lateencode | bool | No | If True, uses a late interaction encoder backend (e.g. ColBERT). Defaults to False. |
| query | str or list | Yes (call) | Query text or list of query texts. |
| texts | list | Yes (call) | List of candidate text strings to compare against the query. |
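| multilabel | bool or None | No (call) | Score normalization mode. If True (default), each text is scored independently (sigmoid). If False, scores are normalized to sum to 1. If None, raw scores are returned. |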
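
The effect of the multilabel setting on raw model scores can be illustrated with a small sketch. It assumes that True applies a sigmoid to each score independently, False applies a softmax so scores sum to 1, and None leaves scores untouched. The `normalize` function is a hypothetical stand-in, not a txtai API, and the exact normalization may differ between backends.

```python
import math


def normalize(scores, multilabel=True):
    """Hypothetical stand-in for the score normalization modes."""
    if multilabel is None:
        # Raw scores, returned unchanged
        return list(scores)

    if multilabel:
        # Sigmoid: each score is mapped to (0, 1) independently of the others
        return [1.0 / (1.0 + math.exp(-x)) for x in scores]

    # Softmax: scores compete and sum to 1 (max subtracted for numerical stability)
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0, -1.0]  # made-up raw scores
independent = normalize(logits, multilabel=True)
competing = normalize(logits, multilabel=False)
raw = normalize(logits, multilabel=None)
```

With independent scoring, several texts can score close to 1 at once. With softmax, a high score for one text lowers the others.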
Outputs
| Name | Type | Description |
|---|---|---|
| result | list of (int, float) | List of (id, score) tuples sorted by descending score, where id is the index of the text in texts. If query is a string, returns a 1D list. If query is a list, returns a 2D list with one row per query. |
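
Because each id is an index into the input list, results are typically mapped back to the original texts. A minimal sketch with made-up result values, covering both the 1D and 2D shapes:

```python
texts = ["AI is intelligence", "Python is a language"]

# Illustrative 1D result for a single string query (scores are made up)
results = [(1, 0.87), (0, 0.12)]

# Resolve each id back to its text, preserving the ranking order
ranked_texts = [(texts[uid], score) for uid, score in results]

# Illustrative 2D result for a list of two queries (scores are made up)
batch_results = [
    [(0, 0.91), (1, 0.05)],
    [(1, 0.87), (0, 0.12)],
]

# Top-ranked text per query: first tuple of each row, id at position 0
best_per_query = [texts[row[0][0]] for row in batch_results]
```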
Usage Examples
from txtai.pipeline.text.similarity import Similarity
# Zero-shot similarity (default)
similarity = Similarity()
results = similarity("What is machine learning?", [
    "Machine learning is a type of AI",
    "Python is a programming language",
    "Neural networks process data"
])
# Returns (id, score) tuples sorted by descending score, for example
# [(0, 0.92), (2, 0.45), (1, 0.08)] (illustrative values; actual scores depend on the model)
# Cross-encoder similarity
similarity = Similarity("cross-encoder/ms-marco-MiniLM-L-6-v2", crossencode=True)
results = similarity("What is AI?", ["AI is intelligence", "Python is a language"])
# Late interaction similarity (ColBERT)
similarity = Similarity("colbert-ir/colbertv2.0", lateencode=True)
results = similarity("What is AI?", ["AI is intelligence", "Python is a language"])
# Multiple queries (reuses the late interaction instance above); returns one ranked list per query
results = similarity(["What is AI?", "What is Python?"], ["AI is intelligence", "Python is a language"])