Implementation:Neuml Txtai Scan
| Knowledge Sources | |
|---|---|
| Domains | Embeddings, Search, Query Execution, Hybrid Search |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Scan is the concrete tool txtai provides for scanning indexes to resolve query matches, including hybrid search with configurable weights.
Description
The Scan class executes parsed query clauses against embeddings indexes. It acts as the query execution engine that routes similar() function calls from parsed SQL-style queries to the appropriate index search functions.
Key features:
- Multi-index routing: Query clauses are grouped by target index name and executed in batches. Each clause can optionally specify a target index via the similar() function parameters.
- Candidate management: The number of candidates (results retrieved from each index query) is configurable per clause. When no clause sets it, the default depends on the WHERE clause: a simple, single-token WHERE clause uses the query limit, while a multi-token WHERE clause uses 10x the limit so that enough candidates survive the additional filtering.
- Hybrid score weights: Each clause can specify a weight parameter for hybrid dense/sparse score combination. The maximum weight across clauses is used for the batch.
- Bind parameter resolution: Supports named bind parameters (prefixed with ":") in similar() clause arguments, resolved against a parameters dictionary.
- Clause parsing: The companion Clause class parses similar() function parameters into structured objects with text, index, candidates, and weights attributes. The first parameter is the query text; the remaining parameters are distinguished by type: integers are candidates, floats are weights, and strings are index names.
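The type-based rule in the last bullet can be illustrated with a small standalone sketch. `parse_similar_params` is a hypothetical helper written for this page, not part of txtai. It assumes the first parameter is always the query text and that numeric values arrive as Python `int` or `float`; how the real Clause class treats numeric strings or other edge cases is not shown here.

```python
def parse_similar_params(params):
    """Hypothetical helper: split similar() parameters into named fields."""
    # Assumption: the first parameter is always the query text
    clause = {"text": params[0], "index": None, "candidates": None, "weights": None}

    for param in params[1:]:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(param, int) and not isinstance(param, bool):
            clause["candidates"] = param
        elif isinstance(param, float):
            clause["weights"] = param
        elif isinstance(param, str):
            clause["index"] = param

    return clause


# Parameter order after the text does not matter, only the type does
print(parse_similar_params(["machine learning", 100, 0.5, "sparse"]))
print(parse_similar_params(["machine learning", "sparse", 0.5]))
```

Dispatching on type rather than position is what lets a query pass only the parameters it needs, for example a weight without a candidate count.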
The execution flow is:
- Extract similar() clauses from the parsed queries, grouping them by target index.
- Determine candidate counts and weights for each index group.
- Execute batch searches via the provided search function.
- Collect results keyed by query clause UID and sort by query order.
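The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the txtai implementation: `run_scan`, `fake_search`, the dictionary-based clause records, and the `"dense"` default index name are all hypothetical. The sketch also assumes that the largest explicit candidate count in a group wins, mirroring the maximum-weight rule described earlier.

```python
from collections import defaultdict


def fake_search(queries, candidates, weights, index):
    """Stand-in for an index search function; returns one result list per query."""
    return [[(f"{index}:{query}", 1.0)] for query in queries]


def run_scan(clauses, search, default_candidates, default_weights, default_index):
    """Hypothetical sketch of the four steps: group, configure, search, collect."""
    # Step 1: group clauses by target index (None means "use the default index")
    groups = defaultdict(list)
    for clause in clauses:
        groups[clause["index"]].append(clause)

    results = {}
    for index, group in groups.items():
        # Step 2: pick one candidate count and one weight per group.
        # Assumption: the largest explicit value wins, else the default applies.
        candidates = [c["candidates"] for c in group if c["candidates"]]
        candidates = max(candidates) if candidates else default_candidates
        weights = [c["weights"] for c in group if c["weights"] is not None]
        weights = max(weights) if weights else default_weights

        # Step 3: run one batch search per index group
        texts = [c["text"] for c in group]
        batch = search(texts, candidates, weights, index or default_index)

        # Step 4: key each result by clause UID, keeping the parent query id
        for clause, result in zip(group, batch):
            results[clause["uid"]] = (clause["qid"], result)

    # Return (query_id, results) tuples ordered by clause UID
    return [results[uid] for uid in sorted(results)]


clauses = [
    {"uid": 0, "qid": 0, "text": "machine learning", "index": None, "candidates": None, "weights": None},
    {"uid": 1, "qid": 1, "text": "web development", "index": "sparse", "candidates": 50, "weights": None},
    {"uid": 2, "qid": 1, "text": "language models", "index": None, "candidates": None, "weights": 0.5},
]
output = run_scan(clauses, fake_search, default_candidates=10, default_weights=None, default_index="dense")
for qid, result in output:
    print(qid, result)
```

Batching per index keeps the number of search calls equal to the number of distinct target indexes, while the UID sort restores the original clause order regardless of how the groups were processed.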
Usage
Scan serves as the internal query execution engine for embeddings search. The embeddings search pipeline instantiates it to execute the index-scanning portion of parsed SQL-style queries. Understanding Scan is useful for debugging search behavior, customizing hybrid search weights, or working with multi-index configurations.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/embeddings/search/scan.py
Signature
class Scan:
    def __init__(self, search, limit, weights, index)
    def __call__(self, queries, parameters) -> list
    def parse(self, queries, parameters) -> dict
    def bind(self, similar, parameters) -> list
    def default(self, queries) -> int

class Clause:
    def __init__(self, uid, qid, params)
    def parse(self, params)
Import
from txtai.embeddings.search.scan import Scan, Clause
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| search | callable | Yes | Index search function accepting (queries, candidates, weights, index) and returning batch results. |
| limit | int | Yes | Default maximum results per query. |
| weights | float | No | Default hybrid score weights for dense/sparse combination. |
| index | str | No | Default index name when no index is specified in the query clause. |
| queries | list[dict] | Yes (__call__) | List of parsed query dictionaries. Each may contain similar (list of parameter lists) and where (SQL WHERE clause string) keys. |
| parameters | list[dict] | No | List of bind parameter dictionaries, one per query. Used to resolve ":" prefixed placeholders in similar() arguments. |
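To make the shapes of `queries` and `parameters` concrete, the following sketch shows how ":"-prefixed placeholders might be resolved. `resolve_bind` is a hypothetical helper, not the library's `bind` method, and the input values are illustrative. It assumes that a placeholder with no matching key is passed through unchanged.

```python
def resolve_bind(similar, parameters):
    """Hypothetical helper: replace ":name" placeholders with bound values."""
    resolved = []
    for param in similar:
        # Only string parameters that start with ":" are treated as placeholders
        if isinstance(param, str) and param.startswith(":") and param[1:] in parameters:
            resolved.append(parameters[param[1:]])
        else:
            # Assumption: anything else, including unknown placeholders, passes through
            resolved.append(param)
    return resolved


# One parsed query with one similar() parameter list, plus its bind parameters
queries = [{"similar": [[":query", 100]]}]
parameters = [{"query": "machine learning"}]

for query, params in zip(queries, parameters):
    print([resolve_bind(similar, params) for similar in query["similar"]])
```

Because `parameters` holds one dictionary per query, the same placeholder name can resolve to different values in different queries of the same batch.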
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[tuple[int, list]] | Returned by __call__: list of (query_id, results) tuples sorted by query clause UID. Each results entry is a list of (id, score) tuples from the index search. |
| parsed clauses | dict | Returned by parse: dictionary mapping index names to lists of Clause objects. |
| default candidates | int | Returned by default: the candidate count used when no clause sets one, which is limit for simple queries and limit * 10 for multi-token WHERE clauses. |
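The default candidate rule can be written out as a short sketch. `default_candidates` is a hypothetical function, and it rests on two assumptions: "multi-token" means more than one whitespace-separated token in the parsed WHERE string, and each similar() call has already been reduced to a single placeholder token there (shown as the made-up `<similar-0>`).

```python
def default_candidates(queries, limit):
    """Hypothetical helper: limit for simple queries, limit * 10 otherwise."""
    # Assumption: a WHERE string with more than one whitespace-separated
    # token means extra filters will discard some of the candidates
    multitoken = any(len((query.get("where") or "").split()) > 1 for query in queries)
    return limit * 10 if multitoken else limit


simple = [{"where": "<similar-0>"}]
filtered = [{"where": "<similar-0> AND category = 'AI'"}]

print(default_candidates(simple, 3))
print(default_candidates(filtered, 3))
```

Over-fetching by a fixed factor is a trade-off: it costs a larger index query but reduces the chance that post-filtering leaves fewer rows than the requested limit.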
Usage Examples
from txtai.embeddings import Embeddings
# Scan is used internally by embeddings search
# The following demonstrates the query patterns that invoke Scan
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})
embeddings.index([
    {"id": 0, "text": "machine learning models", "category": "AI"},
    {"id": 1, "text": "web application development", "category": "dev"},
    {"id": 2, "text": "natural language processing", "category": "AI"},
])
# Simple similar() query - Scan executes with default candidates
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('machine learning')"
)
# similar() with candidate count - Scan uses 100 candidates
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('machine learning', 100)"
)
# similar() with hybrid weight - Scan passes 0.5 weight to search
# (the weight only matters when hybrid dense/sparse search is configured)
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('machine learning', 0.5)"
)
# similar() with target index - Scan routes to named subindex
# (assumes a subindex named 'sparse' is configured; the config above does not define one)
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('machine learning', 'sparse')"
)
# Bind parameters - Scan resolves :query placeholder
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar(:query)",
    parameters={"query": "machine learning"}
)
# Multi-clause WHERE - Scan uses 10x candidates for better filtering
results = embeddings.search(
    "SELECT id, text, score FROM txtai WHERE similar('machine learning') AND category = 'AI'"
)