Principle:Neuml Txtai SQL Query Processing

From Leeroopedia
Revision as of 17:20, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Neuml_Txtai_SQL_Query_Processing.md)


Knowledge Sources
Domains Query_Processing, SQL
Last Updated 2026-02-09 17:00 GMT

Overview

SQL Query Processing is txtai's pipeline for translating user queries into combined similarity and relational operations, parsing SQL-like syntax into an execution plan that coordinates vector search with database filtering.

Description

txtai exposes a SQL-like query language that extends standard SQL with a similar() function for embedding-based similarity search. The SQL Query Processing principle covers the full pipeline that transforms a raw query string into executed results: lexical analysis, token classification, expression tree construction, query rewriting, execution against the ANN index and the relational database, and result aggregation. This pipeline allows users to write a single query like:

SELECT text, score FROM txtai WHERE similar("machine learning") AND year >= 2023 LIMIT 20

and receive results that combine semantic relevance with attribute-based filtering.

The processing pipeline consists of four cooperating modules:

  • SQL Parser -- Tokenizes the query string and classifies each token as either a similarity token (the similar() call and its arguments) or a relational token (standard SQL keywords, column references, operators, and literals). The parser separates the query into two execution paths: the similarity component dispatched to the ANN backend, and the relational component dispatched to the RDBMS engine.
  • SQL Token -- Defines the vocabulary of recognized token types and provides utility methods for token manipulation, including quoting, escaping, and type coercion. It serves as the data model layer, ensuring each lexical element carries its type information through the pipeline.
  • SQL Expression -- Rewrites the parsed query into a form that the underlying database can execute, translating txtai-specific syntax into standard SQL and injecting similarity search results as a virtual table or CTE.
  • SQL Aggregate -- Handles cross-shard result merging in distributed deployments, combining partial results from multiple shards, applying global sorting, deduplicating entries, and enforcing the final LIMIT clause.

Usage

Use SQL Query Processing whenever you need to combine semantic similarity search with structured metadata filtering, sorting, or aggregation. It is the primary interface for querying txtai Embeddings instances that have content storage enabled. Use simple string queries (passed directly to the similarity search) when no relational filtering is needed, and switch to SQL syntax when queries require WHERE clauses, ORDER BY, GROUP BY, or LIMIT. The SQL interface also supports subqueries and nested similar() calls for multi-stage retrieval workflows.

Theoretical Basis

1. Token Classification: The SQL lexer classifies each token in the query string into one of several types: SIMILAR (the similar() function and its enclosed query text), COLUMN (references to metadata fields), OPERATOR (comparison operators, AND, OR, NOT), LITERAL (string and numeric constants), and KEYWORD (SELECT, FROM, WHERE, ORDER BY, LIMIT, GROUP BY). This classification determines how each token is routed through the execution pipeline and enables the separation of similarity and relational concerns.
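The classification step in item 1 can be illustrated with a small regex-based lexer. This is a minimal sketch, not txtai's implementation: the `classify` function, the `TOKEN_RE` pattern, and the keyword and operator sets are hypothetical names chosen for this example, and table names are lumped in with COLUMN because the list above names no separate identifier type.

```python
import re

# Hypothetical vocabularies mirroring the token types named above.
KEYWORDS = {"SELECT", "FROM", "WHERE", "ORDER", "GROUP", "BY", "LIMIT"}
LOGICAL_OPERATORS = {"AND", "OR", "NOT"}

TOKEN_RE = re.compile(
    r"""
    (?P<SIMILAR>similar\s*\([^)]*\))      # similar(...) call with its arguments
  | (?P<STRING>'[^']*'|"[^"]*")           # quoted string literal
  | (?P<NUMBER>\d+(?:\.\d+)?)             # numeric literal
  | (?P<OP><=|>=|!=|=|<|>)                # comparison operators
  | (?P<WORD>[A-Za-z_][A-Za-z_0-9]*)      # keyword, logical operator, or identifier
  | (?P<PUNCT>[,*()])                     # punctuation
    """,
    re.VERBOSE | re.IGNORECASE,
)


def classify(query):
    """Return a list of (token_type, text) pairs for a SQL-like query string."""
    tokens = []
    for match in TOKEN_RE.finditer(query):
        kind, text = match.lastgroup, match.group()
        if kind == "SIMILAR":
            tokens.append(("SIMILAR", text))
        elif kind in ("STRING", "NUMBER"):
            tokens.append(("LITERAL", text))
        elif kind == "OP":
            tokens.append(("OPERATOR", text))
        elif kind == "WORD":
            upper = text.upper()
            if upper in KEYWORDS:
                tokens.append(("KEYWORD", upper))
            elif upper in LOGICAL_OPERATORS:
                tokens.append(("OPERATOR", upper))
            else:
                # Any other identifier (column or table name) is treated as COLUMN.
                tokens.append(("COLUMN", text))
        else:
            tokens.append(("PUNCT", text))
    return tokens


query = 'SELECT text, score FROM txtai WHERE similar("machine learning") AND year >= 2023 LIMIT 20'
for token_type, text in classify(query):
    print(f"{token_type:<9}{text}")
```

The sketch deliberately keeps the whole similar() call as one token, so the similarity component can be routed separately from the relational tokens. It does not handle parentheses inside the quoted search text.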

2. Expression Tree Rewriting: The parsed token stream is assembled into an expression tree that separates the similarity predicate from relational predicates. The tree is then rewritten: the similar() node is replaced with a reference to a temporary results table populated by the ANN search, and the remaining relational nodes are compiled into standard SQL that operates over this temporary table joined with the content database. This rewriting step is where txtai's custom SQL dialect is translated into the target database's native syntax.
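The rewriting idea in item 2 can be sketched with Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the `docs` content table, the `hits` temporary table, the `rewrite` and `run` helpers, and the precomputed ANN scores are hypothetical, and a regex substitution stands in for real expression-tree manipulation.

```python
import re
import sqlite3

# Matches similar('...') or similar("...") and captures the search text.
SIMILAR_RE = re.compile(r"similar\s*\(\s*(['\"])(.*?)\1\s*\)", re.IGNORECASE)
FROM_RE = re.compile(r"\bFROM\s+txtai\b", re.IGNORECASE)


def rewrite(query):
    """Split a query into (similarity_text, standard_sql).

    The similar() predicate is replaced by a test against the hypothetical
    `hits` table, and FROM txtai becomes a join of `docs` with `hits`.
    """
    match = SIMILAR_RE.search(query)
    if match is None:
        return None, query
    # The join below already restricts rows to ANN hits, so the predicate
    # only needs to stay syntactically valid inside the WHERE clause.
    sql = SIMILAR_RE.sub("hits.score IS NOT NULL", query, count=1)
    sql = FROM_RE.sub("FROM docs JOIN hits USING (id)", sql, count=1)
    return match.group(2), sql


def run(query, ann_hits, rows):
    """Execute the rewritten query against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, year INTEGER)")
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)
    # Temporary results table populated from the (here precomputed) ANN search.
    conn.execute("CREATE TEMP TABLE hits (id INTEGER PRIMARY KEY, score REAL)")
    conn.executemany("INSERT INTO hits VALUES (?, ?)", ann_hits)
    _, sql = rewrite(query)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


rows = [
    (1, "intro to machine learning", 2024),
    (2, "deep learning survey", 2021),
    (3, "cooking pasta", 2023),
]
ann_hits = [(1, 0.92), (2, 0.85)]  # made-up (id, score) pairs standing in for ANN output
query = (
    'SELECT text, score FROM txtai WHERE similar("machine learning") '
    "AND year >= 2023 ORDER BY score DESC LIMIT 20"
)
print(rewrite(query)[1])
print(run(query, ann_hits, rows))
```

In this toy data set, document 2 is an ANN hit but fails the year filter, and document 3 passes the filter but is not an ANN hit, so only document 1 survives both paths.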

3. Query Execution Order: The default execution strategy is similarity-first: the ANN backend returns the top-N candidates (where N is the requested limit multiplied by a configurable oversampling factor, typically giving limit * 10), and these candidates are then filtered by relational predicates in the RDBMS. This order is efficient when the similarity search is selective. An alternative filter-first strategy applies relational predicates before similarity search, useful when metadata filters eliminate most of the corpus and the remaining set is small enough for brute-force similarity scoring.
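The trade-off between the two strategies in item 3 can be seen in a toy comparison. This is a sketch under stated assumptions: the function names, the `oversample` parameter, and the document layout are hypothetical, and an exact sort by dot product stands in for the ANN backend.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_first(query_vec, docs, predicate, limit, oversample=10):
    """Take the top limit * oversample documents by score, then filter them."""
    n = limit * oversample
    # Exact ranking used as a stand-in for an approximate nearest neighbor call.
    candidates = sorted(docs, key=lambda d: dot(query_vec, d["vector"]), reverse=True)[:n]
    return [d["id"] for d in candidates if predicate(d)][:limit]


def filter_first(query_vec, docs, predicate, limit):
    """Filter the whole corpus first, then brute-force score the survivors."""
    survivors = [d for d in docs if predicate(d)]
    ranked = sorted(survivors, key=lambda d: dot(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in ranked[:limit]]


# 100 toy documents: similarity decreases with id, and only ids >= 90 are from 2023.
docs = [
    {"id": i, "vector": (float(100 - i),), "year": 2023 if i >= 90 else 2020}
    for i in range(100)
]
query_vec = (1.0,)


def is_recent(d):
    return d["year"] == 2023


print(similarity_first(query_vec, docs, is_recent, limit=2))  # candidates are ids 0..19
print(filter_first(query_vec, docs, is_recent, limit=2))
```

With this data the filter removes every one of the 20 oversampled candidates, so the similarity-first path returns no results while the filter-first path still finds the two best matching 2023 documents. When the filter is less restrictive, both paths return the same ids.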

4. Shard-Level Aggregation: In a distributed index with S shards, each shard executes the full query locally, returning up to limit results. The Aggregate module collects up to S * limit candidate results, sorts them globally by score, removes duplicates (documents that were indexed in multiple shards), and returns the top limit entries. This scatter-gather pattern guarantees correct global ordering, because any document in the global top limit must also rank within the top limit of its own shard, at the cost of transferring up to S * limit results over the network.
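The gather step in item 4 can be sketched in a few lines. The `aggregate` function and the (id, score) result shape are hypothetical, and keeping the highest score for a duplicated document is an assumption of this sketch rather than documented behavior.

```python
import heapq


def aggregate(shard_results, limit):
    """Merge per-shard lists of (id, score) pairs into a global top-`limit` list."""
    best = {}
    for results in shard_results:
        for doc_id, score in results:
            # Deduplicate by id, keeping the highest score seen (assumed policy).
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Global sort by score, truncated to the final LIMIT.
    return heapq.nlargest(limit, best.items(), key=lambda item: item[1])


# Three shards, each returning its local top 2; document "a" lives in two shards.
shard_results = [
    [("a", 0.9), ("b", 0.5)],
    [("c", 0.8), ("a", 0.7)],
    [("d", 0.6), ("e", 0.1)],
]
print(aggregate(shard_results, limit=2))
```

Here S = 3 and limit = 2, so six candidates are gathered, the duplicate entry for "a" collapses to a single row, and the two highest-scoring documents are returned.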

5. SQL Dialect Translation: txtai's SQL syntax is a superset of standard SQL with the similar() extension. The Expression module translates this dialect into the target database's native SQL:

  • For SQLite, it generates standard SQLite-compatible queries
  • For PostgreSQL, it adapts syntax for PG-specific features (e.g., tsvector for full-text search, pgvector operators for vector similarity)

This abstraction keeps query code portable across database backends and ensures that switching between SQLite and PostgreSQL requires no changes to application-level queries.
