Principle:Neuml Txtai Content Storage

From Leeroopedia


Knowledge Sources
Domains: Database, Content_Management
Last Updated: 2026-02-09 17:00 GMT

Overview

Content Storage is txtai's relational database layer for persisting document content and metadata alongside vector indexes, enabling combined SQL queries and similarity search in a single system.

Description

When txtai indexes documents, the vector representations are stored in an ANN backend, but the original text, metadata fields, and document identifiers need a separate storage mechanism. The Content Storage principle addresses this by providing an RDBMS-backed document store that runs in parallel with the vector index. When content storage is enabled (via the content configuration flag), each indexed document's full text, numeric id, and any user-defined metadata columns are persisted in a relational database table. This creates a dual-index architecture where the ANN backend handles similarity ranking and the RDBMS handles attribute filtering, text retrieval, and aggregation.

The content storage layer is built on a common Database base class with a concrete RDBMS implementation that supports both SQLite (embedded, default) and PostgreSQL (client-server). The RDBMS class manages table creation, batch inserts, updates, deletes, and SQL query execution. The schema is dynamically generated from the document metadata: each unique metadata key becomes a column, with types inferred from the data (text, integer, real). A special text column holds the document content, and an id column serves as the primary key linking rows to their corresponding vectors in the ANN index.

The power of content storage emerges when combined with txtai's SQL query interface. Users can write queries like:

SELECT id, text, score FROM txtai WHERE similar('search query') AND category = 'research' ORDER BY date DESC LIMIT 10

This query ranks rows by vector similarity while filtering and ordering them with ordinary relational predicates. Without content storage, txtai returns only (id, score) tuples from similarity search. With content storage enabled, results include the full document text and all metadata fields, so the system works as a complete document retrieval engine rather than only a vector index.
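As a rough illustration of the query above, the following sketch builds a small index with the content flag enabled and runs a hybrid query against it. It assumes txtai is installed and that the model named in CONFIG can be downloaded. The model choice, the sample documents, the category field, and the run_example helper are all invented for this example and are not prescribed by this page.

```python
# Sketch only: assumes `pip install txtai` and network access to fetch the model.
CONFIG = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # assumed model choice
    "content": True,  # enable content storage
}

# Hypothetical sample data as (id, data, tags) tuples with one metadata field.
DOCUMENTS = [
    (0, {"text": "Vector indexes speed up semantic search", "category": "research"}, None),
    (1, {"text": "Quarterly sales rose by ten percent", "category": "finance"}, None),
]

# Hybrid query: similarity clause plus a relational predicate on metadata.
QUERY = (
    "SELECT id, text, score FROM txtai "
    "WHERE similar('semantic retrieval') AND category = 'research' "
    "LIMIT 10"
)


def run_example():
    """Index the sample documents and run the hybrid query (hypothetical helper)."""
    # Imported lazily so this module still loads where txtai is not installed.
    from txtai import Embeddings

    embeddings = Embeddings(CONFIG)
    embeddings.index(DOCUMENTS)
    # With content enabled, each result is a dict holding the selected columns.
    return embeddings.search(QUERY)
```

Calling run_example() should return a list of dictionaries keyed by id, text, and score. With the content flag left out, the same index would only support plain similarity queries that return (id, score) tuples.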

Content storage also supports transactional updates. When documents are added, modified, or deleted through the Embeddings upsert and delete methods, the RDBMS table is updated atomically alongside the ANN index. For SQLite, this uses file-level locking; for PostgreSQL, it uses standard database transactions. This ensures that the content database and vector index remain consistent even under concurrent write operations.

Usage

Enable content storage when you need to:

  • Retrieve the original document text alongside search results
  • Filter search results by metadata attributes (dates, categories, tags)
  • Run SQL aggregation over search results (COUNT, AVG, GROUP BY)
  • Perform hybrid queries that combine similarity with relational predicates

Content storage adds modest overhead to indexing (the RDBMS insert cost) and increases disk usage proportionally to document size, so it should be omitted for pure vector-similarity workloads where only ids and scores are needed. Choose SQLite for single-process embedded deployments and PostgreSQL for multi-process or multi-machine deployments requiring concurrent access.

Theoretical Basis

1. Document-Vector Mapping: Each document is assigned a monotonically increasing integer offset that serves as its position in the ANN index. The RDBMS table stores a mapping from the user-facing id (string or integer) to this internal offset, so that similarity search results (which return offsets) can be joined back to document content with a single indexed primary key lookup per result.
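The mapping in item 1 can be sketched with Python's built-in sqlite3 module. This is a simplified illustration, not txtai's actual schema: the docs table, the offset_id column, and the resolve helper are hypothetical names, and the ANN hits are hard-coded stand-ins for real search output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: offset_id is the ANN position, id is the user-facing id.
conn.execute("CREATE TABLE docs (offset_id INTEGER PRIMARY KEY, id TEXT, text TEXT)")

user_docs = [
    ("doc-a", "first document"),
    ("doc-b", "second document"),
    ("doc-c", "third document"),
]
# Offsets increase monotonically in insertion order.
conn.executemany(
    "INSERT INTO docs (offset_id, id, text) VALUES (?, ?, ?)",
    [(offset, uid, text) for offset, (uid, text) in enumerate(user_docs)],
)


def resolve(ann_hits):
    """Join (offset, score) pairs from a pretend ANN search back to stored rows."""
    rows = []
    for offset, score in ann_hits:
        uid, text = conn.execute(
            "SELECT id, text FROM docs WHERE offset_id = ?", (offset,)
        ).fetchone()
        rows.append({"id": uid, "text": text, "score": score})
    return rows


# Pretend the ANN index returned offsets 2 and 0, best match first.
results = resolve([(2, 0.91), (0, 0.47)])
```

Each hit costs one lookup on the primary key, so resolving a result list scales with the number of hits rather than with the size of the table.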

2. Dual-Index Architecture: The system maintains two parallel indexes: (a) the ANN index over vector embeddings for similarity search, and (b) the RDBMS table with B-tree indexes over metadata columns for attribute filtering. Query execution first performs similarity search to produce a candidate set, then applies SQL predicates to filter and rank the candidates, or vice versa depending on the query plan. This architecture decouples the concerns of semantic retrieval and structured data management.

3. Schema Inference: Column schemas are inferred dynamically during the first index operation. For each metadata key across all documents, the system determines the most specific type that accommodates all observed values: integer if all values are integral, real if any are floating-point, and text otherwise. This schema is stored and enforced for subsequent upsert operations. New metadata keys encountered in later upserts trigger an ALTER TABLE to add the corresponding column.
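A minimal sketch of the inference rule in item 3 follows. The function names infer_type and infer_schema are hypothetical, and treating booleans and missing values as shown is a choice made for this example rather than documented behavior.

```python
def _is_int(value):
    # bool is a subclass of int in Python; this sketch does not count it as integral.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return _is_int(value) or isinstance(value, float)


def infer_type(values):
    """Return 'integer', 'real', or 'text' for the observed values of one key."""
    present = [v for v in values if v is not None]
    if present and all(_is_int(v) for v in present):
        return "integer"
    if present and all(_is_number(v) for v in present):
        return "real"  # at least one float among otherwise numeric values
    return "text"


def infer_schema(documents):
    """Collect values per metadata key across all documents, then infer each type."""
    observed = {}
    for doc in documents:
        for key, value in doc.items():
            observed.setdefault(key, []).append(value)
    return {key: infer_type(values) for key, values in observed.items()}


schema = infer_schema([
    {"year": 2021, "rating": 4, "tag": "a"},
    {"year": 2022, "rating": 4.5, "tag": 7},
])
# schema == {"year": "integer", "rating": "real", "tag": "text"}
```

Here rating widens to real because one value is a float, and tag falls back to text because its values mix a string with a number.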

4. Hybrid Query Execution: When a SQL query contains both a similar() function call and relational predicates, txtai's query engine decomposes the query into two phases. The similarity phase produces (id, score) pairs from the ANN index. These are injected into the RDBMS as a temporary result set, and the relational predicates are applied via standard SQL JOIN and WHERE clauses. The result is a unified ranked list that satisfies both semantic and relational constraints.
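The two-phase flow in item 4 can be imitated with sqlite3 alone. This is a sketch of the general technique under stated assumptions, not txtai's query engine: the docs and scores tables and the hybrid_search helper are hypothetical, and the candidate scores are hard-coded in place of a real similarity phase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        (1, "transformer models for retrieval", "research"),
        (2, "retrieval of quarterly invoices", "finance"),
        (3, "dense retrieval benchmarks", "research"),
    ],
)


def hybrid_search(candidates, category, limit=10):
    """Apply a relational filter to (id, score) pairs from a similarity phase."""
    # Phase 1 output is loaded into a temporary table.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS scores (id INTEGER PRIMARY KEY, score REAL)"
    )
    conn.execute("DELETE FROM scores")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", candidates)
    # Phase 2 joins the scores to content rows and applies the predicate.
    return conn.execute(
        "SELECT d.id, d.text, s.score FROM docs d "
        "JOIN scores s ON s.id = d.id "
        "WHERE d.category = ? ORDER BY s.score DESC LIMIT ?",
        (category, limit),
    ).fetchall()


# Stand-in for ANN output: document 2 scores highest but is filtered out.
hits = hybrid_search([(1, 0.62), (2, 0.88), (3, 0.75)], "research")
```

The highest-scoring candidate is dropped because it fails the category predicate, which shows why the similarity phase usually needs to return more candidates than the final limit.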

5. Column Storage and Retrieval: The columns configuration parameter allows users to declare which metadata fields should be stored as indexed columns versus stored as unindexed JSON. Indexed columns support fast filtering and sorting via B-tree indexes; JSON-stored fields are available in results but cannot be used in WHERE clauses efficiently. This distinction lets users balance query flexibility against storage and indexing overhead for schemas with many optional metadata fields.
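The trade-off in item 5 can be illustrated generically with sqlite3 and the json module. The layout below is an assumption made for this example and does not reflect how txtai's columns parameter is implemented: the INDEXED set, the extra column, and both helper functions are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: category is a declared, indexed column;
# all other metadata is serialized into the extra column as JSON text.
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, category TEXT, extra TEXT)"
)
conn.execute("CREATE INDEX idx_docs_category ON docs (category)")

INDEXED = {"category"}


def insert(doc_id, text, metadata):
    """Split metadata into the declared column and a JSON blob for the rest."""
    undeclared = {k: v for k, v in metadata.items() if k not in INDEXED}
    conn.execute(
        "INSERT INTO docs VALUES (?, ?, ?, ?)",
        (doc_id, text, metadata.get("category"), json.dumps(undeclared)),
    )


def find_by_category(category):
    """Filter on the indexed column in SQL, then unpack JSON fields in Python."""
    rows = conn.execute(
        "SELECT id, text, extra FROM docs WHERE category = ? ORDER BY id",
        (category,),
    ).fetchall()
    return [{"id": i, "text": t, **json.loads(extra)} for i, t, extra in rows]


insert(1, "dense retrieval benchmarks", {"category": "research", "venue": "workshop"})
insert(2, "quarterly invoices", {"category": "finance", "reviewer": "ops"})
found = find_by_category("research")
```

The venue and reviewer fields come back with the results, but filtering on them would require scanning and parsing every row, whereas category can be served from the index.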
