Workflow:Neuml Txtai Semantic Search Pipeline

Knowledge Sources	txtai, txtai Embeddings Docs, txtai Indexing Guide, txtai Query Guide
Domains	Semantic_Search, Embeddings, Vector_Databases
Last Updated	2026-02-09 18:00 GMT

Overview

End-to-end process for building a semantic search application that indexes documents into a vector database and retrieves results by meaning rather than keyword matching.

Description

This workflow outlines the standard procedure for creating a semantic search system using txtai's Embeddings class. It covers initializing a vector model, ingesting and indexing documents (text, tuples, or dictionaries), persisting the index to disk, and performing similarity-based queries. The Embeddings class orchestrates dense vector indexes (Faiss, HNSW, etc.), optional sparse scoring (BM25, TF-IDF), a document database (SQLite, DuckDB, PostgreSQL), and an optional knowledge graph into a single unified search interface. Queries can be written in natural language or as SQL that uses a custom SIMILAR() function to combine semantic matching with structured filters.

Usage

Execute this workflow when you have a collection of text documents (articles, product descriptions, FAQ entries, book descriptions, etc.) and need to build a search system that understands meaning and intent. This is the foundational txtai use case and serves as a prerequisite for more advanced workflows like RAG and Agent Orchestration.

Execution Steps

Step 1: Configure the Embeddings Instance

Initialize an Embeddings object with the desired configuration. At minimum, specify a vector model path (e.g., a Sentence Transformers model). Optionally enable content storage to persist full document text alongside vectors, configure hybrid search with sparse scoring, set up a graph network for relationship analysis, or select an alternative ANN backend.

Key considerations:

The path parameter selects the vector embedding model (Hugging Face, llama.cpp, LiteLLM, etc.)
Setting content=True enables a document database for storing metadata and full text
Hybrid search requires configuring both a dense vector model and a sparse scoring method
Backend selection (Faiss, HNSW, NumPy, SQLite, PGVector) depends on dataset size and deployment requirements
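
The considerations above can be sketched as a small configuration helper. This is a minimal sketch, assuming txtai is installed: the model name sentence-transformers/all-MiniLM-L6-v2 and the helper names make_config and build_embeddings are illustrative choices, not part of the workflow. The keys path, content, hybrid, and backend follow the txtai configuration documentation and should be checked against the installed version.

```python
def make_config(model="sentence-transformers/all-MiniLM-L6-v2",
                content=True, hybrid=False, backend="faiss"):
    """Return an Embeddings configuration dict (hypothetical helper)."""
    config = {
        "path": model,       # vector embedding model (assumed model name)
        "content": content,  # store full text and metadata in the document database
        "backend": backend,  # ANN backend used for the dense index
    }
    if hybrid:
        # Request a sparse keyword index alongside the dense vector index
        config["hybrid"] = True
    return config


def build_embeddings(config):
    """Create an Embeddings instance from a config dict.

    txtai is imported lazily so the config can be built and inspected
    on a machine where txtai is not installed.
    """
    from txtai import Embeddings
    return Embeddings(config)
```

Keeping the configuration in a plain dictionary makes it easy to log, diff, or store next to the index before any model is downloaded.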

Step 2: Prepare and Ingest Documents

Format input data as strings, tuples of (id, text, tags), or dictionaries with custom fields. Stream documents into the index using the index() method. For large datasets, use generator functions to avoid loading everything into memory at once.

Key considerations:

Strings receive auto-generated sequential IDs
Tuple format (id, text, tags) allows custom IDs and metadata
Dictionary format enables storing arbitrary fields when content storage is enabled
Generators enable efficient processing of datasets that do not fit in memory
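
A generator-based ingestion path might look like the following. This is a sketch under stated assumptions: stream_documents and ingest are hypothetical helper names, each input row is assumed to be a dictionary with "id" and "text" keys, and the embeddings argument is assumed to be an Embeddings instance created with content storage enabled so the extra fields are kept.

```python
def stream_documents(rows):
    """Yield (id, data, tags) tuples one at a time (hypothetical helper).

    rows is any iterable of dicts with "id" and "text" keys, for example
    records read lazily from a file or a database cursor.
    """
    for row in rows:
        # Keep text and any metadata fields; the id travels in the tuple
        data = {key: value for key, value in row.items() if key != "id"}
        yield (row["id"], data, None)


def ingest(embeddings, rows):
    """Index rows without first materializing them as a list."""
    embeddings.index(stream_documents(rows))
```

Because stream_documents is a generator, only one row needs to be held in memory at a time on the caller's side.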

Step 3: Build and Persist the Index

After indexing, save the embeddings database to disk using the save() method. This persists the vector index, document database, configuration, and any graph data. The index can later be loaded with load() for serving queries without re-indexing.

Key considerations:

Save creates a directory containing config, vector index, and database files
Cloud storage backends (Hugging Face Hub, S3) can be configured for remote persistence
Indexes can be distributed as compressed archives
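
Saving and reloading can be wrapped in two small functions. This is a minimal sketch: persist and restore are hypothetical names, the default path "index" is an arbitrary choice, and the claim that a path ending in .tar.gz produces a compressed archive is taken from the txtai documentation and should be verified for the installed version.

```python
def persist(embeddings, path="index"):
    """Save an embeddings database and return the path that was written.

    path may be a directory name, or (assumption) an archive name such as
    "index.tar.gz" when a compressed archive is wanted.
    """
    embeddings.save(path)
    return path


def restore(path="index"):
    """Load a previously saved embeddings database for querying."""
    from txtai import Embeddings  # imported lazily; requires txtai
    embeddings = Embeddings()
    embeddings.load(path)
    return embeddings
```

A typical flow is to call persist once after indexing and restore in the serving process, so queries never trigger re-indexing.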

Step 4: Execute Semantic Queries

Run queries against the index using the search() method. Natural language queries are converted to vectors and matched against indexed documents by cosine similarity. SQL queries with the SIMILAR() function combine semantic search with structured filtering. Batch queries are supported via batchsearch().
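
To make the cosine similarity matching concrete, here is a brute-force sketch in plain Python. It is not how txtai performs the search (that work is delegated to the configured ANN backend); the function names cosine and rank and the toy two-dimensional vectors are illustrative assumptions only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vector, document_vectors, limit=3):
    """Return the top (id, score) pairs, highest similarity first."""
    scores = [(uid, cosine(query_vector, vector))
              for uid, vector in document_vectors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:limit]


# Toy vectors standing in for model output
documents = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = rank([1.0, 0.0], documents)
```

The result has the same (id, score) shape that a natural language search returns when content storage is disabled.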

Key considerations:

Natural language queries return (id, score) tuples, or dictionaries containing id, text, and score when content is enabled
SQL mode enables filtering, aggregation, and joining semantic similarity with metadata conditions
Hybrid search fuses dense and sparse results using configurable weights
The explain() method provides token-level importance scores for search transparency
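
The query modes above might be exercised as follows. This is a sketch, assuming an Embeddings instance built with content storage enabled: similar_sql and run_queries are hypothetical helpers, category is a hypothetical metadata field, and the example query strings are made up. The helper rejects queries containing single quotes rather than guessing at the SQL escaping rules.

```python
def similar_sql(query, where=None, limit=10):
    """Build a SQL statement that combines similar() with an optional filter."""
    if "'" in query:
        # Escaping rules are not assumed here; fail loudly instead
        raise ValueError("query must not contain single quotes")
    sql = f"select id, text, score from txtai where similar('{query}')"
    if where:
        sql += f" and {where}"
    return f"{sql} limit {int(limit)}"


def run_queries(embeddings):
    """Run a natural language query, a filtered SQL query, and a batch query."""
    natural = embeddings.search("feel good story", 3)
    filtered = embeddings.search(
        similar_sql("feel good story", "category = 'news'", 3)
    )
    batch = embeddings.batchsearch(["feel good story", "climate change"], 3)
    return natural, filtered, batch
```

Building the SQL in one place keeps the semantic clause and the metadata filter visible side by side, which helps when debugging unexpected result sets.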

Step 5: Update the Index (Optional)

Perform incremental updates using upsert() to add or modify documents without full re-indexing. Use delete() to remove documents. Call reindex() to rebuild with different settings or after schema changes.

Key considerations:

Upsert supports the same input formats as index (strings, tuples, dictionaries)
Deleted document IDs are tracked and excluded from search results
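Reindex allows changing the vector model or ANN backend on an existing database; it requires content storage to be enabled, because vectors are recomputed from the stored documents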
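
An incremental update pass could be sketched like this. The helper name apply_changes and its return value are assumptions made for illustration; only upsert() and delete() come from the workflow itself, and the embeddings argument is assumed to be an already indexed Embeddings instance.

```python
def apply_changes(embeddings, changed=(), removed=()):
    """Apply incremental updates and report how many items were sent.

    changed: documents in any format accepted by index()
    removed: ids of documents to delete
    """
    changed, removed = list(changed), list(removed)
    if changed:
        # Adds new documents and replaces existing ones with matching ids
        embeddings.upsert(changed)
    if removed:
        embeddings.delete(removed)
    return len(changed), len(removed)
```

After a batch of changes, saving the index again keeps the copy on disk in step with the in-memory state.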

GitHub URL

Workflow Repository