Workflow:Neuml Txtai Semantic Search Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Semantic_Search, Embeddings, Vector_Databases |
| Last Updated | 2026-02-09 18:00 GMT |
Overview
End-to-end process for building a semantic search application that indexes documents into a vector database and retrieves results by meaning rather than keyword matching.
Description
This workflow outlines the standard procedure for creating a semantic search system using txtai's Embeddings class. It covers initializing a vector model, ingesting and indexing documents (text, tuples, or dictionaries), persisting the index to disk, and performing similarity-based queries. The Embeddings class orchestrates dense vector indexes (Faiss, HNSW, etc.), optional sparse scoring (BM25, TF-IDF), a document database (SQLite, DuckDB, PostgreSQL), and an optional knowledge graph into a single unified search interface. Queries can be natural language, SQL with a custom SIMILAR() function, or hybrid combinations of both.
Usage
Execute this workflow when you have a collection of text documents (articles, product descriptions, FAQ entries, book descriptions, etc.) and need to build a search system that understands meaning and intent. This is the foundational txtai use case and serves as a prerequisite for more advanced workflows like RAG and Agent Orchestration.
Execution Steps
Step 1: Configure the Embeddings Instance
Initialize an Embeddings object with the desired configuration. At minimum, specify a vector model path (e.g., a Sentence Transformers model). Optionally enable content storage to persist full document text alongside vectors, configure hybrid search with sparse scoring, set up a graph network for relationship analysis, or select an alternative ANN backend.
Key considerations:
- The path parameter selects the vector embedding model (Hugging Face, llama.cpp, LiteLLM, etc.)
- Setting content=True enables a document database for storing metadata and full text
- Hybrid search requires both a dense vector model and a sparse scoring method; setting hybrid=True enables a default sparse (BM25) index alongside the dense index
- Backend selection (Faiss, HNSW, NumPy, SQLite, PGVector) depends on dataset size and deployment requirements
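The options above can be sketched as plain keyword arguments passed to the Embeddings constructor. This is a minimal sketch rather than a recommended setup: the model name `sentence-transformers/all-MiniLM-L6-v2`, the `CONFIGS` table, and the `build_embeddings` helper are illustrative assumptions, and the `hnsw` backend may need an optional txtai install extra.

```python
# Illustrative model choice (an assumption, not prescribed by this workflow).
MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Candidate configurations, expressed as keyword arguments for Embeddings.
CONFIGS = {
    # Dense vectors only; no document database is created.
    "minimal": {"path": MODEL},
    # Dense vectors plus a document database for full text and metadata.
    "content": {"path": MODEL, "content": True},
    # Dense vectors combined with sparse keyword scoring.
    "hybrid": {"path": MODEL, "content": True, "hybrid": True},
    # Same as "content" but with an alternative ANN backend.
    "hnsw": {"path": MODEL, "content": True, "backend": "hnsw"},
}


def build_embeddings(name="content"):
    """Create an Embeddings instance from one of the named configurations."""
    # Imported lazily so this module loads even where txtai is not installed.
    from txtai import Embeddings

    return Embeddings(**CONFIGS[name])
```

Keeping the configuration as data makes it easy to compare variants (for example `build_embeddings("hybrid")` against `build_embeddings("content")`) on the same document set before committing to one.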
Step 2: Prepare and Ingest Documents
Format input data as strings, tuples of (id, text, tags), or dictionaries with custom fields. Stream documents into the index using the index() method. For large datasets, use generator functions to avoid loading everything into memory at once.
Key considerations:
- Strings receive auto-generated sequential IDs
- Tuple format (id, text, tags) allows custom IDs and metadata
- Dictionary format enables storing arbitrary fields when content storage is enabled
- Generators enable efficient processing of datasets that do not fit in memory
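The input formats and the generator approach described above might look like the following. It is a sketch under stated assumptions: the sample texts, the `category` field, and the helpers `stream_documents` and `index_file` are hypothetical, and the source data is assumed to be a JSON Lines file with a `text` field per record.

```python
import json

# Three input shapes accepted by index(); the sample values are made up.
AS_STRINGS = ["How do I reset my password?", "Shipping takes 3 to 5 days"]
AS_TUPLES = [("faq-1", "How do I reset my password?", None)]
AS_DICTS = [{"id": "faq-1", "text": "How do I reset my password?", "category": "account"}]


def stream_documents(lines):
    """Lazily yield one dictionary per non-empty JSON line."""
    for number, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Fall back to the line number when a record carries no id of its own.
        record.setdefault("id", number)
        yield record


def index_file(embeddings, path):
    """Index a JSON Lines file without loading it fully into memory."""
    with open(path, encoding="utf-8") as handle:
        # The generator is consumed by index() while the file is still open.
        embeddings.index(stream_documents(handle))
```

Because `stream_documents` is a generator, only one record is held by this code at a time, which is what makes the approach suitable for datasets larger than memory.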
Step 3: Build and Persist the Index
After indexing, save the embeddings database to disk using the save() method. This persists the vector index, document database, configuration, and any graph data. The index can later be loaded with load() for serving queries without re-indexing.
Key considerations:
- save() creates a directory containing config, vector index, and database files
- Cloud storage backends (Hugging Face Hub, S3) can be configured for remote persistence
- Indexes can be distributed as compressed archives, for example by saving to a path ending in .tar.gz
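A save and load round trip could be wrapped as shown here. This is a hedged sketch: the helper names `persist` and `restore` and the default path `index` are assumptions for illustration, and cloud persistence is left out because its settings depend on the provider.

```python
def persist(embeddings, directory="index", archive=False):
    """Save the index and return the path that was written."""
    # A path ending in .tar.gz asks save() to write a compressed archive
    # instead of a plain directory.
    path = f"{directory}.tar.gz" if archive else directory
    embeddings.save(path)
    return path


def restore(path="index"):
    """Load a previously saved index for querying, without re-indexing."""
    # Imported lazily so this module loads even where txtai is not installed.
    from txtai import Embeddings

    embeddings = Embeddings()
    # The saved configuration (model path, backend, content flag) is read
    # back from the index, so no constructor arguments are needed here.
    embeddings.load(path)
    return embeddings
```

A serving process would typically call `restore()` once at startup and reuse the returned object for all queries.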
Step 4: Execute Semantic Queries
Run queries against the index using the search() method. Natural language queries are converted to vectors with the same model used for indexing and matched against indexed documents by vector similarity (cosine similarity with the default settings). SQL queries with the SIMILAR() function combine semantic search with structured filtering and require content storage. Batch queries are supported via batchsearch().
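To make the similarity matching concrete, here is a small worked example of ranking by cosine similarity. It is a simplified illustration, not txtai's internal code: the two-dimensional vectors are invented, whereas real embeddings come from the configured model and have hundreds of dimensions, and the functions `cosine` and `rank` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def rank(query, vectors, limit=3):
    """Return up to `limit` (id, score) pairs, best match first."""
    scores = [(uid, cosine(query, vector)) for uid, vector in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy index: ids mapped to made-up 2-D vectors.
VECTORS = {0: [1.0, 0.0], 1: [0.6, 0.8], 2: [0.0, 1.0]}

# The query vector points mostly along the first axis, so id 0 ranks first,
# followed by id 1 and then id 2.
RESULTS = rank([1.0, 0.2], VECTORS)
```

An ANN backend serves the same purpose as `rank` but avoids comparing the query against every stored vector.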
Key considerations:
- Natural language queries return a list of (id, score) tuples, or a list of dictionaries with id, text, and score fields when content is enabled
- SQL mode enables filtering, aggregation, and joining semantic similarity with metadata conditions
- Hybrid search fuses dense and sparse results using configurable weights
- The explain() method provides token-level importance scores for search transparency
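The query styles above can be sketched as thin wrappers around search() and batchsearch(). The wrapper names, the default score threshold of 0.15, and the default limit are assumptions for illustration; the SQL form assumes content storage is enabled, and the `parameters` argument is used here on the understanding that search() accepts bind parameters for SQL queries.

```python
def semantic_search(embeddings, query, limit=3):
    """Natural language query; results are ranked by similarity."""
    return embeddings.search(query, limit)


def filtered_search(embeddings, query, min_score=0.15, limit=3):
    """SQL query that combines similar() with a minimum score filter."""
    # The query text is passed as a bind parameter rather than pasted into
    # the SQL string, which avoids quoting problems with user input.
    sql = (
        "select id, text, score from txtai "
        f"where similar(:q) and score >= {float(min_score)}"
    )
    return embeddings.search(sql, limit, parameters={"q": query})


def batch_search(embeddings, queries, limit=3):
    """Run several queries in one call; returns one result list per query."""
    return embeddings.batchsearch(list(queries), limit)
```

With content enabled, `filtered_search(embeddings, "refund policy")` would be expected to return dictionaries containing the selected columns, while `semantic_search` on an index without content returns (id, score) tuples.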
Step 5: Update the Index (Optional)
Perform incremental updates using upsert() to add or modify documents without full re-indexing. Use delete() to remove documents. Call reindex() to rebuild with different settings or after schema changes.
Key considerations:
- upsert() supports the same input formats as index() (strings, tuples, dictionaries)
- Deleted document IDs are tracked and excluded from search results
- reindex() allows changing the vector model or ANN backend on an existing database; because it regenerates vectors from stored documents, it requires content storage to be enabled
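The update operations above might be combined as follows. This is a sketch with hypothetical helper names (`apply_changes`, `switch_model`); it assumes the change sets are small enough to hold in a list and that the index was built with content storage, which reindex() depends on.

```python
def apply_changes(embeddings, changed=(), removed=()):
    """Upsert new or modified documents, then delete removed ids.

    Returns the number of documents upserted and ids submitted for deletion.
    """
    # Materialized as lists so they can be counted; for very large change
    # sets, pass a generator straight to upsert() instead.
    changed, removed = list(changed), list(removed)
    if changed:
        embeddings.upsert(changed)
    if removed:
        embeddings.delete(removed)
    return len(changed), len(removed)


def switch_model(embeddings, path):
    """Rebuild the vector index with a different model, keeping stored content."""
    # The new settings are passed as a configuration dictionary.
    embeddings.reindex({"path": path})
```

After either helper runs, the index needs to be saved again for the changes to persist across restarts.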