Workflow:Neuml Txtai Semantic Search

From Leeroopedia


Knowledge Sources
Domains: Semantic_Search, NLP, Embeddings
Last Updated: 2026-02-10 00:00 GMT

Overview

End-to-end process for building and querying a semantic search index using txtai's Embeddings class.

Description

This workflow outlines the standard procedure for creating a semantic search system that finds documents by meaning rather than exact keyword matches. Text data is transformed into dense vector embeddings using a transformer model, then indexed using an approximate nearest neighbor (ANN) backend. Queries are vectorized using the same model and matched against the index to retrieve the most semantically similar results. The system supports two modes: index-only search, which returns only IDs and scores, and content-enabled search, which stores full documents and allows SQL-like filtering.
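The matching step can be illustrated without any model at all. The sketch below is not txtai code: it uses hand-made three-dimensional vectors (the document IDs and values are invented for illustration) and an exhaustive scan, whereas a real index holds model-generated vectors and uses an ANN backend to avoid comparing against every entry.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy "embeddings"; a real model emits hundreds of dimensions
index = {
    "doc-wildlife": [0.9, 0.1, 0.0],
    "doc-finance": [0.0, 0.2, 0.9],
    "doc-weather": [0.3, 0.9, 0.1],
}

def search(query_vector, limit=2):
    """Score every stored vector against the query and keep the best matches."""
    scored = [(uid, cosine(query_vector, vector)) for uid, vector in index.items()]
    return sorted(scored, key=lambda row: row[1], reverse=True)[:limit]

# The query vector points mostly along the first axis, like "doc-wildlife"
results = search([0.8, 0.2, 0.1])
```

Here `results` is a list of (id, score) pairs, the same shape that index-only search returns.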

Usage

Execute this workflow when you have a collection of text documents (articles, records, descriptions, etc.) and need to find entries that are semantically related to a natural language query. This is the foundational workflow for txtai and underpins the RAG and Agent workflows.

Execution Steps

Step 1: Configure Embeddings

Define the embeddings configuration, passed either as a dictionary or as keyword arguments. Key settings include the transformer model path, the content storage flag, the ANN backend selection, and optional hybrid scoring.

Key considerations:

  • Choose an appropriate sentence-transformer or embedding model for your domain
  • Enable content storage if you need to retrieve full document text alongside search results
  • Configure hybrid search (dense + sparse) for improved precision on keyword-heavy queries
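A minimal configuration sketch covering the settings named in this step. The model name is an illustrative choice rather than a recommendation from this workflow, and `create_embeddings` is a hypothetical helper; the import is deferred inside it only so the configuration can be inspected without txtai installed.

```python
# Settings named in this step; values are illustrative assumptions
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # transformer model path
    "content": True,     # store document text alongside the vectors
    "backend": "faiss",  # ANN backend selection
    "hybrid": True,      # combine dense vectors with sparse keyword scoring
}

def create_embeddings(settings):
    """Hypothetical helper: build an Embeddings instance from a config dictionary."""
    # Deferred import so this module loads even where txtai is not installed
    from txtai.embeddings import Embeddings
    return Embeddings(settings)
```

Calling `create_embeddings(config)` typically downloads the model on first use, so it is best done once at application start.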

Step 2: Prepare Documents

Format input documents as an iterable of tuples or plain strings. The Embeddings class accepts data in three formats: (id, data, tags), (id, data), or plain data strings. When plain strings are provided, auto-generated sequential IDs are assigned.

Key considerations:

  • Use explicit IDs when documents need to be updated or deleted later
  • Tags enable metadata filtering in SQL-style queries
  • Documents can be streamed from generators for memory-efficient processing of large datasets
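The considerations above can be sketched as a generator that yields (id, data, tags) tuples one at a time. The record layout and the `to_documents` name are assumptions made for illustration; any source that can be iterated lazily (a file, a database cursor) would fit the same pattern.

```python
def to_documents(records):
    """Yield (id, data, tags) tuples lazily so large inputs never sit fully in memory."""
    for record in records:
        # Explicit IDs make later updates and deletes possible; tags are optional
        yield (record["id"], record["text"], record.get("tags"))

# Hypothetical input records
records = [
    {"id": "a1", "text": "Local bakery wins regional bread award", "tags": "news"},
    {"id": "a2", "text": "Heavy rain expected across the coast this weekend"},
]

documents = to_documents(records)
```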

Step 3: Build the Index

Call the index method on the Embeddings instance to transform all documents into vectors and build the ANN index. This step vectorizes each document through the configured model, stores vectors in the ANN backend, and optionally persists document content in a SQLite or DuckDB database.

What happens:

  • Documents are streamed and normalized into a consistent internal format
  • Each document's text is transformed into a dense embedding vector
  • Vectors are loaded into the ANN index (Faiss, Annoy, Hnswlib, or other configured backend)
  • If content storage is enabled, document text and metadata are stored in the database
  • Optional sparse scoring index (BM25/TF-IDF) is built for hybrid search
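From the caller's side, all of the above happens behind a single call. The sketch below assumes an already configured Embeddings instance and uses its `index` and `count` methods; `build_index` itself is a hypothetical wrapper.

```python
def build_index(embeddings, documents):
    """Hypothetical wrapper: index all documents and report how many were stored."""
    # index() vectorizes the documents and builds the ANN index in one pass
    embeddings.index(documents)
    # count() reports the number of entries now held by the index
    return embeddings.count()
```

Note that `index` builds the index from scratch. To add or change documents in an existing index, txtai also provides `upsert`.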

Step 4: Query the Index

Run semantic search queries against the built index. The search method accepts a natural language query string and returns the most similar documents ranked by cosine similarity score. When content storage is enabled, queries can use SQL-style syntax to combine semantic search with metadata filters.

Query modes:

  • Simple vector search returns (id, score) tuples
  • Content-enabled search returns full document dictionaries with text and metadata
  • SQL-style queries enable filters like "select text, score from txtai where similar('query') and tags = 'category'"
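The query modes can be sketched as follows. This assumes content storage is enabled and a txtai release recent enough to accept bind parameters through the `parameters` argument of `search`; binding values this way avoids quoting problems when the query text contains apostrophes. The helper names are hypothetical.

```python
def filtered_query(text, tag):
    """Build a SQL-style query plus bind parameters for a tag-filtered semantic search."""
    sql = "select id, text, score from txtai where similar(:q) and tags = :tag"
    return sql, {"q": text, "tag": tag}

def run_queries(embeddings, text, tag, limit=3):
    """Run a plain semantic search and a tag-filtered one against the same index."""
    # Plain natural language query
    plain = embeddings.search(text, limit)
    # SQL-style query; requires content storage to be enabled
    sql, parameters = filtered_query(text, tag)
    filtered = embeddings.search(sql, limit, parameters=parameters)
    return plain, filtered
```

With content storage enabled, both calls return dictionaries. On an index without content storage, only the plain query applies and it returns (id, score) tuples.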

Step 5: Save and Load the Index

Persist the index to disk or cloud storage for reuse. The save method writes all index components (ANN index, database, configuration) to a directory. The load method restores a previously saved index. Indexes can also be published to and loaded from Hugging Face Hub.

Key considerations:

  • Save indexes to avoid rebuilding on every application restart
  • Cloud storage providers (Hugging Face Hub, S3 via libcloud) enable sharing indexes
  • Indexes can be archived as compressed tar or zip files
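A minimal save and reload sketch using the `save` and `load` methods of the Embeddings class. The helper names and paths are illustrative assumptions, and the import is deferred so the snippet loads without txtai installed.

```python
def save_index(embeddings, path):
    """Hypothetical helper: write all index components to the given path."""
    # A directory path writes plain files; an archive-style path such as
    # "index.tar.gz" is assumed here to produce a compressed archive
    embeddings.save(path)
    return path

def load_index(path):
    """Hypothetical helper: restore a previously saved index into a fresh instance."""
    from txtai.embeddings import Embeddings
    embeddings = Embeddings()
    embeddings.load(path)
    return embeddings
```

The saved configuration travels with the index, so the loading side does not need to repeat the settings chosen in Step 1.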
