
Implementation:Neuml Txtai Embeddings Index For RAG

From Leeroopedia


Knowledge Sources
Domains NLP, RAG
Last Updated 2026-02-09 00:00 GMT

Overview

Concrete tool for building content-enabled embeddings indexes suitable for RAG pipelines, provided by the txtai library.

Description

The Embeddings class is the core indexing engine in txtai. For RAG use cases, it must be configured with content=True so that the original document text is stored in a SQLite database alongside the dense vector index. This dual storage enables the RAG pipeline to retrieve full text passages (not just IDs and scores) when searching for context.

Index building proceeds through the Embeddings.index(documents) method, which accepts an iterable of documents. Documents can be provided as (id, text, tags) tuples, (id, text) tuples, or plain strings (in which case auto-generated integer IDs are assigned). Internally, the method transforms each document into a dense vector using the configured embedding model, builds an approximate nearest neighbor (ANN) index over the vectors, and stores the original text in the content database.

The configuration dictionary controls the embedding model, ANN backend, scoring, graph, and content storage settings. The content: True flag is the critical setting that distinguishes a RAG-ready index from a pure vector index. Additional options include path (the sentence-transformer model), backend (the ANN library, such as faiss or hnsw), and hybrid search settings.
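A fuller configuration dictionary might look like the following sketch; the backend and hybrid values are illustrative choices, not required settings:

```python
# Illustrative RAG-ready configuration
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # embedding model
    "backend": "faiss",                                # ANN library
    "content": True,                                   # store text for RAG (critical)
    "hybrid": True,                                    # combine sparse + dense scoring
}
```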

Usage

Use Embeddings index building for RAG when you need to:

  • Create a searchable knowledge base that returns full document text for RAG context.
  • Index text chunks produced by Textractor into a vector + content store.
  • Build an index that supports batchsearch returning dictionaries with id, text, and score fields.
  • Prepare an index for use with the txtai RAG pipeline.

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/embeddings/base.py
  • Lines: L30-153

Signature

class Embeddings:
    def __init__(self, config=None, models=None, **kwargs):
        """
        Creates a new embeddings index.

        Args:
            config: embeddings configuration
            models: models cache, used for model sharing between embeddings
            kwargs: additional configuration as keyword args
        """
        ...

    def index(self, documents, reindex=False, checkpoint=None):
        """
        Builds an embeddings index. This method overwrites an existing index.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
            reindex: if this is a reindex operation, defaults to False
            checkpoint: optional checkpoint directory for restart
        """
        ...

Import

from txtai.embeddings import Embeddings

I/O Contract

Inputs

Name Type Required Description
config dict Yes Configuration dictionary. Must include content: True for RAG. Common keys: path (embedding model), backend (ANN library), content (enable document storage).
documents iterable Yes Iterable of (id, text, tags), (id, text), or plain str. Typically chunked text from Textractor.
reindex bool No If True, skips database creation (reindexes existing content). Default: False
checkpoint str or None No Directory path for indexing checkpoints to enable restart. Default: None

Outputs

Name Type Description
embeddings Embeddings (mutated in place) Fully built in-memory index with ANN index for vector search and SQLite content database for text retrieval. Search results return dictionaries with id, text, and score fields.

Usage Examples

Basic Example: Build a RAG Index from Text Chunks

from txtai.embeddings import Embeddings

# Create embeddings with content storage enabled (required for RAG)
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index text chunks (auto-assigned integer IDs)
chunks = [
    "Retrieval augmented generation combines search with LLMs.",
    "Embeddings transform text into dense vector representations.",
    "Approximate nearest neighbor indexes enable fast similarity search.",
]

embeddings.index(chunks)

# Search returns dictionaries with id, text, and score
results = embeddings.search("How does RAG work?", limit=2)
for result in results:
    print(f"Score: {result['score']:.4f} - {result['text']}")

Indexing with Explicit IDs

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index with (id, text) tuples for explicit ID control
documents = [
    ("doc-001", "Machine learning automates analytical model building."),
    ("doc-002", "Neural networks are inspired by biological neural networks."),
    ("doc-003", "Deep learning uses multiple layers to learn representations."),
]

embeddings.index(documents)

Full RAG Index Pipeline with Textractor

import glob
from txtai.embeddings import Embeddings
from txtai.pipeline import Textractor

# Step 1: Collect documents
files = glob.glob("/data/knowledge_base/**/*.pdf", recursive=True)

# Step 2: Extract and chunk text
textractor = Textractor(paragraphs=True, minlength=100)
chunks = []
for filepath in files:
    result = textractor(filepath)
    if isinstance(result, list):
        chunks.extend(result)
    else:
        chunks.append(result)

# Step 3: Build RAG-ready index
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})
embeddings.index(chunks)

# Verify: search returns full text
results = embeddings.search("project requirements", limit=3)
for result in results:
    print(f"[{result['score']:.3f}] {result['text'][:100]}...")

Saving and Loading an Index

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

embeddings.index(["First chunk.", "Second chunk.", "Third chunk."])

# Save index to disk
embeddings.save("/data/indexes/rag_index")

# Load index later
embeddings_loaded = Embeddings()
embeddings_loaded.load("/data/indexes/rag_index")

results = embeddings_loaded.search("chunk", limit=3)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
