Implementation: NeuML txtai Embeddings Index for RAG
| Knowledge Sources | |
|---|---|
| Domains | NLP, RAG |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for building content-enabled embeddings indexes suitable for RAG pipelines, provided by the txtai library.
Description
The Embeddings class is the core indexing engine in txtai. For RAG use cases, it must be configured with content=True so that the original document text is stored in a SQLite database alongside the dense vector index. This dual storage enables the RAG pipeline to retrieve full text passages (not just IDs and scores) when searching for context.
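The difference this makes can be sketched without txtai itself: a pure vector index returns only IDs and scores, while a content-enabled index returns full result dictionaries. The values below are illustrative shapes, not real model output:

```python
# Illustrative result shapes only; IDs and scores are made up, not real txtai output.

# Without content storage (content=False), a search yields only (id, score) pairs:
vector_only_results = [("2", 0.71), ("0", 0.64)]

# With content=True, a search yields dictionaries that include the stored text,
# which is what a RAG pipeline passes to the LLM as context:
rag_ready_results = [
    {"id": "2", "text": "Approximate nearest neighbor indexes enable fast similarity search.", "score": 0.71},
    {"id": "0", "text": "Retrieval augmented generation combines search with LLMs.", "score": 0.64},
]

# Only the content-enabled form carries the passage text needed for prompting
context = "\n".join(result["text"] for result in rag_ready_results)
```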
Index building proceeds through the Embeddings.index(documents) method, which accepts an iterable of documents. Documents can be provided as (id, text, tags) tuples, (id, text) tuples, or plain strings (in which case auto-generated integer IDs are assigned). Internally, the method transforms each document into a dense vector using the configured embedding model, builds an approximate nearest neighbor (ANN) index over the vectors, and stores the original text in the content database.
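The three accepted document forms can be pictured as reducing to uniform (id, text, tags) tuples. The helper below is a hypothetical standalone sketch of that normalization (it is not part of txtai's API), mirroring the auto-assigned integer IDs described above:

```python
# Hypothetical helper, not part of txtai: shows how the three accepted
# document forms map onto uniform (id, text, tags) tuples.
def normalize(documents):
    rows = []
    for index, document in enumerate(documents):
        if isinstance(document, tuple) and len(document) == 3:
            rows.append(document)                          # (id, text, tags) passes through
        elif isinstance(document, tuple) and len(document) == 2:
            rows.append((document[0], document[1], None))  # (id, text) gets empty tags
        else:
            rows.append((index, document, None))           # plain str gets auto integer id
    return rows

docs = [
    "Plain string chunk.",
    ("doc-1", "Tuple with explicit id."),
    ("doc-2", "Tuple with id and tags.", "tag-a"),
]
rows = normalize(docs)
```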
The configuration dictionary controls the embedding model, ANN backend, scoring, graph, and content storage settings. The content: True flag is the critical setting that distinguishes a RAG-ready index from a pure vector index. Additional options include path (the sentence-transformer model), backend (the ANN library, such as faiss or hnsw), and hybrid search settings.
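As a concrete reference point, a RAG-oriented configuration dictionary might look like the following. The model and backend choices are illustrative, not requirements; `hybrid` is txtai's flag for combined sparse/dense scoring:

```python
# Example RAG-ready configuration; model and backend choices are illustrative.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",  # sentence-transformers embedding model
    "backend": "faiss",                                # ANN library (faiss is the txtai default)
    "content": True,                                   # REQUIRED for RAG: stores text in SQLite
    "hybrid": True,                                    # optional: combine sparse + dense scoring
}
```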
Usage
Use Embeddings index building for RAG when you need to:
- Create a searchable knowledge base that returns full document text for RAG context.
- Index text chunks produced by Textractor into a vector + content store.
- Build an index that supports `batchsearch`, returning dictionaries with `id`, `text`, and `score` fields.
- Prepare an index for use with the txtai `RAG` pipeline.
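The `batchsearch` contract referenced above can be sketched as plain data: each query in the batch yields its own list of result dictionaries, aligned by position. The values here are illustrative only:

```python
# Illustrative shape of a batchsearch result over two queries; values are made up.
queries = ["What is RAG?", "What is an ANN index?"]
batch_results = [
    [{"id": "0", "text": "Retrieval augmented generation combines search with LLMs.", "score": 0.68}],
    [{"id": "2", "text": "Approximate nearest neighbor indexes enable fast similarity search.", "score": 0.73}],
]

# One result list per query, aligned by position
for query, results in zip(queries, batch_results):
    top = results[0]
    print(f"{query} -> [{top['score']:.2f}] {top['text'][:40]}")
```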
Code Reference
Source Location
- Repository: txtai
- File: `src/python/txtai/embeddings/base.py`
- Lines: L30-153
Signature
```python
class Embeddings:
    def __init__(self, config=None, models=None, **kwargs):
        """
        Creates a new embeddings index.

        Args:
            config: embeddings configuration
            models: models cache, used for model sharing between embeddings
            kwargs: additional configuration as keyword args
        """
        ...

    def index(self, documents, reindex=False, checkpoint=None):
        """
        Builds an embeddings index. This method overwrites an existing index.

        Args:
            documents: iterable of (id, data, tags), (id, data) or data
            reindex: if this is a reindex operation, defaults to False
            checkpoint: optional checkpoint directory for restart
        """
        ...
```
Import
```python
from txtai.embeddings import Embeddings
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Configuration dictionary. Must include `content: True` for RAG. Common keys: `path` (embedding model), `backend` (ANN library), `content` (enable document storage). |
| documents | iterable | Yes | Iterable of (id, text, tags), (id, text), or plain str. Typically chunked text from Textractor. |
| reindex | bool | No | If True, skips database creation and reindexes existing content. Default: False |
| checkpoint | str or None | No | Directory path for indexing checkpoints to enable restart. Default: None |
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings | Embeddings (mutated in place) | Fully built in-memory index with an ANN index for vector search and a SQLite content database for text retrieval. Search results are dictionaries with `id`, `text`, and `score` fields. |
Usage Examples
Basic Example: Build a RAG Index from Text Chunks
```python
from txtai.embeddings import Embeddings

# Create embeddings with content storage enabled (required for RAG)
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index text chunks (auto-assigned integer IDs)
chunks = [
    "Retrieval augmented generation combines search with LLMs.",
    "Embeddings transform text into dense vector representations.",
    "Approximate nearest neighbor indexes enable fast similarity search.",
]
embeddings.index(chunks)

# Search returns dictionaries with id, text, and score
results = embeddings.search("How does RAG work?", limit=2)
for result in results:
    print(f"Score: {result['score']:.4f} - {result['text']}")
```
Indexing with Explicit IDs
```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index with (id, text) tuples for explicit ID control
documents = [
    ("doc-001", "Machine learning automates analytical model building."),
    ("doc-002", "Neural networks are inspired by biological neural networks."),
    ("doc-003", "Deep learning uses multiple layers to learn representations."),
]
embeddings.index(documents)
```
Full RAG Index Pipeline with Textractor
```python
import glob

from txtai.embeddings import Embeddings
from txtai.pipeline import Textractor

# Step 1: Collect documents
files = glob.glob("/data/knowledge_base/**/*.pdf", recursive=True)

# Step 2: Extract and chunk text
textractor = Textractor(paragraphs=True, minlength=100)
chunks = []
for filepath in files:
    result = textractor(filepath)
    if isinstance(result, list):
        chunks.extend(result)
    else:
        chunks.append(result)

# Step 3: Build RAG-ready index
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})
embeddings.index(chunks)

# Verify: search returns full text
results = embeddings.search("project requirements", limit=3)
for result in results:
    print(f"[{result['score']:.3f}] {result['text'][:100]}...")
```
Saving and Loading an Index
```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})
embeddings.index(["First chunk.", "Second chunk.", "Third chunk."])

# Save index to disk
embeddings.save("/data/indexes/rag_index")

# Load index later
embeddings_loaded = Embeddings()
embeddings_loaded.load("/data/indexes/rag_index")
results = embeddings_loaded.search("chunk", limit=3)
```