
Implementation:Neuml Txtai Embeddings Index

From Leeroopedia


Knowledge Sources
Domains Semantic_Search, NLP
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool from the txtai library for building a vector index from a document collection.

Description

The Embeddings.index method builds a complete embeddings index from a collection of documents. It runs the full indexing pipeline: it initializes index state (creating the database, scoring, subindex, and graph instances), normalizes input documents through a Stream, transforms them into dense vectors via a Transform, optionally applies PCA dimensionality reduction, constructs an ANN index from the vectors, and populates all auxiliary data structures. This method overwrites any existing index. To manage memory on large corpora, the vectors are buffered as a numpy array in a temporary file.
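The normalization step can be pictured with a small self-contained sketch. This is an illustrative analogy, not txtai's actual Stream code: every accepted input shape is coerced to a canonical (id, data, tags) tuple, and bare strings receive sequential integer IDs.

```python
# Illustrative sketch of Stream-style normalization (NOT txtai's actual code).
# Accepts (id, data, tags) tuples, (id, data) tuples, dicts, or plain strings
# and yields canonical (id, data, tags) tuples.
def normalize(documents):
    autoid = 0
    for document in documents:
        if isinstance(document, tuple):
            # (id, data) -> (id, data, None); (id, data, tags) passes through
            yield document if len(document) == 3 else (document[0], document[1], None)
        elif isinstance(document, dict):
            # Dicts carry their own id field; the dict itself is the data payload
            yield (document.get("id"), document, None)
        else:
            # Plain values get sequential integer ids
            yield (autoid, document, None)
            autoid += 1

print(list(normalize(["a", ("x", "b"), {"id": "doc1", "text": "c"}])))
# [(0, 'a', None), ('x', 'b', None), ('doc1', {'id': 'doc1', 'text': 'c'}, None)]
```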

Usage

Use this method to build a new embeddings index from scratch. It is appropriate for initial index creation or when the entire corpus needs to be reindexed. For adding documents to an existing index, use upsert() instead. The reindex parameter is used internally when reconstructing an index from a stored database with a new configuration.
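The overwrite-versus-merge distinction between index() and upsert() can be illustrated with a toy in-memory store. The class below is hypothetical and shares only the method names with txtai, but it shows the behavioral contract: index() discards all prior state, while upsert() merges new documents into it.

```python
# Toy illustration of overwrite (index) vs merge (upsert) semantics.
# ToyIndex is hypothetical; txtai's real Embeddings class works differently inside.
class ToyIndex:
    def __init__(self):
        self.docs = {}

    def index(self, documents):
        # index() rebuilds from scratch, discarding any existing state
        self.docs = dict(documents)

    def upsert(self, documents):
        # upsert() inserts new ids and updates existing ones in place
        self.docs.update(documents)

toy = ToyIndex()
toy.index([(0, "first"), (1, "second")])
toy.upsert([(1, "second, revised"), (2, "third")])
print(sorted(toy.docs))   # [0, 1, 2]

toy.index([(9, "fresh start")])
print(sorted(toy.docs))   # [9] -- earlier documents are gone
```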

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/embeddings/base.py
  • Lines: L103-153

Signature

def index(self, documents, reindex=False, checkpoint=None):

Import

from txtai.embeddings import Embeddings

I/O Contract

Inputs

  • documents (iterable, required) — An iterable of documents in one of several formats: (id, data, tags) tuples, (id, data) tuples, dict objects with named fields, or plain str values. The Stream normalizer converts all formats into a canonical internal representation; plain strings are assigned sequential integer IDs automatically.
  • reindex (bool, optional) — When True, indicates a reindex operation: the existing database is preserved and only the vector index is rebuilt, so the database creation step is skipped. Defaults to False.
  • checkpoint (str or None, optional) — Path to a checkpoint directory. When provided, enables indexing restarts: if the operation is interrupted and restarted with the same checkpoint path, it resumes from where it left off. Defaults to None (no checkpointing).

Outputs

  • Return value — None. The method operates via side effects, populating: self.ann (ANN index over document embeddings), self.ids (ID mapping list, only when content storage is disabled and not reindexing), self.config["dimensions"] (vector dimensionality), self.reducer (PCA model, if configured), self.scoring (sparse index, if configured), self.indexes (subindexes, if configured), self.graph (graph index, if configured), and self.database (document database, when content is enabled).
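The optional reduction step that populates self.reducer conceptually projects document vectors onto their top principal components. The numpy sketch below approximates that idea; it is not txtai's actual reducer implementation, and the dimensions shown are illustrative.

```python
import numpy as np

# Conceptual sketch of the PCA reduction step (not txtai's actual reducer).
def fit_pca(vectors, components):
    centered = vectors - vectors.mean(axis=0)
    # SVD of the centered matrix; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:components]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 384))   # e.g. 384-dim MiniLM embeddings
pca = fit_pca(vectors, components=64)
reduced = vectors @ pca.T               # project onto the top 64 components
print(reduced.shape)
# (100, 64)
```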

Usage Examples

Basic Example

from txtai.embeddings import Embeddings

# Create embeddings instance
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index documents as (id, text, None) tuples
documents = [
    (0, "US tops 5 million confirmed virus cases", None),
    (1, "Canada's last fully intact ice shelf has suddenly collapsed", None),
    (2, "Beijing launches high-tech citywide expenses tracking", None),
    (3, "The National Park Service warns against sacrificing slower friends", None),
    (4, "Maine moose are getting ticks at an alarming rate", None)
]

embeddings.index(documents)

# Verify the index was built
print(embeddings.count())
# Output: 5

Example: Indexing Plain Strings

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2"
})

# Plain strings get auto-assigned integer IDs (0, 1, 2, ...)
embeddings.index([
    "Semantic search with deep learning",
    "Natural language processing advances",
    "Vector databases for AI applications"
])

print(embeddings.count())
# Output: 3

Example: Indexing Dictionaries

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index dictionaries with custom fields
documents = [
    {"id": "doc1", "text": "Machine learning fundamentals", "category": "AI"},
    {"id": "doc2", "text": "Database optimization techniques", "category": "DB"},
    {"id": "doc3", "text": "Neural network architectures", "category": "AI"}
]

embeddings.index(documents)

# Search returns dict results when content is enabled
results = embeddings.search("deep learning", 2)
print(results)

Example: Indexing with Checkpoint

from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Large dataset with checkpoint for restart capability
def generate_documents():
    for i in range(100000):
        yield (i, f"Document number {i} with searchable content", None)

embeddings.index(generate_documents(), checkpoint="/tmp/index_checkpoint")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
