# Implementation: NeuML txtai Embeddings.index
| Knowledge Sources | |
|---|---|
| Domains | Semantic Search, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
A concrete tool from the txtai library for building vector indexes from document collections.
## Description
The Embeddings.index method builds a complete embeddings index from a collection of documents, performing the full indexing pipeline:

- Initializes index state (database, scoring, subindexes, and graph instances)
- Normalizes input documents through a Stream
- Transforms documents into dense vectors via a Transform
- Optionally applies PCA dimensionality reduction
- Constructs an ANN index from the vectors
- Populates all auxiliary data structures

This method overwrites any existing index. To manage memory for large corpora, the numpy vector array is buffered through a temporary file during the build.
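The transform-and-index stages above can be sketched with a toy, dependency-free model. Note that `vectorize` and `BruteForceANN` below are illustrative stand-ins, not txtai internals (txtai uses real embedding models and ANN backends such as Faiss):

```python
import math

def vectorize(text):
    """Toy stand-in for txtai's Transform: hash words into a small dense vector."""
    vector = [0.0] * 8
    for word in text.lower().split():
        vector[hash(word) % 8] += 1.0
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

class BruteForceANN:
    """Toy stand-in for the ANN stage: exact cosine search over stored vectors."""

    def __init__(self):
        self.ids, self.vectors = [], []

    def index(self, items):
        for uid, vector in items:
            self.ids.append(uid)
            self.vectors.append(vector)

    def search(self, query, limit):
        scores = [(uid, sum(q * v for q, v in zip(query, vector)))
                  for uid, vector in zip(self.ids, self.vectors)]
        return sorted(scores, key=lambda item: -item[1])[:limit]

# Pipeline: (id, text, tags) tuples -> vectors -> index construction (PCA omitted)
documents = [(0, "semantic search with embeddings", None),
             (1, "natural language processing", None)]

ann = BruteForceANN()
ann.index((uid, vectorize(text)) for uid, text, _ in documents)

top = ann.search(vectorize("semantic search with embeddings"), 1)
print(top[0][0])  # id 0: the exact match scores highest
```

The real method follows the same shape, but streams documents in batches and writes vectors to a temporary file rather than holding them all in memory.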
## Usage
Use this method to build a new embeddings index from scratch. It is appropriate for initial index creation or when the entire corpus needs to be reindexed. For adding documents to an existing index, use upsert() instead. The reindex parameter is used internally when reconstructing an index from a stored database with a new configuration.
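The practical difference between rebuilding and appending can be shown with a minimal toy model of the two semantics (a dict-backed sketch, not txtai's implementation):

```python
class ToyIndex:
    """Toy model of index() vs upsert() semantics only."""

    def __init__(self):
        self.store = {}

    def index(self, documents):
        # index() rebuilds from scratch: any existing entries are discarded
        self.store = dict(documents)

    def upsert(self, documents):
        # upsert() merges: existing ids are updated, new ids are added
        self.store.update(dict(documents))

    def count(self):
        return len(self.store)

idx = ToyIndex()
idx.index([(0, "first"), (1, "second")])
idx.upsert([(1, "second, revised"), (2, "third")])
print(idx.count())  # 3: id 1 updated in place, id 2 added

idx.index([(9, "fresh start")])
print(idx.count())  # 1: index() discarded the previous contents
```

Calling `index()` on a populated Embeddings instance behaves like the second call above: everything previously indexed is replaced.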
## Code Reference

### Source Location

- Repository: txtai
- File: `src/python/txtai/embeddings/base.py` (lines 103-153)
### Signature

```python
def index(self, documents, reindex=False, checkpoint=None):
```

### Import

```python
from txtai.embeddings import Embeddings
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| documents | iterable | Yes | An iterable of documents in one of several formats: (id, data, tags) tuples, (id, data) tuples, dict objects with named fields, or plain str values. The Stream normalizer converts all formats into a canonical internal representation. For plain strings, sequential integer IDs are assigned automatically. |
| reindex | bool | No | When True, indicates this is a reindex operation where the existing database should be preserved and only the vector index rebuilt. The database creation step is skipped. Defaults to False. |
| checkpoint | str or None | No | Path to a checkpoint directory. When provided, enables indexing restart capability. If the operation is interrupted and restarted with the same checkpoint path, it resumes from where it left off. Defaults to None (no checkpointing). |
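The four accepted input formats can be reduced to one canonical shape. A dependency-free sketch of that normalization (illustrative only; txtai's Stream class does the real work, and exactly how it derives ids from dicts is an assumption here):

```python
def normalize(documents):
    """Illustrative sketch: yield canonical (id, data, tags) tuples."""
    autoid = 0
    for document in documents:
        if isinstance(document, str):
            yield (autoid, document, None)          # plain string: auto-assigned integer id
            autoid += 1
        elif isinstance(document, dict):
            yield (document["id"], document, None)  # dict: named fields kept as data
        elif len(document) == 2:
            yield (document[0], document[1], None)  # (id, data) tuple
        else:
            yield tuple(document)                   # already (id, data, tags)

print(list(normalize([
    "plain string",
    ("a", "pair tuple"),
    ("b", "full tuple", None),
    {"id": "c", "text": "dict document"},
])))
```

All four formats can be mixed in a single call, since each item is normalized independently.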
### Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | This method returns None. It operates via side effects, populating: self.ann (ANN index with document embeddings), self.ids (ID mapping list, only when content storage is disabled and not reindexing), self.config["dimensions"] (vector dimensionality), self.reducer (PCA model if configured), self.scoring (sparse index if configured), self.indexes (subindexes if configured), self.graph (graph index if configured), self.database (document database if content is enabled). |
## Usage Examples

### Basic Example

```python
from txtai.embeddings import Embeddings

# Create embeddings instance
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index documents as (id, text, None) tuples
documents = [
    (0, "US tops 5 million confirmed virus cases", None),
    (1, "Canada's last fully intact ice shelf has suddenly collapsed", None),
    (2, "Beijing launches high-tech citywide expenses tracking", None),
    (3, "The National Park Service warns against sacrificing slower friends", None),
    (4, "Maine moose are getting ticks at an alarming rate", None)
]

embeddings.index(documents)

# Verify the index was built
print(embeddings.count())
# Output: 5
```
### Example: Indexing Plain Strings

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2"
})

# Plain strings get auto-assigned integer IDs (0, 1, 2, ...)
embeddings.index([
    "Semantic search with deep learning",
    "Natural language processing advances",
    "Vector databases for AI applications"
])

print(embeddings.count())
# Output: 3
```
### Example: Indexing Dictionaries

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index dictionaries with custom fields
documents = [
    {"id": "doc1", "text": "Machine learning fundamentals", "category": "AI"},
    {"id": "doc2", "text": "Database optimization techniques", "category": "DB"},
    {"id": "doc3", "text": "Neural network architectures", "category": "AI"}
]

embeddings.index(documents)

# Search returns dict results when content is enabled
results = embeddings.search("deep learning", 2)
print(results)
```
### Example: Indexing with Checkpoint

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Large dataset with checkpoint for restart capability
def generate_documents():
    for i in range(100000):
        yield (i, f"Document number {i} with searchable content", None)

embeddings.index(generate_documents(), checkpoint="/tmp/index_checkpoint")
```
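The restart semantics can be illustrated with a toy resumable batch loop. The file layout and `index_with_checkpoint` function below are illustrative, not txtai's checkpoint format:

```python
import json
import os
import tempfile

def index_with_checkpoint(documents, process, checkpoint=None):
    """Process documents in order, recording progress so an interrupted
    run restarted with the same checkpoint path skips completed work."""
    start = 0
    if checkpoint and os.path.exists(checkpoint):
        with open(checkpoint) as f:
            start = json.load(f)["completed"]        # resume point

    for position in range(start, len(documents)):
        process(documents[position])
        if checkpoint:
            with open(checkpoint, "w") as f:
                json.dump({"completed": position + 1}, f)

    return len(documents) - start                    # documents processed this run

processed = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# First run "crashes" partway: simulate by passing only the first two documents
index_with_checkpoint(["d0", "d1"], processed.append, checkpoint=path)

# Restarted run resumes after the recorded position instead of starting over
done = index_with_checkpoint(["d0", "d1", "d2", "d3"], processed.append, checkpoint=path)
print(done, processed)  # 2 ['d0', 'd1', 'd2', 'd3']
```

Without a checkpoint path the loop always starts from zero, which mirrors the default `checkpoint=None` behavior.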