# Implementation: NeuML txtai Embeddings.index
| Knowledge Sources | |
|---|---|
| Domains | Semantic Search, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
A concrete tool from the txtai library for building vector indexes from document collections.
## Description
The Embeddings.index method builds a complete embeddings index from a collection of documents, performing the full indexing pipeline:

- Initializes index state (database, scoring, subindexes, and graph instances)
- Normalizes input documents through a Stream
- Transforms documents into dense vectors via a Transform
- Optionally applies PCA dimensionality reduction
- Constructs an ANN index from the vectors
- Populates all auxiliary data structures

This method overwrites any existing index. To manage memory for large corpora, the numpy vector array is buffered through a temporary file during the build.
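The transform-and-index stages above can be sketched with a toy, dependency-free model. Note that `vectorize` and `BruteForceANN` below are illustrative stand-ins, not txtai internals (txtai uses real embedding models and ANN backends such as Faiss):

```python
import math

def vectorize(text):
    """Toy stand-in for txtai's Transform: hash words into a small dense vector."""
    vector = [0.0] * 8
    for word in text.lower().split():
        vector[hash(word) % 8] += 1.0
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

class BruteForceANN:
    """Toy stand-in for the ANN stage: exact cosine search over stored vectors."""

    def __init__(self):
        self.ids, self.vectors = [], []

    def index(self, items):
        for uid, vector in items:
            self.ids.append(uid)
            self.vectors.append(vector)

    def search(self, query, limit):
        scores = [(uid, sum(q * v for q, v in zip(query, vector)))
                  for uid, vector in zip(self.ids, self.vectors)]
        return sorted(scores, key=lambda item: -item[1])[:limit]

# Pipeline: (id, text, tags) tuples -> vectors -> index construction (PCA omitted)
documents = [(0, "semantic search with embeddings", None),
             (1, "natural language processing", None)]

ann = BruteForceANN()
ann.index((uid, vectorize(text)) for uid, text, _ in documents)

top = ann.search(vectorize("semantic search with embeddings"), 1)
print(top[0][0])  # id 0: the exact match scores highest
```

The real method follows the same shape, but streams documents in batches and writes vectors to a temporary file rather than holding them all in memory.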
## Usage
Use this method to build a new embeddings index from scratch. It is appropriate for initial index creation or when the entire corpus needs to be reindexed. For adding documents to an existing index, use upsert() instead. The reindex parameter is used internally when reconstructing an index from a stored database with a new configuration.
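The practical difference between rebuilding and appending can be shown with a minimal toy model of the two semantics (a dict-backed sketch, not txtai's implementation):

```python
class ToyIndex:
    """Toy model of index() vs upsert() semantics only."""

    def __init__(self):
        self.store = {}

    def index(self, documents):
        # index() rebuilds from scratch: any existing entries are discarded
        self.store = dict(documents)

    def upsert(self, documents):
        # upsert() merges: existing ids are updated, new ids are added
        self.store.update(dict(documents))

    def count(self):
        return len(self.store)

idx = ToyIndex()
idx.index([(0, "first"), (1, "second")])
idx.upsert([(1, "second, revised"), (2, "third")])
print(idx.count())  # 3: id 1 updated in place, id 2 added

idx.index([(9, "fresh start")])
print(idx.count())  # 1: index() discarded the previous contents
```

Calling `index()` on a populated Embeddings instance behaves like the second call above: everything previously indexed is replaced.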
## Code Reference

### Source Location

- Repository: txtai
- File: `src/python/txtai/embeddings/base.py` (lines 103-153)
### Signature

```python
def index(self, documents, reindex=False, checkpoint=None):
```

### Import

```python
from txtai.embeddings import Embeddings
```
## I/O Contract

### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| documents | iterable | Yes | An iterable of documents in one of several formats: (id, data, tags) tuples, (id, data) tuples, dict objects with named fields, or plain str values. The Stream normalizer converts all formats into a canonical internal representation. For plain strings, sequential integer IDs are assigned automatically. |
| reindex | bool | No | When True, indicates this is a reindex operation where the existing database should be preserved and only the vector index rebuilt. The database creation step is skipped. Defaults to False. |
| checkpoint | str or None | No | Path to a checkpoint directory. When provided, enables indexing restart capability. If the operation is interrupted and restarted with the same checkpoint path, it resumes from where it left off. Defaults to None (no checkpointing). |
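The four accepted input formats can be reduced to one canonical shape. A dependency-free sketch of that normalization (illustrative only; txtai's Stream class does the real work, and exactly how it derives ids from dicts is an assumption here):

```python
def normalize(documents):
    """Illustrative sketch: yield canonical (id, data, tags) tuples."""
    autoid = 0
    for document in documents:
        if isinstance(document, str):
            yield (autoid, document, None)          # plain string: auto-assigned integer id
            autoid += 1
        elif isinstance(document, dict):
            yield (document["id"], document, None)  # dict: named fields kept as data
        elif len(document) == 2:
            yield (document[0], document[1], None)  # (id, data) tuple
        else:
            yield tuple(document)                   # already (id, data, tags)

print(list(normalize([
    "plain string",
    ("a", "pair tuple"),
    ("b", "full tuple", None),
    {"id": "c", "text": "dict document"},
])))
```

All four formats can be mixed in a single call, since each item is normalized independently.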
### Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | This method returns None. It operates via side effects, populating: self.ann (ANN index with document embeddings), self.ids (ID mapping list, only when content storage is disabled and not reindexing), self.config["dimensions"] (vector dimensionality), self.reducer (PCA model if configured), self.scoring (sparse index if configured), self.indexes (subindexes if configured), self.graph (graph index if configured), self.database (document database if content is enabled). |
## Usage Examples

### Basic Example

```python
from txtai.embeddings import Embeddings

# Create embeddings instance
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index documents as (id, text, None) tuples
documents = [
    (0, "US tops 5 million confirmed virus cases", None),
    (1, "Canada's last fully intact ice shelf has suddenly collapsed", None),
    (2, "Beijing launches high-tech citywide expenses tracking", None),
    (3, "The National Park Service warns against sacrificing slower friends", None),
    (4, "Maine moose are getting ticks at an alarming rate", None)
]

embeddings.index(documents)

# Verify the index was built
print(embeddings.count())
# Output: 5
```
### Example: Indexing Plain Strings

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2"
})

# Plain strings get auto-assigned integer IDs (0, 1, 2, ...)
embeddings.index([
    "Semantic search with deep learning",
    "Natural language processing advances",
    "Vector databases for AI applications"
])

print(embeddings.count())
# Output: 3
```
### Example: Indexing Dictionaries

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Index dictionaries with custom fields
documents = [
    {"id": "doc1", "text": "Machine learning fundamentals", "category": "AI"},
    {"id": "doc2", "text": "Database optimization techniques", "category": "DB"},
    {"id": "doc3", "text": "Neural network architectures", "category": "AI"}
]

embeddings.index(documents)

# Search returns dict results when content is enabled
results = embeddings.search("deep learning", 2)
print(results)
```
### Example: Indexing with Checkpoint

```python
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True
})

# Large dataset with checkpoint for restart capability
def generate_documents():
    for i in range(100000):
        yield (i, f"Document number {i} with searchable content", None)

embeddings.index(generate_documents(), checkpoint="/tmp/index_checkpoint")
```
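The restart semantics can be illustrated with a toy resumable batch loop. The file layout and `index_with_checkpoint` function below are illustrative, not txtai's checkpoint format:

```python
import json
import os
import tempfile

def index_with_checkpoint(documents, process, checkpoint=None):
    """Process documents in order, recording progress so an interrupted
    run restarted with the same checkpoint path skips completed work."""
    start = 0
    if checkpoint and os.path.exists(checkpoint):
        with open(checkpoint) as f:
            start = json.load(f)["completed"]        # resume point

    for position in range(start, len(documents)):
        process(documents[position])
        if checkpoint:
            with open(checkpoint, "w") as f:
                json.dump({"completed": position + 1}, f)

    return len(documents) - start                    # documents processed this run

processed = []
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# First run "crashes" partway: simulate by passing only the first two documents
index_with_checkpoint(["d0", "d1"], processed.append, checkpoint=path)

# Restarted run resumes after the recorded position instead of starting over
done = index_with_checkpoint(["d0", "d1", "d2", "d3"], processed.append, checkpoint=path)
print(done, processed)  # 2 ['d0', 'd1', 'd2', 'd3']
```

Without a checkpoint path the loop always starts from zero, which mirrors the default `checkpoint=None` behavior.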