Implementation:Neuml Txtai Embeddings Reindex

Overview

This page documents the Embeddings.__init__ constructor, the Embeddings.reindex method, and the Embeddings.configure method that both of them call. Together they provide the mechanism for creating embeddings indexes with custom models and for rebuilding existing indexes when the underlying model changes. These are the primary integration points for connecting fine-tuned or exported models to the txtai search pipeline.

API

Embeddings.__init__

def __init__(self, config=None, models=None, **kwargs)

Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized. The constructor initializes all internal components to None and applies the provided configuration.

Parameters:

Name	Type	Default	Description
config	dict or None	None	Embeddings configuration dictionary. Supports all txtai configuration keys including `path`, `content`, `scoring`, `graph`, and others.
models	dict or None	None	Models cache for sharing loaded models between multiple embeddings instances (used with subindexes).
**kwargs	dict	{}	Additional configuration as keyword arguments. Merged with `config` if both are provided.

Returns: None

Key configuration parameters for model integration:

Parameter	Type	Description
path	str	Path to the embedding model. Accepts a Hugging Face model hub identifier or local directory path.
content	bool or str	Enables document content storage. Required for reindexing.
dimensions	int	Number of embedding dimensions (auto-detected during indexing).
scoring	str or dict	Sparse scoring configuration (e.g., `"bm25"` or a dict with method and parameters).
graph	bool or dict	Enables graph network storage.
indexes	dict	Subindex configurations for multi-index setups.

Initialization sequence:

All internal components are set to None: config, reducer, model, ann, ids, database, functions, graph, scoring, query, archive, indexes.
The models cache is stored for shared model access.
If both config and kwargs are provided, they are merged into a single dictionary (kwargs values take precedence).
self.configure(config) is called to load configuration-driven models (vectors model, scoring, query model).
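
The merge rule in the sequence above can be illustrated without txtai. The following is a minimal sketch, assuming the precedence described on this page (kwargs override config); `merge_config` is a hypothetical name used only for illustration and is not part of the txtai API.

```python
def merge_config(config=None, **kwargs):
    """Illustrative stand-in for the documented merge step (not txtai source)."""
    if config and kwargs:
        # Later keys win, so keyword arguments override the dictionary
        return {**config, **kwargs}
    # Only one (or neither) was provided: use whichever is set
    return kwargs if kwargs else config


merged = merge_config({"path": "model-a", "content": True}, path="model-b")
print(merged)  # {'path': 'model-b', 'content': True}
```

When neither argument is given, the sketch returns None, which corresponds to an embeddings instance with no configuration.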

Example:

from txtai import Embeddings

# Create with default model (sentence-transformers/all-MiniLM-L6-v2)
embeddings = Embeddings()

# Create with a custom fine-tuned model
embeddings = Embeddings(path="path/to/finetuned-model")

# Create with content storage enabled (required for reindex)
embeddings = Embeddings(path="path/to/finetuned-model", content=True)

# Create with a dict configuration
config = {
    "path": "path/to/finetuned-model",
    "content": True,
    "scoring": {"method": "bm25", "terms": True}
}
embeddings = Embeddings(config)

# Configuration via kwargs, using a Hugging Face model hub identifier
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)

Embeddings.reindex

def reindex(self, config=None, function=None, **kwargs)

Recreates the embeddings index using a new configuration. This method only works if document content storage is enabled (content=True), because the original document text must be available to regenerate vectors with the new model.

Parameters:

Name	Type	Default	Description
config	dict or None	None	New configuration dictionary. May include a new model `path`, index parameters, or any other valid embeddings configuration.
function	callable or None	None	Optional function to prepare content for indexing. Receives the document stream from the database and returns a transformed stream.
**kwargs	dict	{}	Additional configuration as keyword arguments. Merged with `config`.

Returns: None (modifies the embeddings instance in place).
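
Because reindex returns None whether or not it did any work, a caller may want to fail loudly when content storage is unavailable. The following is a minimal sketch of such a wrapper. `reindex_or_raise` and `FakeEmbeddings` are hypothetical names introduced here; the fake class only exists so the sketch runs without txtai installed, and the check relies on the `database` attribute described on this page.

```python
class FakeEmbeddings:
    """Hypothetical stand-in for an Embeddings instance (not txtai code)."""

    def __init__(self, database=None):
        self.database = database
        self.calls = []

    def reindex(self, config=None, function=None, **kwargs):
        # Record the call instead of rebuilding a real index
        self.calls.append((config, function, kwargs))


def reindex_or_raise(embeddings, config=None, function=None, **kwargs):
    """Hypothetical helper: raise instead of silently skipping the reindex."""
    if getattr(embeddings, "database", None) is None:
        raise RuntimeError("reindex requires an index built with content storage enabled")
    embeddings.reindex(config, function, **kwargs)


# An instance with a document database is reindexed as usual
ready = FakeEmbeddings(database=object())
reindex_or_raise(ready, path="path/to/finetuned-model")

# An instance without a database raises instead of doing nothing
try:
    reindex_or_raise(FakeEmbeddings(), path="path/to/finetuned-model")
except RuntimeError as error:
    print(error)
```

The same helper should work with a real Embeddings object, since it only reads `database` and calls `reindex` with the signature shown above.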

Example:

from txtai import Embeddings

# Create and index with the default model
embeddings = Embeddings(content=True)
embeddings.index([
    (0, "Machine learning is a branch of AI", None),
    (1, "Natural language processing analyzes text", None),
    (2, "Computer vision processes images", None)
])

# Reindex with a fine-tuned model
embeddings.reindex({"path": "path/to/finetuned-model"})

# Reindex with a new model and keyword arguments
embeddings.reindex(path="sentence-transformers/all-mpnet-base-v2")

# Reindex with a transform function
def transform(documents):
    for uid, text, tags in documents:
        yield (uid, text.upper(), tags)

embeddings.reindex({"path": "new-model"}, function=transform)

Execution flow:

Guard check -- Returns immediately, without raising an error, if self.database is None (content storage is not enabled, or no index has been built or loaded yet).
Merge configuration -- Combines config and kwargs into a single dictionary.
Preserve content settings -- Forces config["content"] to match the current self.config["content"] value. If "objects" is in the current config, it is also preserved. This ensures the document database is never lost during reindexing.
Reconfigure -- Calls self.configure(config) to reload the vector model, scoring, and query model based on the new configuration.
Reset functions -- If self.functions exists (custom SQL functions), calls self.functions.reset() to clear stale references.
Reindex documents -- Reads all documents from the database via self.database.reindex(self.config). If a function is provided, it is applied to the document stream. The result is passed to self.index(..., reindex=True), which rebuilds the vector index without recreating the database.
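
The steps above can be sketched with a toy class. This is a simplified illustration under stated assumptions, not the txtai implementation: `ToyIndex` is a hypothetical name, the "database" is a plain list, the "vectors" are placeholder tuples, and the functions reset step is omitted.

```python
class ToyIndex:
    """Toy stand-in that mirrors the documented reindex flow (not txtai code)."""

    def __init__(self, **config):
        self.config = config
        self.database = None
        self.vectors = {}

    def configure(self, config):
        self.config = config

    def index(self, documents, reindex=False):
        documents = list(documents)
        # A normal index call creates the document store; reindex=True keeps it
        if not reindex and self.config.get("content"):
            self.database = documents
        # Placeholder "embedding": pair each text with the configured model name
        self.vectors = {uid: (self.config.get("path"), text) for uid, text, _ in documents}

    def reindex(self, config=None, function=None, **kwargs):
        # Guard check: nothing to do without stored content
        if self.database is None:
            return
        # Merge configuration, kwargs take precedence
        config = {**(config or {}), **kwargs}
        # Preserve content settings from the current configuration
        config["content"] = self.config["content"]
        if "objects" in self.config:
            config["objects"] = self.config["objects"]
        # Reconfigure, then rebuild vectors from the stored documents
        self.configure(config)
        stream = iter(self.database)
        self.index(function(stream) if function else stream, reindex=True)


toy = ToyIndex(path="model-a", content=True)
toy.index([(0, "first", None), (1, "second", None)])

# The content=False override is ignored because the current setting is preserved
toy.reindex(path="model-b", content=False)
print(toy.config)      # {'path': 'model-b', 'content': True}
print(toy.vectors[0])  # ('model-b', 'first')
```

The point of the sketch is the ordering: the content setting is copied from the old configuration before the new one is applied, so the stored documents survive the model change.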

Important behaviors:

The reindex=True flag passed to self.index() skips database creation, preserving the existing document store.
All dense vectors (ANN index), sparse vectors (scoring index), subindexes, and graph networks are rebuilt from scratch.
The dimensionality reduction model (PCA/LSA) is also rebuilt if configured.
If the new model produces vectors of a different dimensionality, the dimensions config parameter is automatically updated.

Embeddings.configure

def configure(self, config)

Sets the configuration for the embeddings index and loads configuration-driven models. Called by both __init__ and reindex.

Parameters:

Name	Type	Description
config	dict or None	Embeddings configuration

Returns: None

Behavior:

Sets self.config to the provided configuration.
Resets the dimensionality reducer to None.
Creates a scoring instance if scoring is in the config and is not a sparse index type.
Loads the dense vector model via self.loadvectors() if config is set.
Loads the query model via self.loadquery() if config is set.
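
The conditions in the list above can be read as a small decision table. The following sketch only reports which components would be created for a given configuration. `plan_configure` and its `has_sparse` flag are hypothetical names standing in for the checks described above; no models are loaded.

```python
def plan_configure(config, has_sparse=False):
    """Hypothetical helper: report what the documented configure steps would do."""
    scoring = config.get("scoring") if config else None
    return {
        # Scoring instance only when configured and not a sparse index
        "creates_scoring": bool(scoring and not has_sparse),
        # Vector and query models are only loaded when a config is set
        "loads_vectors": bool(config),
        "loads_query": bool(config),
    }


print(plan_configure(None))
print(plan_configure({"path": "path/to/finetuned-model", "scoring": "bm25"}))
```

Note that an empty dictionary behaves like None in this sketch, since both are falsy in Python.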

Source

src/python/txtai/embeddings/base.py (lines 30-83 for __init__)
src/python/txtai/embeddings/base.py (lines 260-290 for reindex)

Import

from txtai import Embeddings

Complete Workflow Example

The following example demonstrates the full model training and integration workflow:

from txtai import Embeddings
from txtai.pipeline import HFTrainer, HFOnnx

# Step 1: Prepare training data
train = [
    {"text": "positive review", "label": 1},
    {"text": "negative review", "label": 0}
]

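# Step 2: Fine-tune a model
# output_dir is a Hugging Face TrainingArguments field; setting it saves the
# trained model to disk so that Step 5 can load it by path
trainer = HFTrainer()
model, tokenizer = trainer(
    "sentence-transformers/all-MiniLM-L6-v2",
    train,
    task="text-classification",
    num_train_epochs=3,
    output_dir="models/finetuned"
)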

# Step 3: Optionally export to ONNX
onnx = HFOnnx()
onnx((model, tokenizer), task="default", output="models/finetuned.onnx")

# Step 4: Create embeddings with content storage
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([
    (0, "document one", None),
    (1, "document two", None)
])

# Step 5: Reindex with the fine-tuned model
embeddings.reindex(path="models/finetuned")

# Step 6: Search with the new model
results = embeddings.search("query text", limit=5)
