Implementation:Neuml Txtai Embeddings Reindex

From Leeroopedia


Overview

This page documents the Embeddings.__init__ constructor and the Embeddings.reindex method, which together provide the mechanism for creating embeddings indexes with custom models and rebuilding existing indexes when the underlying model changes. These are the primary integration points for connecting fine-tuned or exported models to the txtai search pipeline.

API

Embeddings.__init__

def __init__(self, config=None, models=None, **kwargs)

Creates a new embeddings index. Embeddings indexes are thread-safe for read operations, but writes must be synchronized by the caller. The constructor initializes all internal components to None and then applies the provided configuration.

Parameters:

| Name | Type | Default | Description |
|---|---|---|---|
| config | dict or None | None | Embeddings configuration dictionary. Supports all txtai configuration keys including path, content, scoring, graph, and others. |
| models | dict or None | None | Models cache for sharing loaded models between multiple embeddings instances (used with subindexes). |
| **kwargs | dict | {} | Additional configuration as keyword arguments. Merged with config if both are provided. |

Returns: None

Key configuration parameters for model integration:

| Parameter | Type | Description |
|---|---|---|
| path | str | Path to the embedding model. Accepts a Hugging Face model hub identifier or local directory path. |
| content | bool or str | Enables document content storage. Required for reindexing. |
| dimensions | int | Number of embedding dimensions (auto-detected during indexing). |
| scoring | str or dict | Sparse scoring configuration (e.g., "bm25" or a dict with method and parameters). |
| graph | bool or dict | Enables graph network storage. |
| indexes | dict | Subindex configurations for multi-index setups. |

Initialization sequence:

  1. All internal components are set to None: config, reducer, model, ann, ids, database, functions, graph, scoring, query, archive, indexes.
  2. The models cache is stored for shared model access.
  3. If both config and kwargs are provided, they are merged into a single dictionary (kwargs values take precedence).
  4. self.configure(config) is called to load configuration-driven models (vectors model, scoring, query model).

Example:

from txtai import Embeddings

# Create with default model (sentence-transformers/all-MiniLM-L6-v2)
embeddings = Embeddings()

# Create with a custom fine-tuned model
embeddings = Embeddings(path="path/to/finetuned-model")

# Create with content storage enabled (required for reindex)
embeddings = Embeddings(path="path/to/finetuned-model", content=True)

# Create with a dict configuration
config = {
    "path": "path/to/finetuned-model",
    "content": True,
    "scoring": {"method": "bm25", "terms": True}
}
embeddings = Embeddings(config)

# Configuration via kwargs
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)

Embeddings.reindex

def reindex(self, config=None, function=None, **kwargs)

Recreates the embeddings index using a new configuration. This method only works if document content storage is enabled (content=True), because the original document text must be available to regenerate vectors with the new model.
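
Because reindex returns without doing anything when no document database exists (see the guard check in the execution flow), a caller may prefer an explicit failure. The sketch below is one possible wrapper, assuming the database attribute described on this page; reindex_or_fail is a hypothetical name, and the SimpleNamespace object is only a stand-in used to demonstrate the check without loading a model.

```python
from types import SimpleNamespace


def reindex_or_fail(embeddings, config=None, **kwargs):
    """Hypothetical wrapper: raise instead of silently skipping the reindex."""
    # reindex does nothing when there is no document database
    if not getattr(embeddings, "database", None):
        raise RuntimeError("reindex requires content storage (content=True)")
    embeddings.reindex(config, **kwargs)


# Stand-in object without a document database (assumption for illustration)
no_content = SimpleNamespace(database=None, reindex=lambda config=None, **kwargs: None)
try:
    reindex_or_fail(no_content, path="path/to/finetuned-model")
except RuntimeError as error:
    print(error)
```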

Parameters:

| Name | Type | Default | Description |
|---|---|---|---|
| config | dict or None | None | New configuration dictionary. May include a new model path, index parameters, or any other valid embeddings configuration. |
| function | callable or None | None | Optional function to prepare content for indexing. Receives the document stream from the database and returns a transformed stream. |
| **kwargs | dict | {} | Additional configuration as keyword arguments. Merged with config. |

Returns: None (modifies the embeddings instance in place).
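
A function passed to reindex is most naturally written as a generator over the document stream. The sketch below assumes each item is a (uid, text, tags) tuple with string text, as in the examples on this page; normalize is a hypothetical name, and the sample list stands in for the stream the database would supply.

```python
def normalize(documents):
    """Hypothetical transform: collapse runs of whitespace in each document."""
    for uid, text, tags in documents:
        # Split on any whitespace and rejoin with single spaces
        yield (uid, " ".join(text.split()), tags)


# Stand-in for the stream that reindex would pass to the function
sample = [(0, "  Machine   learning ", None), (1, "Computer vision", None)]
print(list(normalize(sample)))
# [(0, 'Machine learning', None), (1, 'Computer vision', None)]
```

A generator keeps the transform lazy, so documents are processed one at a time as they are read. It would be supplied as embeddings.reindex(config, function=normalize).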

Example:

from txtai import Embeddings

# Create and index with the default model
embeddings = Embeddings(content=True)
embeddings.index([
    (0, "Machine learning is a branch of AI", None),
    (1, "Natural language processing analyzes text", None),
    (2, "Computer vision processes images", None)
])

# Reindex with a fine-tuned model
embeddings.reindex({"path": "path/to/finetuned-model"})

# Reindex with a new model and keyword arguments
embeddings.reindex(path="sentence-transformers/all-mpnet-base-v2")

# Reindex with a transform function
def transform(documents):
    for uid, text, tags in documents:
        yield (uid, text.upper(), tags)

embeddings.reindex({"path": "new-model"}, function=transform)

Execution flow:

  1. Guard check -- Returns immediately, without raising an error, if self.database is not set (content storage not enabled).
  2. Merge configuration -- Combines config and kwargs into a single dictionary.
  3. Preserve content settings -- Forces config["content"] to match the current self.config["content"] value. If "objects" is in the current config, it is also preserved. This ensures the document database is never lost during reindexing.
  4. Reconfigure -- Calls self.configure(config) to reload the vector model, scoring, and query model based on the new configuration.
  5. Reset functions -- If self.functions exists (custom SQL functions), calls self.functions.reset() to clear stale references.
  6. Reindex documents -- Reads all documents from the database via self.database.reindex(self.config). If a function is provided, it is applied to the document stream. The result is passed to self.index(..., reindex=True), which rebuilds the vector index without recreating the database.
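
The steps above can be sketched with a simplified stand-in class. This mirrors only the control flow described in the list; it is not txtai's implementation. SketchEmbeddings and its list-backed "database" are assumptions made for illustration, and the functions reset in step 5 is omitted.

```python
class SketchEmbeddings:
    """Simplified stand-in that mirrors the control flow only; not txtai's code."""

    def __init__(self, config, documents=None):
        self.config = config
        # A list stands in for the document database; None means content is disabled
        self.database = documents
        self.indexed = []

    def configure(self, config):
        # Stand-in for reloading the vector, scoring and query models
        self.config = config

    def index(self, documents, reindex=False):
        # Stand-in for rebuilding the vector index; the database is left untouched
        self.indexed = list(documents)

    def reindex(self, config=None, function=None, **kwargs):
        # 1. Guard check: nothing to do without stored content
        if not self.database:
            return

        # 2. Merge configuration (kwargs take precedence)
        config = {**(config or {}), **kwargs}

        # 3. Preserve content settings from the current configuration
        config["content"] = self.config["content"]
        if "objects" in self.config:
            config["objects"] = self.config["objects"]

        # 4. Reconfigure with the new settings
        self.configure(config)

        # 6. Stream stored documents, optionally transformed, into index
        stream = iter(self.database)
        self.index(function(stream) if function else stream, reindex=True)


emb = SketchEmbeddings({"path": "model-a", "content": True}, [(0, "doc one", None)])
emb.reindex({"path": "model-b", "content": False})
print(emb.config)  # {'path': 'model-b', 'content': True}
```

The caller's content=False is overridden by the existing setting, which is how the document store survives a reindex.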

Important behaviors:

  • The reindex=True flag passed to self.index() skips database creation, preserving the existing document store.
  • All dense vectors (ANN index), sparse vectors (scoring index), subindexes, and graph networks are rebuilt from scratch.
  • The dimensionality reduction model (PCA/LSA) is also rebuilt if configured.
  • If the new model produces vectors of a different dimensionality, the dimensions config parameter is automatically updated.

Embeddings.configure

def configure(self, config)

Sets the configuration for the embeddings index and loads configuration-driven models. Called by both __init__ and reindex.

Parameters:

| Name | Type | Description |
|---|---|---|
| config | dict or None | Embeddings configuration |

Returns: None

Behavior:

  1. Sets self.config to the provided configuration.
  2. Resets the dimensionality reducer to None.
  3. Creates a scoring instance if scoring is in the config and is not a sparse index type.
  4. Loads the dense vector model via self.loadvectors() if config is set.
  5. Loads the query model via self.loadquery() if config is set.

Source

  • src/python/txtai/embeddings/base.py (lines 30-83 for __init__)
  • src/python/txtai/embeddings/base.py (lines 260-290 for reindex)

Import

from txtai import Embeddings

Complete Workflow Example

The following example demonstrates the full model training and integration workflow:

from txtai import Embeddings
from txtai.pipeline import HFTrainer, HFOnnx

# Step 1: Prepare training data
train = [
    {"text": "positive review", "label": 1},
    {"text": "negative review", "label": 0}
]

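# Step 2: Fine-tune a model
# output_dir saves the trained model so that Step 5 can load it by path
trainer = HFTrainer()
model, tokenizer = trainer(
    "sentence-transformers/all-MiniLM-L6-v2",
    train,
    task="text-classification",
    output_dir="models/finetuned",
    num_train_epochs=3
)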

# Step 3: Optionally export to ONNX
onnx = HFOnnx()
onnx((model, tokenizer), task="default", output="models/finetuned.onnx")

# Step 4: Create embeddings with content storage
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([
    (0, "document one", None),
    (1, "document two", None)
])

# Step 5: Reindex with the fine-tuned model
embeddings.reindex(path="models/finetuned")

# Step 6: Search with the new model
results = embeddings.search("query text", limit=5)
