Implementation:Neuml Txtai Embeddings Reindex
Overview
This page documents the Embeddings.__init__ constructor, the Embeddings.reindex method, and the Embeddings.configure method that both of them call. Together they provide the mechanism for creating embeddings indexes with custom models and for rebuilding existing indexes when the underlying model changes. These are the primary integration points for connecting fine-tuned or exported models to the txtai search pipeline.
API
Embeddings.__init__
def __init__(self, config=None, models=None, **kwargs)
Creates a new embeddings index. Embeddings indexes are thread-safe for read operations but writes must be synchronized. The constructor initializes all internal components to None and applies the provided configuration.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| config | dict or None | None | Embeddings configuration dictionary. Supports all txtai configuration keys including path, content, scoring, graph, and others. |
| models | dict or None | None | Models cache for sharing loaded models between multiple embeddings instances (used with subindexes). |
| **kwargs | dict | {} | Additional configuration as keyword arguments. Merged with config if both are provided. |
Returns: None
Key configuration parameters for model integration:
| Parameter | Type | Description |
|---|---|---|
| path | str | Path to the embedding model. Accepts a Hugging Face model hub identifier or local directory path. |
| content | bool or str | Enables document content storage. Required for reindexing. |
| dimensions | int | Number of embedding dimensions (auto-detected during indexing). |
| scoring | str or dict | Sparse scoring configuration (e.g., "bm25" or a dict with method and parameters). |
| graph | bool or dict | Enables graph network storage. |
| indexes | dict | Subindex configurations for multi-index setups. |
Initialization sequence:
- All internal components are set to None: config, reducer, model, ann, ids, database, functions, graph, scoring, query, archive, indexes.
- The models cache is stored for shared model access.
- If both config and kwargs are provided, they are merged into a single dictionary (kwargs values take precedence).
- self.configure(config) is called to load configuration-driven models (vectors model, scoring, query model).
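The merge step in the sequence above can be sketched in plain Python. This is a minimal illustration of the described behavior, not the library's source; merge_config is a hypothetical helper name introduced only for this example.

```python
def merge_config(config=None, **kwargs):
    """Hypothetical helper that mirrors the merge step described above."""
    if config and kwargs:
        # Later keys win in a dict merge, so kwargs override matching config keys
        return {**config, **kwargs}

    # Only one (or neither) was provided: use whichever is set
    return kwargs if kwargs else config


# kwargs take precedence over the dict for the shared "path" key
merged = merge_config({"path": "model-a", "content": True}, path="model-b")
print(merged)  # {'path': 'model-b', 'content': True}
```

Keys that appear only in the dictionary (content in this case) survive the merge unchanged.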
Example:
from txtai import Embeddings
# Create with default model (sentence-transformers/all-MiniLM-L6-v2)
embeddings = Embeddings()
# Create with a custom fine-tuned model
embeddings = Embeddings(path="path/to/finetuned-model")
# Create with content storage enabled (required for reindex)
embeddings = Embeddings(path="path/to/finetuned-model", content=True)
# Create with a dict configuration
config = {
    "path": "path/to/finetuned-model",
    "content": True,
    "scoring": {"method": "bm25", "terms": True}
}
embeddings = Embeddings(config)
# Configuration via kwargs
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
Embeddings.reindex
def reindex(self, config=None, function=None, **kwargs)
Recreates the embeddings index using a new configuration. This method only works if document content storage is enabled (content=True), because the original document text must be available to regenerate vectors with the new model.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| config | dict or None | None | New configuration dictionary. May include a new model path, index parameters, or any other valid embeddings configuration. |
| function | callable or None | None | Optional function to prepare content for indexing. Receives the document stream from the database and returns a transformed stream. |
| **kwargs | dict | {} | Additional configuration as keyword arguments. Merged with config. |
Returns: None (modifies the embeddings instance in place).
Example:
from txtai import Embeddings
# Create and index with the default model
embeddings = Embeddings(content=True)
embeddings.index([
    (0, "Machine learning is a branch of AI", None),
    (1, "Natural language processing analyzes text", None),
    (2, "Computer vision processes images", None)
])
# Reindex with a fine-tuned model
embeddings.reindex({"path": "path/to/finetuned-model"})
# Reindex with a new model and keyword arguments
embeddings.reindex(path="sentence-transformers/all-mpnet-base-v2")
# Reindex with a transform function
def transform(documents):
    # Receives the stored document stream and yields modified documents
    for uid, text, tags in documents:
        yield (uid, text.upper(), tags)
embeddings.reindex({"path": "new-model"}, function=transform)
Execution flow:
- Guard check -- Returns immediately if self.database is None (content storage not enabled).
- Merge configuration -- Combines config and kwargs into a single dictionary.
- Preserve content settings -- Forces config["content"] to match the current self.config["content"] value. If "objects" is in the current config, it is also preserved. This ensures the document database is never lost during reindexing.
- Reconfigure -- Calls self.configure(config) to reload the vector model, scoring, and query model based on the new configuration.
- Reset functions -- If self.functions exists (custom SQL functions), calls self.functions.reset() to clear stale references.
- Reindex documents -- Reads all documents from the database via self.database.reindex(self.config). If a function is provided, it is applied to the document stream. The result is passed to self.index(..., reindex=True), which rebuilds the vector index without recreating the database.
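The flow above can be traced with a small stand-in that needs neither txtai nor a model download. SketchDatabase and SketchEmbeddings are hypothetical classes written only for this illustration; they record the document stream instead of computing vectors, and the function-reset step is omitted because the sketch has no SQL functions.

```python
class SketchDatabase:
    """Hypothetical stand-in for the content database."""

    def __init__(self, rows):
        self.rows = rows

    def reindex(self, config):
        # Stream stored documents back out as (id, text, tags) tuples
        yield from self.rows


class SketchEmbeddings:
    """Hypothetical, heavily simplified model of the reindex flow."""

    def __init__(self, config, rows=None):
        self.config = config
        # A database only exists when content storage is enabled
        self.database = SketchDatabase(rows or []) if config.get("content") else None
        self.indexed = []

    def configure(self, config):
        self.config = config

    def index(self, documents, reindex=False):
        # A real index would compute vectors; this sketch only records the stream
        self.indexed = list(documents)

    def reindex(self, config=None, function=None, **kwargs):
        # Guard check: nothing to rebuild from without stored content
        if self.database is None:
            return

        # Merge configuration (kwargs override the dict)
        config = {**(config or {}), **kwargs}

        # Preserve content settings so the document store is kept
        config["content"] = self.config["content"]
        if "objects" in self.config:
            config["objects"] = self.config["objects"]

        # Reconfigure, then rebuild from the stored documents
        self.configure(config)
        stream = self.database.reindex(self.config)
        self.index(function(stream) if function else stream, reindex=True)


def upper(documents):
    for uid, text, tags in documents:
        yield (uid, text.upper(), tags)


emb = SketchEmbeddings({"path": "model-a", "content": True},
                       [(0, "first", None), (1, "second", None)])

# The caller's content=False is overridden by the preserved setting
emb.reindex({"path": "model-b", "content": False}, function=upper)
```

After the call, emb.config holds the new path with content still True, and emb.indexed contains the transformed documents.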
Important behaviors:
- The reindex=True flag passed to self.index() skips database creation, preserving the existing document store.
- All dense vectors (ANN index), sparse vectors (scoring index), subindexes, and graph networks are rebuilt from scratch.
- The dimensionality reduction model (PCA/LSA) is also rebuilt if configured.
- If the new model produces vectors of a different dimensionality, the dimensions config parameter is automatically updated.
Embeddings.configure
def configure(self, config)
Sets the configuration for the embeddings index and loads configuration-driven models. Called by both __init__ and reindex.
Parameters:
| Name | Type | Description |
|---|---|---|
| config | dict or None | Embeddings configuration |
Returns: None
Behavior:
- Sets self.config to the provided configuration.
- Resets the dimensionality reducer to None.
- Creates a scoring instance if scoring is in the config and is not a sparse index type.
- Loads the dense vector model via self.loadvectors() if config is set.
- Loads the query model via self.loadquery() if config is set.
Source
- src/python/txtai/embeddings/base.py (lines 30-83 for __init__)
- src/python/txtai/embeddings/base.py (lines 260-290 for reindex)
Import
from txtai import Embeddings
Complete Workflow Example
The following example demonstrates the full model training and integration workflow:
from txtai import Embeddings
from txtai.pipeline import HFTrainer, HFOnnx
# Step 1: Prepare training data
train = [
    {"text": "positive review", "label": 1},
    {"text": "negative review", "label": 0}
]
# Step 2: Fine-tune a model and save it to models/finetuned
# (output_dir is forwarded to the Hugging Face TrainingArguments)
trainer = HFTrainer()
model, tokenizer = trainer(
    "sentence-transformers/all-MiniLM-L6-v2",
    train,
    task="text-classification",
    num_train_epochs=3,
    output_dir="models/finetuned"
)
# Step 3: Optionally export to ONNX
onnx = HFOnnx()
onnx((model, tokenizer), task="default", output="models/finetuned.onnx")
# Step 4: Create embeddings with content storage
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index([
    (0, "document one", None),
    (1, "document two", None)
])
# Step 5: Reindex with the fine-tuned model
embeddings.reindex(path="models/finetuned")
# Step 6: Search with the new model
results = embeddings.search("query text", limit=5)
See Also
- Neuml_Txtai_Model_Integration -- Principle: model integration theory and reindexing concepts
- Neuml_Txtai_HFTrainer_Call -- Training models for integration
- Neuml_Txtai_HFOnnx_Call -- Exporting models to ONNX before integration