Principle:Neuml Txtai Model Integration
Overview
After a model has been fine-tuned or exported, it must be integrated back into the application's search and retrieval pipeline. In txtai, this means replacing the embedding model used by an Embeddings index and reindexing the existing document collection with the new model. The reindex operation regenerates all embedding vectors using the updated model while preserving the stored document content and metadata.
Model Integration
Model integration is the process of connecting a newly trained or exported model to an existing embeddings index. The fundamental challenge is that embedding vectors are model-specific -- vectors generated by one model are not comparable to vectors generated by a different model. Therefore, when the model changes, all existing vectors must be regenerated.
Why Reindexing Is Necessary
Embedding models map text to points in a high-dimensional vector space. The geometry of this space is determined entirely by the model's learned weights. Two different models will map the same text to different regions of their respective vector spaces. As a result:
- Similarity scores become meaningless if the query is encoded by a new model but the index contains vectors from the old model.
- Search quality degrades because the geometric relationships between documents no longer reflect the new model's understanding of semantic similarity.
- Dimensionality may differ between models, in which case a similarity score between old and new vectors cannot be computed at all.
Reindexing regenerates every vector in the index using the new model, ensuring that all vectors inhabit the same vector space and that similarity computations are valid.
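The dimensionality problem can be illustrated without any real model. The sketch below uses two toy encoders, old_model and new_model, which are hypothetical stand-ins invented for this example and are not part of txtai. They only show why a query vector from one model cannot be scored against index vectors from another.

```python
import math


def cosine(a, b):
    """Cosine similarity; only defined for vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def old_model(text):
    """Hypothetical 'old' encoder: 3 dimensions (letters, digits, whitespace)."""
    return [
        sum(c.isalpha() for c in text),
        sum(c.isdigit() for c in text),
        sum(c.isspace() for c in text),
    ]


def new_model(text):
    """Hypothetical 'new' encoder: 4 dimensions, each axis means something else."""
    words = text.split()
    return [
        len(words),
        sum(w[0].isupper() for w in words),
        max((len(w) for w in words), default=0),
        text.count("."),
    ]


# A vector that was stored in the index by the old model.
index_vector = old_model("txtai builds semantic search indexes.")

# Same model on both sides: the score is well defined.
same_space = cosine(old_model("semantic search"), index_vector)

# Query encoded by the new model against an old index vector: not comparable.
try:
    cosine(new_model("semantic search"), index_vector)
    mismatch = None
except ValueError as error:
    mismatch = str(error)
```

Even if both toy encoders happened to produce vectors of the same length, the score would be computable but meaningless, because each axis measures a different property of the text.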
Embedding Model Replacement
The Embeddings class accepts a path parameter in its configuration that specifies which model to use for generating vectors. Changing this path to point to a fine-tuned model is the primary mechanism for model replacement.
Configuration-Based Replacement
When creating a new Embeddings instance, the path parameter can be set to:
- A Hugging Face model hub identifier (e.g., "sentence-transformers/all-MiniLM-L6-v2").
- A local directory path containing a saved model (e.g., the output of HFTrainer).
- A custom model path pointing to a fine-tuned or ONNX-exported model.
The embeddings instance automatically loads the appropriate vector model based on this configuration. When the configuration changes, the model is reloaded.
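As a minimal sketch, swapping models comes down to changing one configuration value. The helper names embeddings_config and load_embeddings and the local directory "models/finetuned" are hypothetical; the path and content keys are the ones described in this document.

```python
def embeddings_config(path, content=True):
    """Build a configuration dictionary for an Embeddings instance."""
    return {"path": path, "content": content}


def load_embeddings(path):
    """Create an Embeddings instance backed by the model at `path`."""
    # Imported here so the config helper stays usable without txtai installed.
    from txtai.embeddings import Embeddings

    return Embeddings(embeddings_config(path))


# A hub identifier versus a local fine-tuned model (hypothetical directory).
hub_config = embeddings_config("sentence-transformers/all-MiniLM-L6-v2")
local_config = embeddings_config("models/finetuned")
```

Only the path value differs between the two configurations, which is what makes replacement a configuration change rather than a code change.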
Default Model Behavior
If no model path is specified and no sparse-only configuration is active, txtai defaults to loading sentence-transformers/all-MiniLM-L6-v2. This ensures that embeddings work out of the box for common use cases while still allowing easy replacement with a custom model.
Index Reindexing
The reindex method on the Embeddings class is the mechanism for regenerating vectors after a model change. This method requires content storage to be enabled (content=True in the configuration), because the original document text must be available to generate new vectors.
Reindex Workflow
- Receive new configuration -- The caller provides an updated configuration dictionary that may include a new model path, new index parameters, or both.
- Preserve content settings -- The content and objects parameters from the current configuration are automatically carried forward to ensure the document database is preserved.
- Reconfigure the embeddings instance -- The internal model, scoring, and query components are reloaded based on the new configuration.
- Reset function references -- If the index uses custom SQL functions, they are reset to reflect the new configuration.
- Regenerate vectors -- All documents are read from the database and re-encoded with the new model. The index is rebuilt from scratch using the new vectors.
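The steps above can be mirrored in a few lines of plain Python. ToyIndex below is a simplified illustration written for this page, not txtai's source code: the encoders are hypothetical lambdas keyed by a made-up model name, and the function-reset step is omitted.

```python
class ToyIndex:
    """Toy stand-in for an embeddings index with content storage (not txtai code)."""

    def __init__(self, config, encoders):
        self.encoders = encoders  # model path -> encoding function
        self.config = config
        self.database = {}        # id -> original text (content storage)
        self.vectors = {}         # id -> vector

    def index(self, documents):
        """Store each document's text and encode it with the configured model."""
        encode = self.encoders[self.config["path"]]
        self.vectors = {}
        for uid, text in documents:
            self.database[uid] = text
            self.vectors[uid] = encode(text)

    def reindex(self, config):
        # Steps 1-2: take the new configuration, carrying content settings forward.
        config = {**config, "content": self.config["content"]}
        # Step 3: reconfigure the instance (step 4, resetting functions, is omitted).
        self.config = config
        # Step 5: re-encode every stored document and rebuild vectors from scratch.
        self.index(list(self.database.items()))


# Hypothetical encoders with different output dimensionality.
encoders = {
    "old-model": lambda text: [len(text)],
    "new-model": lambda text: [len(text), len(text.split())],
}

index = ToyIndex({"path": "old-model", "content": True}, encoders)
index.index([(0, "first document"), (1, "second")])
index.reindex({"path": "new-model"})
```

After the call, the stored text is unchanged while every vector has been produced by the new encoder.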
Optional Transform Function
The reindex method accepts an optional function parameter that can transform the document content before re-encoding. This is useful for:
- Reformatting text -- Changing how document fields are combined into the indexed text.
- Adding metadata -- Enriching documents with additional information before reindexing.
- Filtering -- Excluding certain documents from the reindexed collection.
When provided, this function is applied to the stream of documents read from the database, and its output is passed to the indexing pipeline.
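A transform function of this kind can be written as a generator. The sketch below assumes that the stream yields (id, data, tags) tuples in which data is either a string or a dictionary with a "text" field; that layout and the name normalize_whitespace are assumptions made for illustration.

```python
def normalize_whitespace(documents):
    """Collapse runs of whitespace in each document before it is re-encoded."""
    for uid, data, tags in documents:
        if isinstance(data, dict):
            # Assumed layout: the indexed text lives under a "text" key.
            data = {**data, "text": " ".join(str(data.get("text", "")).split())}
        elif isinstance(data, str):
            data = " ".join(data.split())
        yield uid, data, tags


# Stand-in for the stream read from the document database.
stream = [
    (0, "hello   world\n", None),
    (1, {"text": " a  b ", "year": 2024}, None),
]
cleaned = list(normalize_whitespace(stream))
```

With a loaded index, such a function would be supplied through the function parameter, for example embeddings.reindex({"path": "models/finetuned"}, function=normalize_whitespace), where the model directory is hypothetical.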
Integration Patterns
Pattern 1: Train and Replace
The most common pattern involves training a new model and immediately using it:
- Train a model using HFTrainer, which returns a (model, tokenizer) tuple.
- Save the model to a local directory.
- Create a new Embeddings instance with path set to the saved model directory.
- Index documents with the new embeddings instance.
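A minimal sketch of this pattern follows. The function name train_and_index, its parameters, and the choice of the "language-modeling" task are assumptions for illustration; the appropriate task and training data format depend on how the model is being fine-tuned.

```python
def train_and_index(base, data, output, documents):
    """Fine-tune `base`, save it to `output`, then index `documents` with it.

    `documents` is expected to be an iterable of (id, text, tags) tuples.
    """
    # Imported lazily so this module can be loaded without txtai installed.
    from txtai.embeddings import Embeddings
    from txtai.pipeline import HFTrainer

    # 1. Train. The task is an assumption chosen for this sketch.
    trainer = HFTrainer()
    model, tokenizer = trainer(base, data, task="language-modeling")

    # 2. Save the model and tokenizer to a local directory.
    model.save_pretrained(output)
    tokenizer.save_pretrained(output)

    # 3-4. Point a new Embeddings instance at the saved model and index.
    embeddings = Embeddings({"path": output, "content": True})
    embeddings.index(documents)
    return embeddings
```

Enabling content storage at this point keeps the original text available, so a later model change can use an in-place reindex instead of a full rebuild from source data.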
Pattern 2: Train, Export, and Replace
For production deployments where inference speed matters:
- Train a model using HFTrainer.
- Export the model to ONNX using HFOnnx.
- Create a new Embeddings instance pointing to the ONNX model.
- Index documents.
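The export step might look like the sketch below. It assumes the model has already been trained and saved to model_path, that the "pooling" export task is the right one for sentence embeddings, and that the tokenizer has to be named separately because an ONNX file does not carry one. The helper names and file paths are hypothetical.

```python
def onnx_embeddings_config(onnx_path, tokenizer_path):
    """Configuration for an Embeddings instance backed by an ONNX file."""
    # Assumption: the tokenizer is loaded from a separate model path.
    return {"path": onnx_path, "tokenizer": tokenizer_path, "content": True}


def export_and_index(model_path, onnx_path, documents):
    """Export a saved model to ONNX, then index `documents` with the export."""
    # Imported lazily so the config helper works without txtai installed.
    from txtai.embeddings import Embeddings
    from txtai.pipeline import HFOnnx

    # Export with the "pooling" task so the ONNX model outputs pooled vectors.
    onnx = HFOnnx()
    onnx(model_path, "pooling", onnx_path)

    embeddings = Embeddings(onnx_embeddings_config(onnx_path, model_path))
    embeddings.index(documents)
    return embeddings


# Hypothetical file and directory names.
config = onnx_embeddings_config("embeddings.onnx", "models/finetuned")
```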
Pattern 3: In-Place Reindex
For existing indexes that need a model upgrade:
- Load an existing Embeddings index that has content storage enabled.
- Call reindex() with a new configuration specifying the new model path.
- The existing documents are automatically re-encoded and the index is rebuilt.
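In code, an in-place upgrade could be sketched as follows. The function name upgrade_index and its three path parameters are hypothetical, and the explicit check on the content setting is a defensive choice made for this example rather than something the document prescribes.

```python
def upgrade_index(index_path, new_model_path, output_path):
    """Load a saved index, re-encode it with a new model, and save the result."""
    # Imported lazily so this module can be loaded without txtai installed.
    from txtai.embeddings import Embeddings

    embeddings = Embeddings()
    embeddings.load(index_path)

    # Fail early if the original text was never stored with the index.
    if not embeddings.config.get("content"):
        raise ValueError("reindex needs an index built with content storage enabled")

    # Re-encode all stored documents with the new model.
    embeddings.reindex({"path": new_model_path})

    embeddings.save(output_path)
    return embeddings
```

Writing to a separate output path keeps the original index available for comparison or rollback until the upgraded index has been validated.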
Architectural Considerations
Content Storage Requirement
Reindexing is only possible when the original document content is stored alongside the index. Without content storage, the embeddings index only contains vectors and ID mappings -- the original text is not available for re-encoding. This is why content=True is a prerequisite for the reindex method.
Component Preservation
During reindexing, the document database is preserved while the vector index (ANN), scoring index, subindexes, and graph are rebuilt from scratch. This ensures that document content and metadata are never lost during a model change.
Sparse Index Interaction
If the embeddings configuration includes a sparse scoring index (e.g., BM25), it is also rebuilt during reindexing. The sparse index operates independently of the dense vector model, but it shares the same document collection and must be consistent with the dense index.
See Also
- Neuml_Txtai_Embeddings_Reindex -- Implementation details for Embeddings.__init__ and reindex
- Neuml_Txtai_Model_Fine_Tuning -- Fine-tuning models for integration
- Neuml_Txtai_ONNX_Export -- Exporting models before integration