Principle:Run llama Llama index Finetuned Embedding Loading
Overview
Finetuned Embedding Loading covers the process of loading a finetuned embedding model back into memory after training and wrapping it in LlamaIndex's BaseEmbedding interface. This step bridges the gap between the training output (a saved model on disk) and the LlamaIndex embedding API that powers retrieval pipelines.
Concept: Model Serialization and Deserialization
After finetuning, the trained model must be serialized to disk and later deserialized for inference. The serialization format depends on the finetuning approach:
Sentence Transformers Full Model
A fully finetuned Sentence Transformer model is saved as a complete model directory containing:
- config.json -- Model architecture configuration
- pytorch_model.bin or model.safetensors -- Model weights
- tokenizer_config.json, vocab.txt, etc. -- Tokenizer files
- sentence_bert_config.json -- Sentence Transformers-specific configuration
This is a self-contained model that can be loaded by any Sentence Transformers-compatible loader.
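As a quick sanity check before loading, the directory layout described above can be verified by file name. The sketch below is a minimal illustration: `looks_like_full_model_dir` is a hypothetical helper, not part of LlamaIndex or Sentence Transformers, and it assumes the configuration and weights files sit at the top level of the output directory.

```python
from pathlib import Path

# File names taken from the list above; either weights file is acceptable.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def looks_like_full_model_dir(model_dir: str) -> bool:
    """Return True if model_dir has config.json and at least one weights file.

    Hypothetical helper for illustration only: it checks file names and
    does not verify that the files can actually be loaded.
    """
    root = Path(model_dir)
    if not (root / "config.json").is_file():
        return False
    return any((root / name).is_file() for name in WEIGHT_FILES)
```

A check like this only catches an incomplete or mistyped output path early; loading the model remains the real test.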
Adapter Weights
An adapter-based finetuned model saves only the adapter layer weights:
- The adapter weights file (e.g., a PyTorch state dict)
- The base model is not saved -- it must be available separately at load time
This results in a much smaller saved artifact, but requires the base model to be accessible.
Concept: Wrapping in the BaseEmbedding Interface
LlamaIndex uses the BaseEmbedding abstract class as the standard interface for all embedding models. A finetuned model must be wrapped in a class that implements this interface to be used in LlamaIndex pipelines.
The BaseEmbedding interface requires:
- _get_text_embedding(text: str) -> List[float] -- Embed a single document
- _get_query_embedding(query: str) -> List[float] -- Embed a single query
- _get_text_embeddings(texts: List[str]) -> List[List[float]] -- Batch document embedding (the default implementation calls _get_text_embedding once per text)
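The shape of such a wrapper can be sketched without LlamaIndex installed. The class below is a toy stand-in that mirrors the single-item method names; it does not inherit from BaseEmbedding, and `ToyEmbeddingWrapper`, `embed_many`, and `char_count_encoder` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List


class ToyEmbeddingWrapper:
    """Stand-in mirroring the interface's method names (not a BaseEmbedding subclass)."""

    def __init__(self, encode: Callable[[str], List[float]]) -> None:
        # `encode` is a hypothetical callable standing in for the loaded model.
        self._encode = encode

    def _get_text_embedding(self, text: str) -> List[float]:
        # Embed a single document.
        return self._encode(text)

    def _get_query_embedding(self, query: str) -> List[float]:
        # Embed a single query; this toy treats queries like documents.
        return self._encode(query)

    def embed_many(self, texts: List[str]) -> List[List[float]]:
        # Hypothetical batch helper: one call per text, no real batching.
        return [self._get_text_embedding(t) for t in texts]


def char_count_encoder(text: str) -> List[float]:
    """Toy encoder returning [length, number of spaces]; not a real embedding."""
    return [float(len(text)), float(text.count(" "))]
```

A real wrapper would subclass BaseEmbedding and delegate to the deserialized model instead of a toy encoder.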
Concept: Full Model Loading (Sentence Transformers)
For Sentence Transformers finetuning, the loading process uses the "local:" prefix convention:
- The get_finetuned_model() method constructs the string "local:{model_output_path}"
- This string is passed to resolve_embed_model(), which recognizes the "local:" prefix
- The resolver loads the model from the local directory and wraps it in LlamaIndex's embedding interface
This approach ensures the finetuned model is treated identically to any other embedding model in the system.
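The steps above can be sketched as follows. This is a minimal illustration, not the engine's actual source: `local_model_id` and `load_finetuned_embed_model` are hypothetical helper names, and the import path assumes the package layout of llama-index 0.10 or later with the HuggingFace embeddings integration installed.

```python
def local_model_id(model_output_path: str) -> str:
    """Build the "local:" identifier described above."""
    return f"local:{model_output_path}"


def load_finetuned_embed_model(model_output_path: str):
    """Resolve a saved Sentence Transformers directory into an embedding model.

    The import is deferred so this module can be imported without LlamaIndex
    installed. The import path is an assumption (llama-index >= 0.10 layout).
    """
    from llama_index.core.embeddings import resolve_embed_model

    return resolve_embed_model(local_model_id(model_output_path))
```

For example, `load_finetuned_embed_model("exp_finetune")` would resolve `"local:exp_finetune"`, where `exp_finetune` is a placeholder for whatever output directory was used during training.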
Concept: Adapter Model Loading
For adapter-based finetuning, the loading process is different:
- The get_finetuned_model() method creates an AdapterEmbeddingModel
- This model combines the original base embedding model with the trained adapter weights
- At inference time, inputs are first embedded by the base model, then transformed by the adapter; by default the adapter is applied to query embeddings, while document embeddings come from the base model unchanged
The AdapterEmbeddingModel composes both models transparently, presenting a single BaseEmbedding interface to consumers.
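The composition can be illustrated with a pure-Python toy. This is a sketch under stated assumptions, not the AdapterEmbeddingModel implementation: it assumes a linear adapter (a single weight matrix) applied to query embeddings only, and `ToyAdapterEmbedding` and `apply_linear_adapter` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]


def apply_linear_adapter(weights: List[Vector], vec: Vector) -> Vector:
    """Multiply a weight matrix (list of rows) by an embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyAdapterEmbedding:
    """Toy composition of a base encoder and a linear adapter."""

    def __init__(self, base_encode: Callable[[str], Vector], weights: List[Vector]) -> None:
        self._base_encode = base_encode  # stands in for the base embedding model
        self._weights = weights          # stands in for the trained adapter weights

    def _get_text_embedding(self, text: str) -> Vector:
        # Assumption: documents are embedded by the base model only.
        return self._base_encode(text)

    def _get_query_embedding(self, query: str) -> Vector:
        # Queries go through the base model first, then the adapter.
        return apply_linear_adapter(self._weights, self._base_encode(query))
```

One consequence of this design is that an existing document index built with the base model does not need to be re-embedded when only queries pass through the adapter.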
Concept: Model Resolution
LlamaIndex's resolve_embed_model() function handles various model identifier formats:
| Prefix | Behavior |
|---|---|
| local:{path} | Loads a Sentence Transformer model from a local directory |
| local:{model_name} | Loads the named HuggingFace model, downloading it from the Hub if it is not already cached |
| "default" | Loads the default embedding model (OpenAI text-embedding-ada-002) |
The finetuned model loading uses the "local:" prefix to point to the output directory where the finetuned model was saved during training.
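The prefix dispatch can be pictured with a small parser. This is an illustrative sketch only and not the code of resolve_embed_model(): `parse_embed_model_id` is a hypothetical function, and how the real resolver treats identifiers outside the two cases shown is deliberately left open.

```python
from typing import Optional, Tuple


def parse_embed_model_id(identifier: str) -> Tuple[str, Optional[str]]:
    """Split an identifier into (kind, value) for illustration.

    Hypothetical helper: it only recognizes "default" and the "local:" prefix
    and labels everything else "other".
    """
    if identifier == "default":
        return ("default", None)
    # Split on the first colon only, so paths containing colons stay intact.
    prefix, sep, rest = identifier.partition(":")
    if sep and prefix == "local":
        return ("local", rest)
    return ("other", identifier)
```

Splitting on the first colon matters for paths such as Windows drive letters, where the remainder itself contains a colon.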
Concept: Full Model vs Adapter Comparison
| Aspect | Full Model (Sentence Transformers) | Adapter Model |
|---|---|---|
| Saved artifact | Complete model directory (hundreds of MB) | Adapter weights only (KB to MB) |
| Load time dependency | Self-contained, no external model needed | Requires the original base model |
| Inference overhead | Same as any Sentence Transformer model | Base model inference + adapter forward pass |
| Portability | Fully portable | Must ship base model + adapter |
| LlamaIndex wrapper | resolve_embed_model("local:path") | AdapterEmbeddingModel(base_model, adapter_path) |
Knowledge Sources
- LlamaIndex Embedding Finetuning Guide
- Sentence Transformers Saving and Loading
Metadata
Machine Learning, Embeddings, Finetuning, Model Serialization, LlamaIndex
Implementation:Run_llama_Llama_index_EmbeddingFinetuneEngine_Get_Model