Principle:Run llama Llama index Finetuned Embedding Loading

Overview

Finetuned Embedding Loading covers the process of loading a finetuned embedding model back into memory after training and wrapping it in LlamaIndex's BaseEmbedding interface. This step bridges the gap between the training output (a saved model on disk) and the LlamaIndex embedding API that powers retrieval pipelines.

Concept: Model Serialization and Deserialization

After finetuning, the trained model must be serialized to disk and later deserialized for inference. The serialization format depends on the finetuning approach:

Sentence Transformers Full Model

A fully finetuned Sentence Transformer model is saved as a complete model directory containing:

config.json -- Model architecture configuration
pytorch_model.bin or model.safetensors -- Model weights
tokenizer_config.json, vocab.txt, etc. -- Tokenizer files
sentence_bert_config.json -- Sentence Transformers-specific configuration

This is a self-contained model that can be loaded by any Sentence Transformers-compatible loader.
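
As a quick sanity check before loading, the directory layout listed above can be verified with a few lines of standard-library Python. This is a minimal sketch: the helper name `looks_like_saved_model` is hypothetical, and treating `config.json` plus one weights file as the minimum is an assumption of this example, not a rule enforced by Sentence Transformers.

```python
from pathlib import Path

# File names taken from the list above. Treating them as the required minimum
# is an assumption of this sketch, not a check performed by any library.
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def looks_like_saved_model(model_dir: str) -> bool:
    """Return True if model_dir has config.json and at least one weights file."""
    root = Path(model_dir)
    if not root.is_dir():
        return False
    has_config = (root / "config.json").is_file()
    has_weights = any((root / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights
```

A check like this only confirms that the expected files are present; it says nothing about whether the weights are valid or match the configuration.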

Adapter Weights

An adapter-based finetuned model saves only the adapter layer weights:

The adapter weights file (e.g., a PyTorch state dict)
The base model is not saved -- it must be available separately at load time

This results in a much smaller saved artifact, but requires the base model to be accessible.
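
To give a feel for how small the adapter artifact can be, the following back-of-the-envelope sketch estimates the size of a single square linear adapter. The assumptions are illustrative: one `embed_dim x embed_dim` weight matrix, an optional bias vector, and 32-bit floats. The function name `linear_adapter_size_bytes` is hypothetical, and real adapters may have more layers or a different precision.

```python
def linear_adapter_size_bytes(
    embed_dim: int, bias: bool = False, bytes_per_param: int = 4
) -> int:
    """Estimate the raw weight size of one square linear adapter layer."""
    params = embed_dim * embed_dim  # weight matrix
    if bias:
        params += embed_dim  # optional bias vector
    return params * bytes_per_param


for dim in (384, 768):
    size = linear_adapter_size_bytes(dim)
    print(f"dim={dim}: {size / 1e6:.2f} MB")
# dim=384: 0.59 MB
# dim=768: 2.36 MB
```

Under these assumptions the adapter stays well below the size of a full transformer checkpoint, which is why only the base model dominates storage and distribution cost.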

Concept: Wrapping in the BaseEmbedding Interface

LlamaIndex uses the BaseEmbedding abstract class as the standard interface for all embedding models. A finetuned model must be wrapped in a class that implements this interface to be used in LlamaIndex pipelines.

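Subclasses of BaseEmbedding provide the following hooks:

_get_text_embedding(text: str) -> List[float] -- Embed a single document (required)
_get_query_embedding(query: str) -> List[float] -- Embed a single query (required)
_aget_query_embedding(query: str) -> List[float] -- Async counterpart of the query hook (required)
_get_text_embeddings(texts: List[str]) -> List[List[float]] -- Batch document embedding (optional; the default calls _get_text_embedding once per text, and callers reach it through the public get_text_embedding_batch() method)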
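
The shape of this contract can be illustrated without any dependency. The sketch below is a simplified, synchronous stand-in and is not LlamaIndex's actual BaseEmbedding class: `EmbeddingInterfaceSketch` and `ToyEmbedding` are hypothetical names, and the toy "embedding" (text length and vowel count) exists only to make the example runnable.

```python
from abc import ABC, abstractmethod
from typing import List


class EmbeddingInterfaceSketch(ABC):
    """Stand-in showing the shape of the interface; not the real BaseEmbedding."""

    @abstractmethod
    def _get_text_embedding(self, text: str) -> List[float]:
        """Embed a single document."""

    @abstractmethod
    def _get_query_embedding(self, query: str) -> List[float]:
        """Embed a single query."""

    def get_text_embedding_batch(self, texts: List[str]) -> List[List[float]]:
        # Fallback batch behaviour: embed one text at a time.
        return [self._get_text_embedding(t) for t in texts]


class ToyEmbedding(EmbeddingInterfaceSketch):
    """Hypothetical wrapper that 'embeds' text as [length, vowel count]."""

    def _get_text_embedding(self, text: str) -> List[float]:
        vowels = sum(ch in "aeiou" for ch in text.lower())
        return [float(len(text)), float(vowels)]

    def _get_query_embedding(self, query: str) -> List[float]:
        # This toy model treats queries and documents identically.
        return self._get_text_embedding(query)
```

A wrapper around a finetuned model would follow the same pattern, replacing the toy arithmetic with a call into the loaded model.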

Concept: Full Model Loading (Sentence Transformers)

For Sentence Transformers finetuning, the loading process uses the "local:" prefix convention:

The get_finetuned_model() method constructs the string "local:{model_output_path}"
This string is passed to resolve_embed_model(), which recognizes the "local:" prefix
The resolver loads the model from the local directory and wraps it in LlamaIndex's embedding interface

This approach ensures the finetuned model is treated identically to any other embedding model in the system.
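
The steps above can be sketched as follows. The helper names `local_identifier` and `load_finetuned` are hypothetical. The import path `llama_index.core.embeddings.utils` assumes a llama-index 0.10+ package layout, and actually running `load_finetuned` also assumes the HuggingFace embedding integration is installed.

```python
def local_identifier(model_output_path: str) -> str:
    """Build the 'local:' identifier for a saved model directory."""
    return "local:" + model_output_path


def load_finetuned(model_output_path: str):
    """Resolve a saved model directory into an embedding object.

    Requires llama-index; the import is deferred so that local_identifier()
    can be used without the dependency installed.
    """
    # Assumed import path for llama-index 0.10+.
    from llama_index.core.embeddings.utils import resolve_embed_model

    return resolve_embed_model(local_identifier(model_output_path))
```

Usage would look like `embed_model = load_finetuned("exp_finetune")`, where `exp_finetune` is a placeholder for whatever output directory was used during training.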

Concept: Adapter Model Loading

For adapter-based finetuning, the loading process is different:

The get_finetuned_model() method creates an AdapterEmbeddingModel
This model combines the original base embedding model with the trained adapter weights
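At inference time, queries are first embedded by the base model, then transformed by the adapter; by default, document embeddings are returned as the base model produced them, so an existing index does not need to be re-embedded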

The AdapterEmbeddingModel composes both models transparently, presenting a single BaseEmbedding interface to consumers.
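
The composition can be illustrated with a dependency-free toy. This is a sketch of the idea, not the AdapterEmbeddingModel implementation: `AdapterSketch` and `apply_linear` are hypothetical names, the adapter is assumed to be a single linear map, and applying it to queries only (leaving document embeddings untouched) is an assumption of this example.

```python
from typing import Callable, List

Vector = List[float]


def apply_linear(weights: List[Vector], vec: Vector) -> Vector:
    """Multiply a weight matrix (list of rows) by an embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class AdapterSketch:
    """Toy composition of a base embedding function and a linear adapter."""

    def __init__(self, base_embed: Callable[[str], Vector], weights: List[Vector]):
        self._base_embed = base_embed
        self._weights = weights

    def get_query_embedding(self, query: str) -> Vector:
        # Base model first, then the adapter transform.
        return apply_linear(self._weights, self._base_embed(query))

    def get_text_embedding(self, text: str) -> Vector:
        # Documents pass through the base model only in this sketch.
        return self._base_embed(text)
```

Because the consumer only sees `get_query_embedding` and `get_text_embedding`, it does not need to know that two components are involved.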

Concept: Model Resolution

LlamaIndex's resolve_embed_model() function handles various model identifier formats:

Identifier	Behavior
local:{path}	Loads a Sentence Transformer model from a local directory
local:{model_name}	Loads the named model from HuggingFace Hub
"default"	Loads the default embedding model (OpenAI text-embedding-ada-002)

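Loading a fully finetuned Sentence Transformers model uses the "local:" prefix to point to the output directory where the model was saved during training. Adapter-based models do not go through this path; they are constructed directly as an AdapterEmbeddingModel.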
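
The prefix handling can be pictured as a small classification step. The sketch below is illustrative only and does not reproduce the resolve_embed_model() source: `classify_identifier` is a hypothetical helper, and how identifiers in the "other" bucket are treated depends on the installed LlamaIndex version.

```python
from typing import Optional, Tuple


def classify_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Split an embedding model identifier into (kind, payload)."""
    if identifier == "default":
        return ("default", None)
    # Split on the first colon only, so paths containing ':' stay intact.
    prefix, sep, rest = identifier.partition(":")
    if sep and prefix == "local":
        return ("local", rest)
    # Anything else is left for the real resolver to accept or reject.
    return ("other", identifier)
```

For example, `classify_identifier("local:exp_finetune")` yields `("local", "exp_finetune")`, where `exp_finetune` is a placeholder directory name.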

Concept: Full Model vs Adapter Comparison

Aspect	Full Model (Sentence Transformers)	Adapter Model
Saved artifact	Complete model directory (hundreds of MB)	Adapter weights only (KB to MB)
Load time dependency	Self-contained, no external model needed	Requires the original base model
Inference overhead	Same as any Sentence Transformer model	Base model inference + adapter forward pass
Portability	Fully portable	Must ship base model + adapter
LlamaIndex wrapper	`resolve_embed_model("local:path")`	`AdapterEmbeddingModel(base_model, adapter_path)`

Knowledge Sources

LlamaIndex Embedding Finetuning Guide
Sentence Transformers Saving and Loading

Metadata

Machine Learning, Embeddings, Finetuning, Model Serialization, LlamaIndex

Implementation:Run_llama_Llama_index_EmbeddingFinetuneEngine_Get_Model

2026-02-11 00:00 GMT
