
Principle:Run llama Llama index Finetuned Embedding Loading

From Leeroopedia
Revision as of 17:35, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Run_llama_Llama_index_Finetuned_Embedding_Loading.md)

Overview

Finetuned Embedding Loading covers the process of loading a finetuned embedding model back into memory after training and wrapping it in LlamaIndex's BaseEmbedding interface. This step bridges the gap between the training output (a saved model on disk) and the LlamaIndex embedding API that powers retrieval pipelines.

Concept: Model Serialization and Deserialization

After finetuning, the trained model must be serialized to disk and later deserialized for inference. The serialization format depends on the finetuning approach:

Sentence Transformers Full Model

A fully finetuned Sentence Transformer model is saved as a complete model directory containing:

  • config.json -- Model architecture configuration
  • pytorch_model.bin or model.safetensors -- Model weights
  • tokenizer_config.json, vocab.txt, etc. -- Tokenizer files
  • sentence_bert_config.json -- Sentence Transformers-specific configuration

This is a self-contained model that can be loaded by any Sentence Transformers-compatible loader.
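Before loading, it can help to confirm that the output directory actually holds a complete model. The following is a minimal sketch using only the standard library; the helper name missing_model_files is hypothetical, and it checks only the configuration and weights files named in the list above, not the tokenizer files.

```python
from pathlib import Path
from typing import List

# Either weights file is acceptable, per the list above.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing
```

An empty list means the basic files are present; it does not prove that the weights are valid or that they match the configuration.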

Adapter Weights

An adapter-based finetuned model saves only the adapter layer weights:

  • The adapter weights file (e.g., a PyTorch state dict)
  • The base model is not saved -- it must be available separately at load time

This results in a much smaller saved artifact, but requires the base model to be accessible.
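To get a feel for how small the artifact can be, the arithmetic below estimates the size of a single linear adapter. It is a rough sketch under stated assumptions: the adapter is one square weight matrix (optionally with a bias), weights are stored as 32-bit floats, and the 384-dimensional embedding size is only an illustrative value, not something taken from the training output.

```python
def linear_adapter_size_bytes(dim: int, bias: bool = False, bytes_per_param: int = 4) -> int:
    """Approximate storage for a dim x dim linear adapter (file overhead ignored)."""
    params = dim * dim + (dim if bias else 0)
    return params * bytes_per_param


size = linear_adapter_size_bytes(384)  # 384 is an assumed embedding dimension
print(f"{size / 1024:.0f} KiB")  # prints "576 KiB"
```

Deeper adapters (for example, a two-layer network) scale this figure by their own parameter counts, but it stays far below the size of a full model checkpoint.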

Concept: Wrapping in the BaseEmbedding Interface

LlamaIndex uses the BaseEmbedding abstract class as the standard interface for all embedding models. A finetuned model must be wrapped in a class that implements this interface to be used in LlamaIndex pipelines.

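A concrete BaseEmbedding subclass implements these hooks:

  • _get_text_embedding(text: str) -> List[float] -- Embed a single document
  • _get_query_embedding(query: str) -> List[float] -- Embed a single query
  • _aget_query_embedding(query: str) -> List[float] -- Async counterpart of the query hook
  • _get_text_embeddings(texts: List[str]) -> List[List[float]] -- Batch document embedding; optional, since the default implementation calls _get_text_embedding once per text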
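The shape of such a wrapper can be sketched without any model at all. The class below is a duck-typed stand-in, not a real BaseEmbedding subclass: ToyEmbedding and its hash-based vectors are hypothetical placeholders for a finetuned model's forward pass, and the batch hook is spelled _get_text_embeddings, which is an assumption about recent LlamaIndex releases.

```python
import hashlib
from typing import List


class ToyEmbedding:
    """Duck-typed stand-in; a real wrapper would subclass BaseEmbedding."""

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Deterministic placeholder for a real model forward pass.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.dim]]

    def _get_text_embedding(self, text: str) -> List[float]:
        return self._embed(text)

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._embed(query)

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        # Naive batch: one call per text. A real model would batch the forward pass.
        return [self._get_text_embedding(t) for t in texts]
```

Keeping the query and document hooks separate matters for models that embed the two differently, for instance by prepending a query instruction.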

Concept: Full Model Loading (Sentence Transformers)

For Sentence Transformers finetuning, the loading process uses the "local:" prefix convention:

  1. The get_finetuned_model() method constructs the string "local:{model_output_path}"
  2. This string is passed to resolve_embed_model(), which recognizes the "local:" prefix
  3. The resolver loads the model from the local directory and wraps it in LlamaIndex's embedding interface

This approach ensures the finetuned model is treated identically to any other embedding model in the system.
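The steps above can be sketched as follows. The helper names local_model_id and load_finetuned_model are hypothetical, and the import path llama_index.core.embeddings.utils is assumed from recent LlamaIndex releases. The import is deferred so that the string-building helper works even when LlamaIndex is not installed.

```python
def local_model_id(model_output_path: str) -> str:
    """Build the identifier that the resolver recognizes as a local model."""
    return f"local:{model_output_path}"


def load_finetuned_model(model_output_path: str):
    """Resolve the saved model into a LlamaIndex embedding object."""
    # Deferred import: requires LlamaIndex and its HuggingFace embedding
    # integration to be installed at call time.
    from llama_index.core.embeddings.utils import resolve_embed_model

    return resolve_embed_model(local_model_id(model_output_path))
```

A call such as load_finetuned_model("exp_finetune") would then return an object usable wherever an embedding model is expected; the directory name here is only an example.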

Concept: Adapter Model Loading

For adapter-based finetuning, the loading process is different:

  1. The get_finetuned_model() method creates an AdapterEmbeddingModel
  2. This model combines the original base embedding model with the trained adapter weights
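  3. At inference time, the input is first embedded by the base model, and the adapter then transforms that embedding. By default only queries pass through the adapter, so previously computed document embeddings do not need to be regenerated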

The AdapterEmbeddingModel composes both models transparently, presenting a single BaseEmbedding interface to consumers.
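The composition can be illustrated with a toy example in plain Python. This is a conceptual sketch, not the AdapterEmbeddingModel implementation: ToyAdapterEmbedding and apply_linear_adapter are hypothetical names, the adapter is assumed to be a single linear map, and only the query path is shown.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def apply_linear_adapter(weights: Sequence[Sequence[float]], embedding: Sequence[float]) -> Vector:
    """Multiply the adapter matrix by the base embedding (no bias)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


class ToyAdapterEmbedding:
    """Composes a base embedding function with trained adapter weights."""

    def __init__(self, base_embed: Callable[[str], Vector], weights: Sequence[Sequence[float]]) -> None:
        self._base_embed = base_embed
        self._weights = weights

    def get_query_embedding(self, query: str) -> Vector:
        base_vector = self._base_embed(query)  # step 1: base model
        return apply_linear_adapter(self._weights, base_vector)  # step 2: adapter
```

With an identity matrix as the weights, the composed model returns exactly the base embedding, which is a convenient sanity check before loading trained weights.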

Concept: Model Resolution

LlamaIndex's resolve_embed_model() function handles various model identifier formats:

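| Identifier | Behavior |
|---|---|
| local:{path} | Loads a Sentence Transformer model from a local directory |
| local:{model_name} | Loads the named model from the HuggingFace Hub; in recent versions a bare model name without the prefix is not accepted as a string identifier |
| "default" | Loads the default embedding model (OpenAI text-embedding-ada-002) |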

The finetuned model loading uses the "local:" prefix to point to the output directory where the finetuned model was saved during training.
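Prefix handling of this kind comes down to splitting the identifier at its first colon. The sketch below is illustrative only and is not the body of resolve_embed_model(); split_model_id is a hypothetical helper.

```python
from typing import Optional, Tuple


def split_model_id(model_id: str) -> Tuple[str, Optional[str]]:
    """Split "prefix:target" at the first colon; target is None if there is no colon."""
    prefix, sep, rest = model_id.partition(":")
    return (prefix, rest) if sep else (model_id, None)
```

Splitting only at the first colon keeps paths that themselves contain a colon intact, so "local:C:/models/ft" yields the prefix "local" and the target "C:/models/ft".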

Concept: Full Model vs Adapter Comparison

| Aspect | Full Model (Sentence Transformers) | Adapter Model |
|---|---|---|
| Saved artifact | Complete model directory (hundreds of MB) | Adapter weights only (KB to MB) |
| Load time dependency | Self-contained, no external model needed | Requires the original base model |
| Inference overhead | Same as any Sentence Transformer model | Base model inference + adapter forward pass |
| Portability | Fully portable | Must ship base model + adapter |
| LlamaIndex wrapper | resolve_embed_model("local:path") | AdapterEmbeddingModel(base_model, adapter_path) |
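The difference in load-time dependencies can be made concrete with a small dispatcher. This is a sketch under assumptions: load_finetuned and its kind argument are hypothetical, and the import paths (llama_index.core.embeddings.utils and llama_index.embeddings.adapter) are assumed from recent LlamaIndex packaging and may differ between versions.

```python
def load_finetuned(kind: str, model_path: str, base_embed_model=None):
    """Load a finetuned embedding model saved as a full model or as an adapter."""
    if kind == "full":
        # Self-contained: the directory alone is enough.
        from llama_index.core.embeddings.utils import resolve_embed_model

        return resolve_embed_model(f"local:{model_path}")
    if kind == "adapter":
        # The adapter is useless without the model it was trained on.
        if base_embed_model is None:
            raise ValueError("adapter loading requires the original base model")
        from llama_index.embeddings.adapter import AdapterEmbeddingModel

        return AdapterEmbeddingModel(base_embed_model, model_path)
    raise ValueError(f"unknown kind: {kind!r}")
```

Failing early when the base model is missing surfaces the portability constraint at load time rather than at the first query.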

Knowledge Sources

  • LlamaIndex Embedding Finetuning Guide
  • Sentence Transformers Saving and Loading

Metadata

Machine Learning, Embeddings, Finetuning, Model Serialization, LlamaIndex

Implementation:Run_llama_Llama_index_EmbeddingFinetuneEngine_Get_Model

2026-02-11 00:00 GMT
