Principle:Run llama Llama index Embedding Model Integration
Overview
Embedding Model Integration covers how finetuned embedding models are plugged into LlamaIndex RAG (Retrieval-Augmented Generation) pipelines. This involves assigning the finetuned model to the global Settings object, understanding the embedding type system, and evaluating the retrieval quality improvement that finetuning provides.
Concept: Swapping Default Embeddings with Domain-Specific Ones
LlamaIndex uses a global Settings singleton to manage the default embedding model. After finetuning, the key integration step is replacing the default model with the finetuned one:
from llama_index.core import Settings
# Before: default embedding (e.g., OpenAI text-embedding-ada-002)
# After: assign finetuned model
Settings.embed_model = finetuned_embed_model
Once assigned, LlamaIndex components that rely on embeddings (index construction, query engines, retrievers) and are created without an explicit embed_model use the finetuned model, so no per-component configuration changes are needed.
Concept: The Settings Singleton Pattern
LlamaIndex's Settings object uses a singleton pattern with lazy initialization:
- Lazy default -- If no embedding model is set, the first access triggers loading of the default model
- Global scope -- One Settings instance serves the entire application
- Type resolution -- The setter accepts both BaseEmbedding instances and string identifiers
This design simplifies integration because you set the model once, and it propagates everywhere.
Concept: EmbedType and Type Resolution
The Settings.embed_model property accepts EmbedType. In simplified form (the library's own definition also admits LangChain embedding objects), it is defined as:
EmbedType = Union[BaseEmbedding, str]
This means you can assign:
- A BaseEmbedding instance -- Direct assignment of the finetuned model object
- A string identifier -- e.g., "local:finetuned_model", which is resolved via resolve_embed_model()
When a string is provided, the setter calls resolve_embed_model(embed_model) to convert it to a BaseEmbedding instance before storing it.
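The lazy default and the resolve-on-assignment behavior can be illustrated with a small stand-alone sketch. This is not LlamaIndex's implementation: ToySettings, ToyEmbedding, and resolve_toy_embed_model are hypothetical stand-ins, and the handling of the "local:" prefix is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ToyEmbedding:
    """Hypothetical stand-in for a BaseEmbedding instance."""
    model_name: str


ToyEmbedType = Union[ToyEmbedding, str]


def resolve_toy_embed_model(embed_model: ToyEmbedType) -> ToyEmbedding:
    """Turn a string identifier into a model object; pass instances through."""
    if isinstance(embed_model, str):
        # Assumed convention for this sketch: "local:<name>" names a local model.
        if embed_model.startswith("local:"):
            embed_model = embed_model.split(":", 1)[1]
        return ToyEmbedding(model_name=embed_model)
    return embed_model


class ToySettings:
    """Holds one embedding model, created lazily if never assigned."""

    def __init__(self) -> None:
        self._embed_model: Optional[ToyEmbedding] = None

    @property
    def embed_model(self) -> ToyEmbedding:
        if self._embed_model is None:
            # Lazy default: nothing is built until the first read.
            self._embed_model = resolve_toy_embed_model("default")
        return self._embed_model

    @embed_model.setter
    def embed_model(self, value: ToyEmbedType) -> None:
        # Strings are resolved to objects before being stored.
        self._embed_model = resolve_toy_embed_model(value)


# One module-level instance shared by every importer (the "global scope").
settings = ToySettings()
settings.embed_model = "local:finetuned_model"
```

Because resolution happens in the setter, readers of settings.embed_model always receive a model object, whichever form was assigned.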
Concept: Integration Points in RAG Pipelines
The finetuned embedding model affects multiple stages of a RAG pipeline:
| Stage | How Embeddings Are Used | Impact of Finetuning |
|---|---|---|
| Index Construction | Documents are embedded and stored in a vector store | Better document representations for the target domain |
| Query Embedding | User queries are embedded for similarity search | Queries map more accurately to relevant domain documents |
| Retrieval | Cosine similarity between query and document embeddings | Improved retrieval precision and recall for domain-specific queries |
| Reranking | Some rerankers use embedding similarity | More meaningful similarity scores for domain content |
Important: If the index was built with a different embedding model, it must be rebuilt with the finetuned model. Mixing embedding models between indexing and querying produces poor results because the vector spaces are not aligned.
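A toy illustration of why mixed models fail, under deliberately artificial assumptions: two hypothetical bag-of-words "models" (VOCAB_A and VOCAB_B) know the same words but assign them to different vector axes. Real embedding models differ in far subtler ways, but the effect on similarity search is of the same kind.

```python
import math
from typing import Dict, List

# Two hypothetical "models": same vocabulary, different axis assignment.
VOCAB_A: Dict[str, int] = {"cat": 0, "dog": 1, "car": 2}
VOCAB_B: Dict[str, int] = {"cat": 1, "dog": 2, "car": 0}


def embed(text: str, vocab: Dict[str, int]) -> List[float]:
    """Toy embedding: count each word on the axis its model assigns to it."""
    vec = [0.0] * len(vocab)
    for word in text.split():
        vec[vocab[word]] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_match(query_vec: List[float], doc_vecs: List[List[float]]) -> int:
    """Index of the document vector most similar to the query vector."""
    return max(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]))


docs = ["cat cat", "dog dog", "car car"]
index_a = [embed(d, VOCAB_A) for d in docs]  # index built with model A

# Query embedded with the same model as the index.
aligned = top_match(embed("cat", VOCAB_A), index_a)
# Query embedded with a different model than the index.
mismatched = top_match(embed("cat", VOCAB_B), index_a)

print(docs[aligned])     # cat cat
print(docs[mismatched])  # dog dog
```

The mismatched query still returns a confident top result; it is simply the wrong one, which is why the problem can go unnoticed without evaluation.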
Concept: Evaluating Retrieval Improvement
After integrating a finetuned model, evaluation is critical to confirm improvement. Common evaluation approaches:
Hit Rate
The fraction of queries where the correct document appears in the top-k retrieved results:
# For each query in the evaluation set:
# 1. Retrieve top-k documents using the finetuned embedding
# 2. Check if the ground truth document is in the result set
# hit_rate = correct_retrievals / total_queries
Mean Reciprocal Rank (MRR)
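Averages, over all queries, the reciprocal of the rank at which the first correct result appears (higher is better, with 1.0 meaning the correct result is always ranked first):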
MRR = (1/|Q|) * sum(1/rank_i)
where rank_i is the position of the first relevant document for query i.
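Both metrics can be computed from ranked result lists with a few lines of plain Python. The sketch below assumes each query has exactly one ground-truth document ID and that a miss contributes 0 to MRR (a common convention, not something stated in this document); the query and document IDs are made up for illustration.

```python
from typing import Dict, List


def hit_rate(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Fraction of queries whose expected document appears in the results."""
    if not expected:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for q, doc_id in expected.items() if doc_id in retrieved.get(q, []))
    return hits / len(expected)


def mean_reciprocal_rank(retrieved: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Mean of 1/rank of the expected document; a miss counts as 0."""
    if not expected:
        raise ValueError("evaluation set is empty")
    total = 0.0
    for q, doc_id in expected.items():
        ranked = retrieved.get(q, [])
        if doc_id in ranked:
            total += 1.0 / (ranked.index(doc_id) + 1)  # ranks are 1-based
    return total / len(expected)


# Hypothetical evaluation data: query ID -> ground-truth document ID.
expected = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}
# Hypothetical ranked top-k results per query.
retrieved = {
    "q1": ["d1", "d9"],              # hit at rank 1
    "q2": ["d7", "d2"],              # hit at rank 2
    "q3": ["d8", "d9"],              # miss
    "q4": ["d5", "d6", "d0", "d4"],  # hit at rank 4
}

print(hit_rate(retrieved, expected))              # 0.75
print(mean_reciprocal_rank(retrieved, expected))  # 0.4375
```

Hit rate ignores position within the top-k, while MRR rewards placing the correct document near the top, so the two are usually reported together.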
A/B Comparison
Compare the finetuned model against the base model on the same evaluation queries:
from llama_index.core import VectorStoreIndex, Settings
# Evaluate with base model
Settings.embed_model = base_embed_model
base_index = VectorStoreIndex.from_documents(documents)
base_retriever = base_index.as_retriever(similarity_top_k=5)
# Evaluate with finetuned model
Settings.embed_model = finetuned_embed_model
ft_index = VectorStoreIndex.from_documents(documents)
ft_retriever = ft_index.as_retriever(similarity_top_k=5)
# Compare retrieval results on evaluation queries
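The final comparison step could look like the following sketch. It assumes only that each retriever has a retrieve(query) method whose results expose .node.node_id, as LlamaIndex retrievers do; compare_retrievers, first_hit_rank, and the stub retriever are hypothetical helpers written for this example so that it runs without any library installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


def first_hit_rank(retriever: Any, query: str, expected_id: str) -> Optional[int]:
    """1-based rank of expected_id in the retriever's results, or None on a miss."""
    for rank, result in enumerate(retriever.retrieve(query), start=1):
        if result.node.node_id == expected_id:
            return rank
    return None


def compare_retrievers(base: Any, finetuned: Any, eval_set: Dict[str, str]) -> Dict[str, int]:
    """Count queries where the finetuned retriever ranks the target higher or lower."""
    summary = {"improved": 0, "worsened": 0, "unchanged": 0}
    for query, expected_id in eval_set.items():
        base_rank = first_hit_rank(base, query, expected_id)
        ft_rank = first_hit_rank(finetuned, query, expected_id)
        # A miss is treated as worse than any rank.
        base_key = float("inf") if base_rank is None else base_rank
        ft_key = float("inf") if ft_rank is None else ft_rank
        if ft_key < base_key:
            summary["improved"] += 1
        elif ft_key > base_key:
            summary["worsened"] += 1
        else:
            summary["unchanged"] += 1
    return summary


class StubRetriever:
    """Hypothetical stand-in that returns canned node IDs per query."""

    def __init__(self, results: Dict[str, List[str]]) -> None:
        self._results = results

    def retrieve(self, query: str) -> List[SimpleNamespace]:
        return [SimpleNamespace(node=SimpleNamespace(node_id=i)) for i in self._results[query]]


stub_base = StubRetriever({"q1": ["a", "b"], "q2": ["c"], "q3": ["x"]})
stub_ft = StubRetriever({"q1": ["b", "a"], "q2": ["c"], "q3": ["y"]})
summary = compare_retrievers(stub_base, stub_ft, {"q1": "b", "q2": "c", "q3": "x"})
print(summary)  # {'improved': 1, 'worsened': 1, 'unchanged': 1}
```

With real retrievers, the call would be compare_retrievers(base_retriever, ft_retriever, eval_set), where eval_set maps each query string to the node ID it should retrieve. A per-query breakdown like this complements aggregate metrics by showing whether finetuning hurt any queries.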
Concept: Per-Component Override
While Settings.embed_model sets the global default, you can also pass embedding models directly to individual components:
from llama_index.core import Settings, VectorStoreIndex
# Global default
Settings.embed_model = general_embed_model
# Per-index override with finetuned model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=finetuned_embed_model,
)
This is useful when different parts of your application need different embedding models (e.g., general vs. domain-specific).
Concept: End-to-End Finetuning Workflow
The complete embedding finetuning and integration workflow:
- Prepare documents -- Load and chunk documents into TextNodes
- Generate QA pairs -- Use generate_qa_embedding_pairs() with an LLM
- Configure engine -- Create a SentenceTransformersFinetuneEngine with training data
- Execute finetuning -- Call engine.finetune()
- Load finetuned model -- Call engine.get_finetuned_model()
- Integrate -- Assign to Settings.embed_model
- Rebuild index -- Reconstruct the vector index with the new embedding model
- Evaluate -- Compare retrieval quality against the baseline
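The steps above, except evaluation, might be wired together roughly as follows. Treat this as a sketch under several assumptions: the llama-index-finetuning package is installed, an LLM is already configured on Settings.llm, and the function name run_finetuning_workflow, the data_dir and output_path arguments, and the base model "BAAI/bge-small-en" are illustrative choices rather than anything prescribed here. Imports sit inside the function so the definition can be loaded without the optional packages present.

```python
def run_finetuning_workflow(data_dir: str, output_path: str = "finetuned_model"):
    """Prepare data, finetune, integrate, and rebuild the index (no evaluation)."""
    # Local imports: these packages are only needed when the workflow runs.
    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.finetuning import (
        SentenceTransformersFinetuneEngine,
        generate_qa_embedding_pairs,
    )

    # Prepare documents: load files and chunk them into nodes.
    documents = SimpleDirectoryReader(data_dir).load_data()
    nodes = SentenceSplitter().get_nodes_from_documents(documents)

    # Generate QA pairs: synthetic (question, chunk) training data from an LLM.
    train_dataset = generate_qa_embedding_pairs(nodes, llm=Settings.llm)

    # Configure engine: the base model ID is an assumption for this sketch.
    engine = SentenceTransformersFinetuneEngine(
        train_dataset,
        model_id="BAAI/bge-small-en",
        model_output_path=output_path,
    )

    # Execute finetuning, then load the resulting model.
    engine.finetune()
    finetuned_embed_model = engine.get_finetuned_model()

    # Integrate: make the finetuned model the global default.
    Settings.embed_model = finetuned_embed_model

    # Rebuild index: embed the nodes again with the new model.
    return VectorStoreIndex(nodes)
```

In practice the QA pairs would be generated separately for a training split and a held-out validation split, so that the evaluation step measures retrieval on queries the model was not finetuned on.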