Principle:Run llama Llama index Embedding Model Integration

Overview

Embedding Model Integration covers how finetuned embedding models are plugged into LlamaIndex RAG (Retrieval-Augmented Generation) pipelines. This involves assigning the finetuned model to the global Settings object, understanding the embedding type system, and evaluating the retrieval quality improvement that finetuning provides.

Concept: Swapping Default Embeddings with Domain-Specific Ones

LlamaIndex uses a global Settings singleton to manage the default embedding model. After finetuning, the key integration step is replacing the default model with the finetuned one:

from llama_index.core import Settings

# Before: default embedding (e.g., OpenAI text-embedding-ada-002)
# After: assign finetuned model
Settings.embed_model = finetuned_embed_model

Once assigned, LlamaIndex components that rely on embeddings -- index construction, query engines, retrievers -- pick up the finetuned model as their default when they are created without an explicit embed_model argument, so no per-component configuration is needed.

Concept: The Settings Singleton Pattern

LlamaIndex's Settings object uses a singleton pattern with lazy initialization:

Lazy default -- If no embedding model is set, the first access triggers loading of the default model
Global scope -- One Settings instance serves the entire application
Type resolution -- The setter accepts both BaseEmbedding instances and string identifiers

This design simplifies integration because you set the model once, and it propagates everywhere.
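The lazy-default behaviour can be illustrated with a small standalone sketch. This is not LlamaIndex's source code: ToySettings, load_default_model, and the string "models" are hypothetical stand-ins, used only to show how a property can defer loading until first access and how a later assignment replaces the default.

```python
from typing import Callable, Optional


class ToySettings:
    """Toy stand-in for a global settings object; not LlamaIndex's source."""

    def __init__(self, default_factory: Callable[[], object]) -> None:
        self._default_factory = default_factory
        self._embed_model: Optional[object] = None

    @property
    def embed_model(self) -> object:
        # Lazy default: the factory runs only on first access, and only
        # if nothing has been assigned yet.
        if self._embed_model is None:
            self._embed_model = self._default_factory()
        return self._embed_model

    @embed_model.setter
    def embed_model(self, value: object) -> None:
        self._embed_model = value


load_calls = []


def load_default_model() -> str:
    """Hypothetical expensive loader; records each time it runs."""
    load_calls.append("load")
    return "default-model"


# One module-level instance plays the role of the singleton.
TOY_SETTINGS = ToySettings(load_default_model)

first = TOY_SETTINGS.embed_model   # triggers the one-time default load
second = TOY_SETTINGS.embed_model  # served from the cached value
TOY_SETTINGS.embed_model = "finetuned-model"  # explicit assignment replaces it
```

One consequence of this pattern is that assigning the finetuned model before anything reads the setting avoids loading the default model at all.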

Concept: EmbedType and Type Resolution

The Settings.embed_model property accepts EmbedType, which is defined as:

EmbedType = Union[BaseEmbedding, str]

This means you can assign:

A BaseEmbedding instance -- Direct assignment of the finetuned model object
A string identifier -- e.g., "local:finetuned_model", which is resolved via resolve_embed_model()

The setter passes the assigned value through resolve_embed_model(embed_model) before storing it. A BaseEmbedding instance comes back as the same object, while a string is converted to a BaseEmbedding instance.
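The two accepted input forms can be sketched with a toy resolver. ToyEmbedding, ToyEmbedType, and toy_resolve_embed_model are hypothetical names; the real resolve_embed_model() handles more cases than shown here, so treat this as an illustration of the dispatch idea only.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ToyEmbedding:
    """Hypothetical stand-in for a BaseEmbedding instance."""

    model_name: str


ToyEmbedType = Union[ToyEmbedding, str]


def toy_resolve_embed_model(embed_model: ToyEmbedType) -> ToyEmbedding:
    """Return an embedding object for either accepted input form."""
    if isinstance(embed_model, ToyEmbedding):
        # Already an instance: pass the same object through.
        return embed_model
    if isinstance(embed_model, str):
        # Strings of the form "local:<name>" are turned into an instance.
        prefix, sep, name = embed_model.partition(":")
        if sep and prefix == "local" and name:
            return ToyEmbedding(model_name=name)
        raise ValueError(f"unsupported identifier: {embed_model!r}")
    raise TypeError(
        f"expected ToyEmbedding or str, got {type(embed_model).__name__}"
    )


instance = ToyEmbedding("finetuned_model")
from_instance = toy_resolve_embed_model(instance)
from_string = toy_resolve_embed_model("local:finetuned_model")
```

Either way, the caller ends up holding an embedding object, which is why downstream components never need to care which form was assigned.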

Concept: Integration Points in RAG Pipelines

The finetuned embedding model affects multiple stages of a RAG pipeline:

Stage	How Embeddings Are Used	Impact of Finetuning
Index Construction	Documents are embedded and stored in a vector store	Better document representations for the target domain
Query Embedding	User queries are embedded for similarity search	Queries map more accurately to relevant domain documents
Retrieval	Cosine similarity between query and document embeddings	Improved retrieval precision and recall for domain-specific queries
Reranking	Some rerankers use embedding similarity	More meaningful similarity scores for domain content

Important: If the index was built with a different embedding model, it must be rebuilt with the finetuned model. Mixing embedding models between indexing and querying produces poor results because the vector spaces are not aligned.
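To make the retrieval row of the table concrete, here is a minimal sketch of cosine-similarity top-k search in plain Python. The vectors and document IDs are made up for illustration; a real pipeline would obtain both query and document vectors from the same embedding model, which is exactly why the index has to be rebuilt after swapping models.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to the query and keep the best k."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (illustrative values only).
doc_vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = top_k([1.0, 0.1], doc_vectors, k=2)
top_ids = [doc_id for doc_id, _ in results]
```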

Concept: Evaluating Retrieval Improvement

After integrating a finetuned model, evaluation is critical to confirm improvement. Common evaluation approaches:

Hit Rate

The fraction of queries where the correct document appears in the top-k retrieved results:

# For each query in the evaluation set:
# 1. Retrieve top-k documents using the finetuned embedding
# 2. Check if the ground truth document is in the result set
# hit_rate = correct_retrievals / total_queries
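The steps above could be written as a small function. This sketch assumes each query has exactly one ground-truth document ID and that `retrieve` is any callable returning a ranked list of IDs; both the callable and the toy data are hypothetical and stand in for a real retriever.

```python
from typing import Callable, Dict, List


def hit_rate(
    eval_set: Dict[str, str],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected ID appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(
        1
        for query, expected_id in eval_set.items()
        if expected_id in retrieve(query)[:k]
    )
    return hits / len(eval_set)


# Hand-written rankings, purely to exercise the function.
toy_rankings = {
    "q1": ["d1", "d2"],
    "q2": ["d9", "d3"],
    "q3": ["d4"],
}
toy_eval_set = {"q1": "d1", "q2": "d2", "q3": "d4"}

toy_hit_rate = hit_rate(toy_eval_set, lambda q: toy_rankings[q], k=2)
```

Here two of the three toy queries find their expected document, so the value is 2/3.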

Mean Reciprocal Rank (MRR)

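The average, over all evaluation queries, of the reciprocal rank of the first correct result. Higher is better; 1.0 means the correct document is always ranked first:

MRR = (1/|Q|) * sum_{i=1..|Q|} (1/rank_i)

where |Q| is the number of evaluation queries and rank_i is the 1-based position of the first relevant document for query i. A query whose relevant document is not retrieved is conventionally scored as 0.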
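A minimal sketch of the MRR computation, under the same assumptions as a typical single-answer evaluation set: one relevant document ID per query, a hypothetical `retrieve` callable returning ranked IDs, and a score of 0 for queries whose relevant document is missing from the top-k.

```python
from typing import Callable, Dict, List


def mean_reciprocal_rank(
    eval_set: Dict[str, str],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Average of 1/rank of the expected ID; misses contribute 0."""
    if not eval_set:
        return 0.0
    total = 0.0
    for query, expected_id in eval_set.items():
        ranked = retrieve(query)[:k]
        if expected_id in ranked:
            # list.index is 0-based, ranks are 1-based.
            total += 1.0 / (ranked.index(expected_id) + 1)
    return total / len(eval_set)


# Hand-written rankings, purely to exercise the function.
mrr_rankings = {
    "q1": ["d1", "d2"],  # expected at rank 1 -> 1.0
    "q2": ["d9", "d2"],  # expected at rank 2 -> 0.5
    "q3": ["d7"],        # expected missing   -> 0.0
}
mrr_eval_set = {"q1": "d1", "q2": "d2", "q3": "d4"}

toy_mrr = mean_reciprocal_rank(mrr_eval_set, lambda q: mrr_rankings[q], k=2)
```

The three toy queries contribute 1.0, 0.5, and 0.0, giving an MRR of 0.5.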

A/B Comparison

Compare the finetuned model against the base model on the same evaluation queries:

from llama_index.core import VectorStoreIndex, Settings

# Evaluate with base model
Settings.embed_model = base_embed_model
base_index = VectorStoreIndex.from_documents(documents)
base_retriever = base_index.as_retriever(similarity_top_k=5)

# Evaluate with finetuned model
Settings.embed_model = finetuned_embed_model
ft_index = VectorStoreIndex.from_documents(documents)
ft_retriever = ft_index.as_retriever(similarity_top_k=5)

# Compare retrieval results on evaluation queries
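The final comparison step might look like the following harness. It is a sketch under stated assumptions: each retriever is wrapped as a callable that maps a query string to a ranked list of IDs, each query has one expected ID, and the names score_retriever and compare_retrievers are hypothetical. The rankings are hand-written to exercise the code and are not measured results.

```python
from typing import Callable, Dict, List

Retrieve = Callable[[str], List[str]]


def score_retriever(
    eval_set: Dict[str, str], retrieve: Retrieve, k: int = 5
) -> Dict[str, float]:
    """Hit rate and MRR for one retriever, assuming one relevant ID per query."""
    if not eval_set:
        return {"hit_rate": 0.0, "mrr": 0.0}
    hits = 0
    reciprocal_sum = 0.0
    for query, expected_id in eval_set.items():
        ranked = retrieve(query)[:k]
        if expected_id in ranked:
            hits += 1
            reciprocal_sum += 1.0 / (ranked.index(expected_id) + 1)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": reciprocal_sum / n}


def compare_retrievers(
    eval_set: Dict[str, str], retrievers: Dict[str, Retrieve], k: int = 5
) -> Dict[str, Dict[str, float]]:
    """Score every named retriever on the same evaluation set."""
    return {
        name: score_retriever(eval_set, fn, k) for name, fn in retrievers.items()
    }


# Hand-written rankings standing in for two retrievers.
ab_eval_set = {"q1": "d1", "q2": "d2"}
ranking_a = {"q1": ["d3", "d1"], "q2": ["d5", "d6"]}
ranking_b = {"q1": ["d1", "d3"], "q2": ["d5", "d2"]}

report = compare_retrievers(
    ab_eval_set,
    {
        "retriever_a": lambda q: ranking_a[q],
        "retriever_b": lambda q: ranking_b[q],
    },
    k=2,
)
```

To plug in the LlamaIndex retrievers, one option is to wrap each as `lambda q: [n.node.node_id for n in base_retriever.retrieve(q)]`, assuming the evaluation set stores the expected node IDs.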

Concept: Per-Component Override

While Settings.embed_model sets the global default, you can also pass embedding models directly to individual components:

from llama_index.core import Settings, VectorStoreIndex

# Global default
Settings.embed_model = general_embed_model

# Per-index override with finetuned model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=finetuned_embed_model,
)

This is useful when different parts of your application need different embedding models (e.g., general vs. domain-specific).

Concept: End-to-End Finetuning Workflow

The complete embedding finetuning and integration workflow:

Prepare documents -- Load and chunk documents into TextNodes
Generate QA pairs -- Use generate_qa_embedding_pairs() with an LLM
Configure engine -- Create a SentenceTransformersFinetuneEngine with training data
Execute finetuning -- Call engine.finetune()
Load finetuned model -- Call engine.get_finetuned_model()
Integrate -- Assign to Settings.embed_model
Rebuild index -- Reconstruct the vector index with the new embedding model
Evaluate -- Compare retrieval quality against the baseline
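The steps above can be sketched as a single function. Several things here are assumptions rather than part of the original text: the function name finetune_and_integrate, the data_dir and output_path parameters, the base model ID "BAAI/bge-small-en", the use of SimpleDirectoryReader and SentenceSplitter for loading and chunking, and the availability of the llama-index-core, llama-index-finetuning, and sentence-transformers packages. The imports sit inside the function so that defining it does not require those packages to be installed.

```python
from typing import Any, Tuple


def finetune_and_integrate(
    data_dir: str, llm: Any, output_path: str = "finetuned_model"
) -> Tuple[Any, Any]:
    """Sketch of steps 1-7; evaluation (step 8) is left to the caller."""
    # Imported lazily: these packages are optional dependencies (assumption).
    from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.finetuning import (
        SentenceTransformersFinetuneEngine,
        generate_qa_embedding_pairs,
    )

    # 1. Prepare documents: load files and chunk them into nodes.
    documents = SimpleDirectoryReader(data_dir).load_data()
    nodes = SentenceSplitter().get_nodes_from_documents(documents)

    # 2. Generate QA pairs with the caller-supplied LLM.
    dataset = generate_qa_embedding_pairs(nodes, llm=llm)

    # 3. Configure the engine (base model ID is an assumed example).
    engine = SentenceTransformersFinetuneEngine(
        dataset,
        model_id="BAAI/bge-small-en",
        model_output_path=output_path,
    )

    # 4. Execute finetuning.
    engine.finetune()

    # 5. Load the finetuned model.
    finetuned_embed_model = engine.get_finetuned_model()

    # 6. Integrate: make it the global default.
    Settings.embed_model = finetuned_embed_model

    # 7. Rebuild the index so stored vectors come from the new model.
    index = VectorStoreIndex(nodes=nodes)

    return index, dataset
```

Returning the generated dataset alongside the index is a design choice for this sketch; for a fair evaluation, the queries used to compare against the baseline would normally come from a held-out split rather than the training pairs.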

Knowledge Sources

LlamaIndex Embedding Finetuning Guide
LlamaIndex Settings Configuration

Metadata

Machine Learning, Embeddings, RAG, LlamaIndex

Implementation:Run_llama_Llama_index_Settings_Embed_Model_Assignment

2026-02-11 00:00 GMT
