Principle:Run llama Llama index Finetuned Model Integration
Overview
Finetuned Model Integration is the final step in the LLM finetuning workflow, where the newly trained model is plugged into the application pipeline to replace or augment the original model. LlamaIndex's Settings singleton provides a centralized configuration point that enables hot-swapping the LLM across the entire application with a single assignment. This design pattern means that switching from a base model to a finetuned model requires no changes to query engines, response synthesizers, or any other pipeline components.
The integration step is deliberately simple: the finetuned model is a drop-in replacement because it implements the same LLM interface as every other model in LlamaIndex. This interchangeability is fundamental to the framework's composability and enables patterns such as A/B testing and gradual rollouts.
Hot-Swapping via Settings Singleton
The Settings object is a module-level singleton (an instance of the _Settings dataclass) that serves as the global configuration store for LlamaIndex. Its llm property provides getter/setter access to the globally configured LLM:
from llama_index.core import Settings
from llama_index.finetuning import OpenAIFinetuneEngine

# Before: using base model
# Settings.llm defaults to the resolved default LLM

# After finetuning: swap in the finetuned model
engine = OpenAIFinetuneEngine(
    base_model="gpt-3.5-turbo",
    data_path="training_data.jsonl",
    start_job_id="ftjob-abc123",  # resume from an existing finetuning job
)
ft_llm = engine.get_finetuned_model(temperature=0.3)
Settings.llm = ft_llm

# All subsequent pipeline operations now use the finetuned model.
# query_engine is assumed to be built after the assignment above,
# e.g. with index.as_query_engine().
response = query_engine.query("What is retrieval augmented generation?")
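The key insight is that no component code needs to change. Components that read Settings.llm when they run pick up the new model automatically. Components that resolve their LLM at construction time, such as a query engine created with index.as_query_engine(), keep the LLM they were built with, so create them after assigning Settings.llm or pass llm= explicitly.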
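Whether a swap is visible to a component depends on when that component reads the global value. The following is a framework-independent toy sketch of that distinction; ToySettings, LateBindingEngine, EarlyBindingEngine, and the two stub LLM functions are hypothetical names invented for illustration and are not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A toy "LLM": any callable from prompt to text (hypothetical stand-in).
LLMFn = Callable[[str], str]


@dataclass
class ToySettings:
    """Minimal global configuration holder (illustrative only)."""

    llm: Optional[LLMFn] = None


settings = ToySettings()  # module-level singleton


class LateBindingEngine:
    """Looks up the global LLM on every call, so it follows swaps."""

    def query(self, prompt: str) -> str:
        return settings.llm(prompt)


class EarlyBindingEngine:
    """Captures the global LLM once, at construction time."""

    def __init__(self) -> None:
        self._llm = settings.llm

    def query(self, prompt: str) -> str:
        return self._llm(prompt)


def base_llm(prompt: str) -> str:
    return f"[base] {prompt}"


def finetuned_llm(prompt: str) -> str:
    return f"[finetuned] {prompt}"


settings.llm = base_llm
late = LateBindingEngine()
early = EarlyBindingEngine()

settings.llm = finetuned_llm  # the "hot swap"

late_answer = late.query("What is RAG?")
early_answer = early.query("What is RAG?")
print(late_answer)   # [finetuned] What is RAG?
print(early_answer)  # [base] What is RAG?
```

In this toy model, only the engine that reads the global on each call follows the swap; the one that stored its LLM at construction has to be rebuilt.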
The LLMType Union
The Settings.llm setter accepts values of type LLMType = Union[LLM, str], providing flexibility in how the LLM is specified:
| Input Type | Example | Behavior |
|---|---|---|
| LLM instance | Settings.llm = ft_llm | Directly assigns the LLM object |
| str model name | Settings.llm = "gpt-4" | Calls resolve_llm() to create the appropriate LLM instance |
For finetuned model integration, you will typically pass the LLM instance returned by get_finetuned_model(), which is already configured with the finetuned model ID and any custom inference parameters.
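The general idea of a setter that accepts either an instance or a string can be sketched as follows. This is a simplified model under stated assumptions, not LlamaIndex's resolve_llm(): ToyLLM, toy_resolve, the _FACTORIES registry, and the model ID "ft:toy-model:example" are all made up for illustration, and the real resolution rules for strings may differ.

```python
from typing import Callable, Dict, Optional, Union


class ToyLLM:
    """Hypothetical stand-in for an LLM object."""

    def __init__(self, model: str, temperature: float = 0.1) -> None:
        self.model = model
        self.temperature = temperature


ToyLLMType = Union[ToyLLM, str]

# Hypothetical registry of string specs; real resolution rules differ.
_FACTORIES: Dict[str, Callable[[], ToyLLM]] = {
    "default": lambda: ToyLLM("toy-default"),
}


def toy_resolve(spec: ToyLLMType) -> ToyLLM:
    """Return instances unchanged; build an instance from a known string."""
    if isinstance(spec, ToyLLM):
        return spec
    if spec in _FACTORIES:
        return _FACTORIES[spec]()
    raise ValueError(f"Unknown LLM spec: {spec!r}")


class ToySettings:
    def __init__(self) -> None:
        self._llm: Optional[ToyLLM] = None

    @property
    def llm(self) -> Optional[ToyLLM]:
        return self._llm

    @llm.setter
    def llm(self, value: ToyLLMType) -> None:
        # Normalize at assignment time so readers always see an instance.
        self._llm = toy_resolve(value)


settings = ToySettings()

ft_llm = ToyLLM("ft:toy-model:example", temperature=0.3)  # made-up model ID
settings.llm = ft_llm
same_object = settings.llm is ft_llm  # instance is stored as-is

settings.llm = "default"
resolved_model = settings.llm.model  # string was turned into an instance
```

Passing the instance keeps the inference parameters chosen at creation time (here, the temperature), which is why the instance form suits finetuned models.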
A/B Testing Pattern
The drop-in replacement nature of finetuned models enables straightforward A/B testing between the original and finetuned models:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Assumes `engine` (the finetune engine) and `index` already exist.

# Original model
original_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
# Finetuned model
ft_llm = engine.get_finetuned_model(temperature=0.3)

# Test with original. The query engine is rebuilt after each swap because
# as_query_engine() resolves Settings.llm at the time it is called.
Settings.llm = original_llm
response_original = index.as_query_engine().query("What is RAG?")

# Test with finetuned
Settings.llm = ft_llm
response_finetuned = index.as_query_engine().query("What is RAG?")

# Compare responses
print("Original:", response_original)
print("Finetuned:", response_finetuned)
This pattern is especially valuable during the evaluation phase, where you want to measure whether the finetuned model actually improves response quality for your specific use case.
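For more than one question, the comparison can be wrapped in a small helper. The following is a minimal sketch using only the standard library; compare_models and the two stub functions are hypothetical, and in a real pipeline each callable would wrap a query engine configured with the corresponding LLM.

```python
from typing import Callable, Dict, List

AskFn = Callable[[str], str]


def compare_models(
    models: Dict[str, AskFn], questions: List[str]
) -> List[Dict[str, str]]:
    """Ask every model every question and return one row per question."""
    rows = []
    for question in questions:
        row = {"question": question}
        for name, ask in models.items():
            row[name] = ask(question)
        rows.append(row)
    return rows


# Stub "models" so the sketch runs without network access.
def original_stub(question: str) -> str:
    return f"original answer to: {question}"


def finetuned_stub(question: str) -> str:
    return f"finetuned answer to: {question}"


rows = compare_models(
    {"original": original_stub, "finetuned": finetuned_stub},
    ["What is RAG?", "What is finetuning?"],
)

for row in rows:
    print(row["question"])
    print("  Original: ", row["original"])
    print("  Finetuned:", row["finetuned"])
```

Keeping the rows as plain dictionaries makes it easy to hand them to whatever scoring step the evaluation phase uses.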
Component-Level Override
While Settings.llm provides global configuration, individual components can also be configured with their own LLM instance, overriding the global setting:
from llama_index.core import Settings, VectorStoreIndex

# Global setting uses original model
Settings.llm = original_llm

# This specific query engine uses the finetuned model.
# `documents` is assumed to be a list of already loaded documents.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=ft_llm)
This enables gradual migration strategies where specific pipeline stages use the finetuned model while others retain the original.
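The override-or-fallback idea behind a gradual migration can be sketched without any framework. In this toy example, Stage, GLOBAL_LLM, and the stub LLM functions are hypothetical names; a stage uses an explicitly supplied LLM if one is given and otherwise falls back to the global default.

```python
from typing import Callable, Optional

LLMFn = Callable[[str], str]


def original_llm(text: str) -> str:
    return f"original({text})"


def finetuned_llm(text: str) -> str:
    return f"finetuned({text})"


# Plays the role of the globally configured LLM.
GLOBAL_LLM: LLMFn = original_llm


class Stage:
    """A pipeline stage with an optional per-stage LLM override."""

    def __init__(self, name: str, llm: Optional[LLMFn] = None) -> None:
        self.name = name
        # An explicit llm wins; otherwise fall back to the global default.
        self.llm = llm if llm is not None else GLOBAL_LLM

    def run(self, text: str) -> str:
        return self.llm(text)


# Only the synthesis stage has been migrated to the finetuned model.
rewrite = Stage("rewrite")
synthesize = Stage("synthesize", llm=finetuned_llm)

result = synthesize.run(rewrite.run("What is RAG?"))
print(result)  # finetuned(original(What is RAG?))
```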
Lazy Initialization
The Settings.llm getter implements lazy initialization -- if no LLM has been explicitly set, it calls resolve_llm("default") to create a default LLM on first access. The getter also propagates the global callback manager to the LLM, ensuring that any monitoring or logging handlers are automatically applied.
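The getter behavior described above can be modeled with a property. This is a simplified sketch, not the actual _Settings implementation: ToyLLM, ToySettings, the default_builds counter, and the list used as a callback manager are all hypothetical stand-ins.

```python
from typing import List, Optional


class ToyLLM:
    """Hypothetical stand-in for an LLM with a callback manager slot."""

    def __init__(self, model: str) -> None:
        self.model = model
        self.callback_manager: Optional[List[str]] = None


class ToySettings:
    def __init__(self) -> None:
        self._llm: Optional[ToyLLM] = None
        # A plain list stands in for the global callback manager.
        self.callback_manager: List[str] = ["logging-handler"]
        self.default_builds = 0  # counts how often the default was created

    @property
    def llm(self) -> ToyLLM:
        if self._llm is None:
            # Lazy initialization: build the default only on first access.
            self.default_builds += 1
            self._llm = ToyLLM("toy-default")
        # Propagate the global callback manager on every access.
        self._llm.callback_manager = self.callback_manager
        return self._llm

    @llm.setter
    def llm(self, value: ToyLLM) -> None:
        self._llm = value


settings = ToySettings()
builds_before_access = settings.default_builds

first = settings.llm
second = settings.llm  # reuses the instance created above

ft_llm = ToyLLM("ft:toy-model:example")  # made-up model ID
settings.llm = ft_llm
swapped = settings.llm  # the getter attaches the callback manager
```

Because propagation happens in the getter, a model assigned later receives the same handlers the first time it is read, without any extra wiring at the call site.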
Key Considerations
- Singleton scope: Settings is process-global; changing it affects all components in the same process
- Thread safety: The Settings singleton is not thread-safe; in multi-threaded applications, prefer component-level LLM configuration
- Callback propagation: The Settings getter automatically assigns the global callback manager to the LLM, ensuring observability is maintained after swapping models
- Cost implications: Finetuned models may have different per-token pricing than base models; monitor usage after integration
- Inference latency: Finetuned models generally have the same latency as their base model since they run on the same infrastructure