Principle:Run llama Llama index Finetuned Model Retrieval
Overview
Finetuned Model Retrieval is the step that bridges finetuning job completion and actual model usage. Once a finetuning job succeeds, OpenAI assigns a unique model ID to the resulting model (e.g., ft:gpt-3.5-turbo:org-name::unique-id). This model ID must be extracted from the completed job and wrapped in an LLM object that can be used throughout the LlamaIndex pipeline. The get_finetuned_model() method handles both steps and returns a ready-to-use OpenAI LLM instance configured with the finetuned model.
From Job ID to Usable LLM
The retrieval process involves several conceptual steps:
| Step | Action | Detail |
|---|---|---|
| 1 | Retrieve latest job status | API call to get current FineTuningJob state |
| 2 | Validate job completion | Verify status is "succeeded" and model ID is available |
| 3 | Extract model ID | Read fine_tuned_model from the job object |
| 4 | Create LLM wrapper | Instantiate OpenAI(model=model_id, **kwargs) |
This sequence ensures that the returned LLM object is fully configured and ready for inference. The method raises clear errors if the job has not completed or has failed, preventing silent failures in downstream pipeline components.
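The four steps in the table can be sketched in a few lines. This is a minimal illustration rather than the library's source: JobSnapshot, retrieve_finetuned_llm, fetch_job, and make_llm are hypothetical names, with fetch_job standing in for the API call and make_llm for the OpenAI LLM constructor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class JobSnapshot:
    """Hypothetical stand-in for the fields read from a FineTuningJob."""

    id: str
    status: str
    fine_tuned_model: Optional[str]


def retrieve_finetuned_llm(
    fetch_job: Callable[[], JobSnapshot],
    make_llm: Callable[..., Any],
    **llm_kwargs: Any,
) -> Any:
    # Step 1: retrieve the latest job state.
    job = fetch_job()
    # Step 2: validate that the job finished successfully.
    if job.status != "succeeded":
        raise ValueError(f"Job {job.id} has status {job.status}, cannot get model")
    # Step 3: extract the model ID and make sure it is present.
    model_id = job.fine_tuned_model
    if model_id is None:
        raise ValueError(f"Job {job.id} does not have a finetuned model ID yet")
    # Step 4: wrap the model ID in an LLM object, forwarding extra kwargs.
    return make_llm(model=model_id, **llm_kwargs)
```

Passing the constructor in as make_llm keeps the sketch independent of any installed package; in a real pipeline that role is played by the OpenAI LLM class.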
Model ID Format
OpenAI finetuned model IDs follow a specific naming convention:
# Format: ft:{base_model}:{org}:{custom_suffix}:{id}
# Example (no custom suffix, hence the "::"): ft:gpt-3.5-turbo-0613:my-org::7p4lURel
This model ID is what you would use directly with the OpenAI API for inference. LlamaIndex wraps this ID in its OpenAI LLM class, which handles prompt formatting, streaming, retries, and all other LLM interaction patterns consistently with any other model.
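To make the naming convention concrete, the sketch below splits a model ID into its parts. It assumes the ID always has five colon-separated fields, that the field between the two adjacent colons is an optional custom label (empty in the example), and that the last field is the generated identifier. FinetunedModelId and parse_finetuned_model_id are hypothetical helpers, not part of LlamaIndex or the OpenAI client.

```python
from typing import NamedTuple


class FinetunedModelId(NamedTuple):
    base_model: str
    org: str
    custom_suffix: str  # empty when no label was set, which yields the "::"
    unique_id: str


def parse_finetuned_model_id(model_id: str) -> FinetunedModelId:
    """Split a finetuned model ID into its colon-separated fields."""
    parts = model_id.split(":")
    if len(parts) != 5 or parts[0] != "ft":
        raise ValueError(f"Not a recognized finetuned model ID: {model_id!r}")
    _, base_model, org, custom_suffix, unique_id = parts
    return FinetunedModelId(base_model, org, custom_suffix, unique_id)


parsed = parse_finetuned_model_id("ft:gpt-3.5-turbo-0613:my-org::7p4lURel")
print(parsed.base_model, parsed.org, parsed.unique_id)
```

A check like this can be useful for logging which base model a finetuned LLM derives from, but the ID should be passed to the API unchanged.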
Configuring the Finetuned Model
The get_finetuned_model() method accepts arbitrary keyword arguments that are passed through to the OpenAI LLM constructor, allowing fine-grained control over inference behavior:
# Get with default parameters
ft_llm = engine.get_finetuned_model()
# Get with custom parameters
ft_llm = engine.get_finetuned_model(
    temperature=0.3,
    max_tokens=512,
)
Common parameters to configure include:
- temperature: Controls response randomness. Lower values give more deterministic output, and finetuned models often work well with lower temperatures
- max_tokens: Maximum response length
- additional_kwargs: Any other parameters supported by the OpenAI Chat Completions API
Error Handling
The method performs two validation checks before returning the model:
- Model ID not ready: If fine_tuned_model is None, the job may still be running or may have failed without producing a model. A ValueError is raised with the job ID for debugging.
- Job not succeeded: If the status is anything other than "succeeded", the model cannot be used. A ValueError is raised with the current status for diagnosis.
These guards prevent the common mistake of trying to use a model from an incomplete or failed job.
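On the calling side, both failures surface as ValueError, so a single handler covers them. The wrapper below is a small sketch of that pattern; try_get_finetuned_model is a hypothetical helper, and engine is assumed to be any object exposing get_finetuned_model() as described above.

```python
from typing import Any, Optional


def try_get_finetuned_model(engine: Any, **llm_kwargs: Any) -> Optional[Any]:
    """Return the finetuned LLM, or None if the job is not usable yet."""
    try:
        return engine.get_finetuned_model(**llm_kwargs)
    except ValueError as err:
        # The message carries the job ID or the current status.
        print(f"Finetuned model not available: {err}")
        return None
```

Returning None suits code that wants to fall back to a base model; code that cannot proceed without the finetuned model is better off letting the exception propagate.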
Workflow Context
The retrieval step fits into the broader finetuning workflow as follows:
import time
from llama_index.finetuning import OpenAIFinetuneEngine
# Launch the finetuning job
engine = OpenAIFinetuneEngine(
    base_model="gpt-3.5-turbo",
    data_path="training_data.jsonl",
)
engine.finetune()
# Wait for completion (polling here; an email notification also works).
# "cancelled" is included so the loop cannot spin forever on a cancelled job.
while engine.get_current_job().status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
# Retrieve the finetuned model (raises ValueError if the job did not succeed)
ft_llm = engine.get_finetuned_model(temperature=0.3)
# Use it like any other LLM
response = ft_llm.complete("What is the purpose of vector databases?")
print(response)
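A bounded version of the polling loop avoids waiting indefinitely. The helper below is a sketch under stated assumptions: wait_for_terminal_status and TERMINAL_STATUSES are hypothetical names, the default interval and timeout are arbitrary, and the status strings are those the OpenAI fine-tuning API reports for finished jobs.

```python
import time
from typing import Callable

# Statuses after which a fine-tuning job no longer changes.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_interval: float = 60.0,
    timeout: float = 3 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

With the engine from the workflow, a call would look like wait_for_terminal_status(lambda: engine.get_current_job().status). The sleep parameter exists only so the helper can be exercised without real delays.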
Key Considerations
- Job must be completed: Always verify the job status is "succeeded" before calling get_finetuned_model()
- Model availability delay: Even after a job succeeds, there may be a brief delay before the model is available for inference
- LLM type compatibility: The returned object is an OpenAI LLM instance, fully compatible with all LlamaIndex components that accept an LLM type
- Iterative finetuning: The returned model ID can itself be used as a base_model for subsequent finetuning rounds
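One way to absorb the availability delay is to retry the first inference call a few times. The helper below is an illustrative sketch, not a LlamaIndex feature: call_with_retry is a hypothetical name, the attempt count and delay are arbitrary, and it catches every exception for brevity where real code would narrow this to the error the client raises for an unavailable model.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke call, retrying after a fixed delay until attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            # Re-raise the last failure instead of hiding it.
            if attempt == attempts:
                raise
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

A first request could then be written as call_with_retry(lambda: ft_llm.complete("ping")), where ft_llm is the object returned by get_finetuned_model().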