Principle:Run llama Llama index Finetuned Model Retrieval

Overview

Finetuned Model Retrieval is the step that bridges finetuning job completion and actual model usage. Once a finetuning job succeeds, OpenAI assigns a unique model ID to the resulting model (e.g., ft:gpt-3.5-turbo:org-name::job-id). This model ID must be extracted from the completed job and wrapped in an LLM object that can be used throughout the LlamaIndex pipeline. The get_finetuned_model() method handles this entire process, returning a ready-to-use OpenAI LLM instance configured with the finetuned model.

From Job ID to Usable LLM

The retrieval process involves several conceptual steps:

Step	Action	Detail
1	Retrieve latest job status	API call to get current `FineTuningJob` state
2	Validate job completion	Verify status is `"succeeded"` and model ID is available
3	Extract model ID	Read `fine_tuned_model` from the job object
4	Create LLM wrapper	Instantiate `OpenAI(model=model_id, **kwargs)`

This sequence ensures that the returned LLM object is fully configured and ready for inference. The method raises clear errors if the job has not completed or has failed, preventing silent failures in downstream pipeline components.
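
The four steps above can be sketched offline as follows. This is a minimal illustration under stated assumptions, not the library's source: `FakeJob`, `FakeLLM`, `fetch_job`, and the job ID `ftjob-abc123` are hypothetical stand-ins for the OpenAI `FineTuningJob` object, the LlamaIndex `OpenAI` LLM class, the API call that retrieves the job, and a real job ID.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class FakeJob:
    """Hypothetical stand-in for an OpenAI FineTuningJob object."""
    id: str
    status: str
    fine_tuned_model: Optional[str] = None


@dataclass
class FakeLLM:
    """Hypothetical stand-in for the LlamaIndex OpenAI LLM wrapper."""
    model: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


def retrieve_finetuned_llm(
    fetch_job: Callable[[], FakeJob], **model_kwargs: Any
) -> FakeLLM:
    # Step 1: retrieve the latest job state (an API call in the real engine).
    job = fetch_job()
    # Step 2: validate that a model ID exists and that the job succeeded.
    if job.fine_tuned_model is None:
        raise ValueError(f"Job {job.id} does not have a finetuned model ID yet.")
    if job.status != "succeeded":
        raise ValueError(f"Job {job.id} has status {job.status}; cannot get model.")
    # Step 3: extract the model ID from the job object.
    model_id = job.fine_tuned_model
    # Step 4: wrap the model ID in an LLM object, forwarding keyword arguments.
    return FakeLLM(model=model_id, kwargs=model_kwargs)


done = FakeJob("ftjob-abc123", "succeeded", "ft:gpt-3.5-turbo-0613:my-org::7p4lURel")
llm = retrieve_finetuned_llm(lambda: done, temperature=0.3)
print(llm.model)
print(llm.kwargs)
```

Passing the job lookup in as a callable keeps the sketch free of network access; in the real engine that lookup is the API call from step 1.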

Model ID Format

OpenAI finetuned model IDs follow a specific naming convention:

# Format: ft:{base_model}:{org}:{custom_suffix}:{id}
# Example (no custom suffix, hence the empty field): ft:gpt-3.5-turbo-0613:my-org::7p4lURel

This model ID is what you would use directly with the OpenAI API for inference. LlamaIndex wraps this ID in its OpenAI LLM class, which handles prompt formatting, streaming, retries, and all other LLM interaction patterns consistently with any other model.
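
When logging or auditing models it can help to split such an ID into its parts. The sketch below assumes the ID has exactly five colon-separated fields, the first being `ft` and the fourth being empty when no custom suffix was set. `parse_finetuned_model_id` and the field names `custom_suffix` and `identifier` are hypothetical and not part of LlamaIndex or the OpenAI SDK.

```python
from typing import NamedTuple


class FinetunedModelId(NamedTuple):
    base_model: str
    org: str
    custom_suffix: str  # empty when the ID contains "::"
    identifier: str


def parse_finetuned_model_id(model_id: str) -> FinetunedModelId:
    """Split an ID such as ft:gpt-3.5-turbo-0613:my-org::7p4lURel into fields."""
    parts = model_id.split(":")
    # Assumption: exactly five colon-separated fields, the first being "ft".
    if len(parts) != 5 or parts[0] != "ft":
        raise ValueError(f"Not a recognized finetuned model ID: {model_id!r}")
    return FinetunedModelId(*parts[1:])


parsed = parse_finetuned_model_id("ft:gpt-3.5-turbo-0613:my-org::7p4lURel")
print(parsed.base_model)  # gpt-3.5-turbo-0613
print(parsed.identifier)  # 7p4lURel
```

IDs that do not match the assumed shape are rejected rather than guessed at, so a plain base model name such as `gpt-3.5-turbo` raises a `ValueError`.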

Configuring the Finetuned Model

The get_finetuned_model() method accepts arbitrary keyword arguments that are passed through to the OpenAI LLM constructor, allowing fine-grained control over inference behavior:

# Get with default parameters
ft_llm = engine.get_finetuned_model()

# Get with custom parameters
ft_llm = engine.get_finetuned_model(
    temperature=0.3,
    max_tokens=512,
)

Common parameters to configure include:

temperature: Controls response randomness. Lower values give more deterministic output, and finetuned models often work well at lower temperatures.
max_tokens: Maximum number of tokens in the response
additional_kwargs: A dictionary of any other parameters supported by the OpenAI Chat Completions API

Error Handling

The method performs two validation checks before returning the model:

Model ID not ready: If fine_tuned_model is None, the job may still be running or may have failed without producing a model. A ValueError is raised with the job ID for debugging.
Job not succeeded: If the status is anything other than "succeeded", the model cannot be used. A ValueError is raised with the current status for diagnosis.

These guards prevent the common mistake of trying to use a model from an incomplete or failed job.
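
On the caller's side, both guards surface as a `ValueError`, so one `except` clause covers them. The sketch below is an assumption-laden illustration: `StubEngine` is a hypothetical stand-in that mimics the two checks in the order listed above without calling any API, `try_get_model` is a hypothetical helper, and the stub returns the model ID string where the real method returns an LLM object.

```python
from typing import Any, Optional


class StubEngine:
    """Hypothetical stand-in for a finetune engine; no API calls are made."""

    def __init__(self, job_id: str, status: str, model_id: Optional[str]) -> None:
        self.job_id = job_id
        self.status = status
        self.model_id = model_id

    def get_finetuned_model(self, **model_kwargs: Any) -> str:
        # Mirrors the two documented checks, in the order listed above.
        if self.model_id is None:
            raise ValueError(f"Job {self.job_id} has no finetuned model ID yet.")
        if self.status != "succeeded":
            raise ValueError(f"Job {self.job_id} has status {self.status}.")
        return self.model_id  # the real method returns an LLM object instead


def try_get_model(engine: StubEngine, **model_kwargs: Any) -> Optional[str]:
    """Return the model, or None after reporting why it is not usable."""
    try:
        return engine.get_finetuned_model(**model_kwargs)
    except ValueError as exc:
        print(f"Finetuned model unavailable: {exc}")
        return None


running = StubEngine("ftjob-abc123", "running", None)
finished = StubEngine(
    "ftjob-abc123", "succeeded", "ft:gpt-3.5-turbo-0613:my-org::7p4lURel"
)

print(try_get_model(running))
print(try_get_model(finished, temperature=0.3))
```

Returning `None` is only one option; a pipeline could equally fall back to the base model or re-raise after logging.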

Workflow Context

The retrieval step fits into the broader finetuning workflow as follows:

import time

from llama_index.finetuning import OpenAIFinetuneEngine

# Launch the finetuning job
engine = OpenAIFinetuneEngine(
    base_model="gpt-3.5-turbo",
    data_path="training_data.jsonl",
)
engine.finetune()

# Poll until the job reaches a terminal state
# ("cancelled" is included so the loop cannot spin forever on a cancelled job)
while engine.get_current_job().status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)

# Retrieve the finetuned model (raises ValueError if the job did not succeed)
ft_llm = engine.get_finetuned_model(temperature=0.3)

# Use it like any other LLM
response = ft_llm.complete("What is the purpose of vector databases?")
print(response)
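
The polling loop in the workflow above has no upper bound on waiting time. A bounded variant might look like the sketch below. It assumes the terminal statuses are `succeeded`, `failed`, and `cancelled`; `wait_for_terminal_status` and its parameters are hypothetical names, and the status source and sleep function are injected so the demo runs instantly without any API calls.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until a terminal status appears or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job still {status!r} after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval


# Demo with a scripted status sequence and a no-op sleep.
statuses = iter(["validating_files", "running", "running", "succeeded"])
final = wait_for_terminal_status(lambda: next(statuses), sleep=lambda _: None)
print(final)
```

With a real engine, `get_status` would presumably be `lambda: engine.get_current_job().status`, and `get_finetuned_model()` would be called only when the returned status is `succeeded`.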

Key Considerations

Job must be completed: Always verify the job status is "succeeded" before calling get_finetuned_model()
Model availability delay: Even after a job succeeds, there may be a brief delay before the model is available for inference
LLM type compatibility: The returned object is an OpenAI LLM instance, fully compatible with all LlamaIndex components that accept an LLM type
Iterative finetuning: The finetuned model ID (the job's fine_tuned_model value, not the returned LLM object) can itself be used as a base_model for subsequent finetuning rounds
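
For the availability delay noted above, one option is to retry the first inference call with exponential backoff. The sketch below is a generic pattern, not something LlamaIndex is known to do for this case: `call_with_retry` is a hypothetical helper, and it catches every `Exception` only for brevity. Real code should catch the specific error the API returns for a model that is not yet available.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff until it succeeds."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demo: a hypothetical call that fails twice before the model is "ready".
calls = {"n": 0}


def flaky_completion() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("model not available yet")
    return "ok"


delays: List[float] = []
result = call_with_retry(flaky_completion, sleep=delays.append)
print(result, delays)
```

In practice `fn` would wrap the first call to the finetuned LLM, for example `lambda: ft_llm.complete(prompt)`.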
