Principle:Run llama Llama index Job Status Monitoring
Overview
Job Status Monitoring addresses the challenge of tracking asynchronous finetuning operations. When a finetuning job is launched via the OpenAI API, it enters a queue and proceeds through multiple states before completion. Since finetuning can take minutes to hours depending on dataset size and model complexity, applications need a reliable way to poll for status updates and determine when the finetuned model is ready for use.
LlamaIndex provides the get_current_job() method on the finetuning engine to retrieve the latest job status from OpenAI. This method returns the full FineTuningJob object from the OpenAI Python SDK, giving access to all job metadata including status, model ID, timestamps, and error information.
Job Lifecycle States
An OpenAI finetuning job progresses through the following states:
| State | Description | Terminal? |
|---|---|---|
| `validating_files` | OpenAI is validating the uploaded training file | No |
| `queued` | Job is waiting in the training queue | No |
| `running` | Model training is actively in progress | No |
| `succeeded` | Training completed successfully; model is available | Yes |
| `failed` | Training failed due to an error | Yes |
| `cancelled` | Job was manually cancelled | Yes |
Only the succeeded state produces a usable finetuned model. Both failed and cancelled are terminal states where the model ID will be None.
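
The state table can be encoded as two small sets so that calling code does not repeat string literals. This is only an illustrative sketch: the names `TERMINAL_STATES`, `ACTIVE_STATES`, `is_terminal`, and `has_usable_model` are hypothetical helpers, not part of LlamaIndex or the OpenAI SDK.

```python
# States copied from the lifecycle table; the helper names are hypothetical.
TERMINAL_STATES = frozenset({"succeeded", "failed", "cancelled"})
ACTIVE_STATES = frozenset({"validating_files", "queued", "running"})


def is_terminal(status: str) -> bool:
    """Return True once the job can no longer change state."""
    return status in TERMINAL_STATES


def has_usable_model(status: str) -> bool:
    """Only a succeeded job yields a finetuned model ID."""
    return status == "succeeded"


for status in ("queued", "succeeded", "failed"):
    print(status, is_terminal(status), has_usable_model(status))
```

Keeping the terminal states in one place means a polling loop and any reporting code agree on when a job is finished.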
Polling Strategy
Since the OpenAI finetuning API is asynchronous, monitoring requires periodic polling:
    import time

    from llama_index.finetuning import OpenAIFinetuneEngine

    engine = OpenAIFinetuneEngine(
        base_model="gpt-3.5-turbo",
        data_path="training_data.jsonl",
    )
    engine.finetune()  # launches the job; training continues asynchronously

    # Poll until the job reaches a terminal state
    while True:
        job = engine.get_current_job()
        print(f"Status: {job.status}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(60)

    if job.status == "succeeded":
        print(f"Finetuned model: {job.fine_tuned_model}")
    else:
        print(f"Job ended with status: {job.status}")
Note that OpenAI also sends an email notification when a job completes, so polling is useful for programmatic workflows but not strictly necessary for interactive use.
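
For unattended workflows, an unbounded `while True` loop may be undesirable. One possible variant, sketched below, adds an overall timeout. The function `wait_for_job` and its parameters are hypothetical. It accepts any zero-argument callable that returns an object with a `status` attribute, so `engine.get_current_job` could be passed in directly, and the injectable `sleep` argument exists only to make the loop testable without waiting.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Optional

TERMINAL_STATES = ("succeeded", "failed", "cancelled")


def wait_for_job(
    get_job: Callable[[], Any],
    poll_interval: float = 60.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_job() until the job is terminal; raise TimeoutError otherwise."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        job = get_job()
        if job.status in TERMINAL_STATES:
            return job
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.status!r} after {timeout} seconds")
        sleep(poll_interval)


# Demonstration with a fake job source (no network calls, no real sleeping).
fake_statuses = iter(["queued", "running", "succeeded"])
final = wait_for_job(
    lambda: SimpleNamespace(status=next(fake_statuses)),
    poll_interval=60,
    sleep=lambda seconds: None,
)
print(final.status)
```

With a real engine the call would look like `wait_for_job(engine.get_current_job, poll_interval=60, timeout=4 * 3600)`, where the four-hour limit is an arbitrary example value.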
FineTuningJob Object
The FineTuningJob object returned by get_current_job() contains rich metadata:
| Attribute | Type | Description |
|---|---|---|
| `id` | `str` | The unique job identifier (e.g., "ftjob-abc123") |
| `status` | `str` | Current job state (see lifecycle states above) |
| `fine_tuned_model` | `Optional[str]` | The model ID of the finetuned model (set only when the job has succeeded) |
| `model` | `str` | The base model being finetuned |
| `created_at` | `int` | Unix timestamp of job creation |
| `finished_at` | `Optional[int]` | Unix timestamp of job completion |
| `error` | `Optional[Error]` | Error details if the job failed |
| `trained_tokens` | `Optional[int]` | Number of tokens used in training |
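
As a rough illustration of how these attributes might be consumed, the sketch below turns a job object into a small summary dictionary and converts the Unix timestamps into a duration. The helpers `to_utc` and `summarize_job` are hypothetical. The `example` object is a stand-in built with `SimpleNamespace` and made-up values. A real object would come from `get_current_job()`.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Optional


def to_utc(ts: Optional[int]) -> Optional[datetime]:
    """Convert a Unix timestamp to an aware UTC datetime (None passes through)."""
    return None if ts is None else datetime.fromtimestamp(ts, tz=timezone.utc)


def summarize_job(job: Any) -> dict:
    """Collect the attributes listed in the table into a plain dict."""
    created = to_utc(job.created_at)
    finished = to_utc(job.finished_at)
    return {
        "id": job.id,
        "status": job.status,
        "base_model": job.model,
        "fine_tuned_model": job.fine_tuned_model,
        "trained_tokens": job.trained_tokens,
        # finished_at is None while the job is still in progress
        "duration_seconds": (
            (finished - created).total_seconds() if finished else None
        ),
    }


# Stand-in with invented values, used only to exercise the helper.
example = SimpleNamespace(
    id="ftjob-abc123",
    status="succeeded",
    model="gpt-3.5-turbo",
    fine_tuned_model="ft:example-model-id",  # illustrative, not a real model ID
    created_at=1_700_000_000,
    finished_at=1_700_001_800,
    trained_tokens=12_345,
)
print(summarize_job(example))
```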
Key Considerations
- Job must be launched first: Calling `get_current_job()` before `finetune()` (and without providing `start_job_id`) raises a `ValueError`
- Reconnecting to existing jobs: Use the `start_job_id` constructor parameter to monitor a job launched in a previous session
- Rate limiting: Avoid polling too frequently; intervals of 30-60 seconds are reasonable to avoid hitting API rate limits
- Error handling: Always check for `failed` status and inspect the `error` attribute for diagnostic information
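
The error-handling point can be sketched as a small guard that either returns the finetuned model ID or raises with whatever diagnostics are available. The function `require_finetuned_model` is hypothetical, and the `code` and `message` attributes read from `job.error` are an assumption about the shape of the SDK's error object. The job objects here are `SimpleNamespace` stand-ins with invented values.

```python
from types import SimpleNamespace
from typing import Any


def require_finetuned_model(job: Any) -> str:
    """Return the finetuned model ID, or raise RuntimeError with details."""
    if job.status == "succeeded":
        return job.fine_tuned_model
    if job.status == "failed":
        err = job.error
        # Assumption: the error object exposes .code and .message.
        detail = f"{err.code}: {err.message}" if err is not None else "no error details"
        raise RuntimeError(f"job {job.id} failed ({detail})")
    raise RuntimeError(f"job {job.id} has no model (status: {job.status})")


# Stand-in for a failed job; the error values are invented for illustration.
failed_job = SimpleNamespace(
    id="ftjob-abc123",
    status="failed",
    fine_tuned_model=None,
    error=SimpleNamespace(code="example_code", message="example message"),
)

try:
    require_finetuned_model(failed_job)
except RuntimeError as exc:
    print(exc)

# Hypothetical wiring for a job started in an earlier session (not executed here):
# engine = OpenAIFinetuneEngine(
#     base_model="gpt-3.5-turbo",
#     data_path="training_data.jsonl",
#     start_job_id="ftjob-abc123",
# )
# model_id = require_finetuned_model(engine.get_current_job())
```

Raising on every non-succeeded state keeps downstream code from accidentally using a `None` model ID.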