Principle:Run llama Llama index Job Status Monitoring

Overview

Job Status Monitoring addresses the challenge of tracking asynchronous finetuning operations. When a finetuning job is launched via the OpenAI API, it enters a queue and proceeds through multiple states before completion. Since finetuning can take minutes to hours depending on dataset size and model complexity, applications need a reliable way to poll for status updates and determine when the finetuned model is ready for use.

LlamaIndex provides the get_current_job() method on the finetuning engine to retrieve the latest job status from OpenAI. This method returns the full FineTuningJob object from the OpenAI Python SDK, giving access to all job metadata including status, model ID, timestamps, and error information.

Job Lifecycle States

An OpenAI finetuning job progresses through the following states:

State	Description	Terminal?
`validating_files`	OpenAI is validating the uploaded training file	No
`queued`	Job is waiting in the training queue	No
`running`	Model training is actively in progress	No
`succeeded`	Training completed successfully; model is available	Yes
`failed`	Training failed due to an error	Yes
`cancelled`	Job was manually cancelled	Yes

Only the `succeeded` state produces a usable finetuned model. `failed` and `cancelled` are also terminal, but in both cases the job's `fine_tuned_model` field remains `None`.
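The table above can be encoded as a small lookup so the rest of an application does not repeat string comparisons. This is a minimal sketch: the names `TERMINAL_STATES`, `IN_PROGRESS_STATES`, `is_terminal`, and `has_usable_model` are illustrative helpers, not part of LlamaIndex or the OpenAI SDK.

```python
from typing import Optional

# State names taken from the lifecycle table; the grouping is a local convention.
IN_PROGRESS_STATES = frozenset({"validating_files", "queued", "running"})
TERMINAL_STATES = frozenset({"succeeded", "failed", "cancelled"})


def is_terminal(status: str) -> bool:
    """Return True when the job will not change state again."""
    return status in TERMINAL_STATES


def has_usable_model(status: str, fine_tuned_model: Optional[str]) -> bool:
    """Return True only when the job succeeded and a model ID is present."""
    return status == "succeeded" and fine_tuned_model is not None
```

Keeping the two sets disjoint makes it easy to treat any status outside both as unexpected rather than silently continuing to poll.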

Polling Strategy

Since the OpenAI finetuning API is asynchronous, monitoring requires periodic polling:

import time
from llama_index.finetuning import OpenAIFinetuneEngine

engine = OpenAIFinetuneEngine(
    base_model="gpt-3.5-turbo",
    data_path="training_data.jsonl",
)
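engine.finetune()  # launches the finetuning job

# Poll until the job reaches a terminal state
while True:
    job = engine.get_current_job()  # fetches the latest FineTuningJob from OpenAI
    print(f"Status: {job.status}")
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)  # wait between polls to stay within rate limits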

if job.status == "succeeded":
    print(f"Finetuned model: {job.fine_tuned_model}")
else:
    print(f"Job ended with status: {job.status}")

Note that OpenAI also sends an email notification when a job completes, so polling is useful for programmatic workflows but not strictly necessary for interactive use.
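The loop above runs until the job finishes, which may not suit unattended scripts. One possible variation adds an overall timeout and makes the sleep and clock functions injectable so the logic can be exercised without waiting. This is a sketch under assumptions: `wait_for_job` and `FakeEngine` are hypothetical names, and the only engine method relied on is `get_current_job()` as described in this page.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Iterable

TERMINAL_STATES = ("succeeded", "failed", "cancelled")


def wait_for_job(
    engine: Any,
    poll_interval: float = 60.0,
    timeout: float = 6 * 3600,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll engine.get_current_job() until a terminal state or the timeout."""
    deadline = clock() + timeout
    while True:
        job = engine.get_current_job()
        if job.status in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still '{job.status}' after {timeout} seconds")
        sleep(poll_interval)


class FakeEngine:
    """Stand-in for a finetuning engine; replays a fixed list of statuses."""

    def __init__(self, statuses: Iterable[str]) -> None:
        self._statuses = iter(statuses)

    def get_current_job(self) -> SimpleNamespace:
        status = next(self._statuses)
        # The model ID below is a made-up placeholder for illustration only.
        model = "ft:placeholder-model" if status == "succeeded" else None
        return SimpleNamespace(status=status, fine_tuned_model=model)


# Demonstration with the stand-in engine and a no-op sleep.
demo_job = wait_for_job(FakeEngine(["queued", "running", "succeeded"]), sleep=lambda _: None)
print(demo_job.status)
```

With a real engine, the same function would be called as `wait_for_job(engine)` after `engine.finetune()`; the default timeout of six hours is an arbitrary choice, not a value recommended by OpenAI.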

FineTuningJob Object

The FineTuningJob object returned by get_current_job() exposes the following attributes, among others:

Attribute	Type	Description
`id`	`str`	The unique job identifier (e.g., `"ftjob-abc123"`)
`status`	`str`	Current job state (see lifecycle states above)
`fine_tuned_model`	`Optional[str]`	The model ID of the finetuned model (only set when `succeeded`)
`model`	`str`	The base model being finetuned
`created_at`	`int`	Unix timestamp of job creation
`finished_at`	`Optional[int]`	Unix timestamp of job completion
`error`	`Optional[Error]`	Error details if the job failed
`trained_tokens`	`Optional[int]`	Number of tokens used in training
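Because the timestamps are Unix integers and several fields are optional, it can help to normalize a job into a plain dictionary before logging it. The sketch below assumes only the attributes listed in the table; `summarize_job` is a hypothetical helper, and `example_job` is a stand-in object rather than a real API response.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Dict, Optional


def _to_iso(ts: Optional[int]) -> Optional[str]:
    """Convert a Unix timestamp to an ISO 8601 UTC string, passing None through."""
    if ts is None:
        return None
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def summarize_job(job: Any) -> Dict[str, Any]:
    """Collect the most commonly inspected job fields into a dictionary."""
    finished = job.finished_at
    duration = finished - job.created_at if finished is not None else None
    return {
        "id": job.id,
        "status": job.status,
        "base_model": job.model,
        "fine_tuned_model": job.fine_tuned_model,
        "created_at": _to_iso(job.created_at),
        "finished_at": _to_iso(finished),
        "duration_seconds": duration,
        "trained_tokens": job.trained_tokens,
    }


# Stand-in object with the same attribute names as the table above.
example_job = SimpleNamespace(
    id="ftjob-abc123",
    status="running",
    model="gpt-3.5-turbo",
    fine_tuned_model=None,
    created_at=1_700_000_000,
    finished_at=None,
    error=None,
    trained_tokens=None,
)
summary = summarize_job(example_job)
```

For a job that is still running, `finished_at`, `duration_seconds`, and `fine_tuned_model` in the summary are all `None`.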

Key Considerations

Job must be launched first: Calling get_current_job() before finetune() (and without providing start_job_id) raises a ValueError
Reconnecting to existing jobs: Use the start_job_id constructor parameter to monitor a job launched in a previous session
Rate limiting: Avoid polling too frequently; an interval of 30-60 seconds is usually enough to stay within API rate limits
Error handling: Always check for failed status and inspect the error attribute for diagnostic information
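The first and last considerations above can be sketched as two small helpers: one that tolerates an engine with no launched job, and one that turns a job into a readable outcome. The names `get_job_or_none` and `describe_outcome` are hypothetical, and the `code` and `message` fields on `job.error` are assumed to match the OpenAI SDK's error object; they are read defensively with `getattr` in case they are absent.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_job_or_none(engine: Any) -> Optional[Any]:
    """Return the current job, or None if no job has been launched yet."""
    try:
        return engine.get_current_job()
    except ValueError:
        # Raised when finetune() was not called and no start_job_id was given.
        return None


def describe_outcome(job: Any) -> str:
    """Produce a one-line description of the job's current or final state."""
    if job.status == "succeeded":
        return f"ready: {job.fine_tuned_model}"
    if job.status == "failed":
        err = job.error
        code = getattr(err, "code", None) or "unknown"
        message = getattr(err, "message", None) or "no message provided"
        return f"failed ({code}): {message}"
    if job.status == "cancelled":
        return "cancelled: no model was produced"
    return f"in progress: {job.status}"


class UnstartedEngine:
    """Stand-in that mimics an engine on which finetune() was never called."""

    def get_current_job(self) -> Any:
        raise ValueError("Must call finetune() first")


# Stand-in job objects for illustration; the error values are invented.
failed_job = SimpleNamespace(
    status="failed",
    fine_tuned_model=None,
    error=SimpleNamespace(code="invalid_training_file", message="example message"),
)
print(describe_outcome(failed_job))
print(get_job_or_none(UnstartedEngine()))
```

When reconnecting in a later session, the same helpers would apply to an engine constructed with the `start_job_id` parameter mentioned above.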
