Principle:Run llama Llama index Job Status Monitoring

From Leeroopedia

Overview

Job Status Monitoring addresses the challenge of tracking asynchronous finetuning operations. When a finetuning job is launched via the OpenAI API, its training file is validated, the job is queued, and training then runs, so the job passes through several states before completion. Since finetuning can take minutes to hours depending on dataset size and the base model, applications need a reliable way to poll for status updates and determine when the finetuned model is ready for use.

LlamaIndex provides the get_current_job() method on the finetuning engine to retrieve the latest job status from OpenAI. This method returns the full FineTuningJob object from the OpenAI Python SDK, giving access to all job metadata including status, model ID, timestamps, and error information.

Job Lifecycle States

An OpenAI finetuning job progresses through the following states:

| State | Description | Terminal? |
|---|---|---|
| validating_files | OpenAI is validating the uploaded training file | No |
| queued | Job is waiting in the training queue | No |
| running | Model training is actively in progress | No |
| succeeded | Training completed successfully; model is available | Yes |
| failed | Training failed due to an error | Yes |
| cancelled | Job was manually cancelled | Yes |

Only the succeeded state produces a usable finetuned model. Both failed and cancelled are terminal states in which the job's fine_tuned_model attribute remains None.
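The state table above can be captured in a couple of small helpers. This is a minimal sketch: the names TERMINAL_STATES, IN_PROGRESS_STATES, is_terminal, and has_usable_model are illustrative and are not part of LlamaIndex or the OpenAI SDK.

```python
# Status strings taken from the lifecycle table above.
IN_PROGRESS_STATES = frozenset({"validating_files", "queued", "running"})
TERMINAL_STATES = frozenset({"succeeded", "failed", "cancelled"})


def is_terminal(status: str) -> bool:
    """Return True once the job can no longer change state."""
    return status in TERMINAL_STATES


def has_usable_model(status: str) -> bool:
    """Only a succeeded job yields a finetuned model."""
    return status == "succeeded"
```

Keeping the terminal states in one set means a polling loop and any later reporting code agree on when a job is finished.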

Polling Strategy

Since the OpenAI finetuning API is asynchronous, monitoring requires periodic polling:

import time
from llama_index.finetuning import OpenAIFinetuneEngine

engine = OpenAIFinetuneEngine(
    base_model="gpt-3.5-turbo",
    data_path="training_data.jsonl",
)
engine.finetune()

# Poll until completion
while True:
    job = engine.get_current_job()
    print(f"Status: {job.status}")
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

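if job.status == "succeeded":
    # fine_tuned_model is only populated for succeeded jobs
    print(f"Finetuned model: {job.fine_tuned_model}")
else:
    # failed or cancelled: no finetuned model is produced
    print(f"Job ended with status: {job.status}; error: {job.error}")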

Note that OpenAI also sends an email notification when a job completes, so polling is useful for programmatic workflows but not strictly necessary for interactive use.
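For unattended workflows, the polling loop above may be easier to reuse as a function with an upper bound on how long it waits. The following is a sketch under stated assumptions: wait_for_job and its parameters are hypothetical names, the six-hour default timeout is arbitrary, and the function accepts any zero-argument callable that returns an object with a status attribute (for example, engine.get_current_job).

```python
import time
from typing import Any, Callable

TERMINAL_STATES = ("succeeded", "failed", "cancelled")


def wait_for_job(
    get_job: Callable[[], Any],
    poll_interval: float = 60.0,
    timeout: float = 6 * 60 * 60,  # arbitrary default; tune for your workload
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_job() until the job is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job()
        if job.status in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {job.status!r} after {timeout} seconds")
        sleep(poll_interval)
```

With the engine from the example above, this would be called as job = wait_for_job(engine.get_current_job). The sleep parameter is injectable only so the loop can be exercised in tests without real delays.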

FineTuningJob Object

The FineTuningJob object returned by get_current_job() contains rich metadata:

| Attribute | Type | Description |
|---|---|---|
| id | str | The unique job identifier (e.g., "ftjob-abc123") |
| status | str | Current job state (see lifecycle states above) |
| fine_tuned_model | Optional[str] | The model ID of the finetuned model (only set when succeeded) |
| model | str | The base model being finetuned |
| created_at | int | Unix timestamp of job creation |
| finished_at | Optional[int] | Unix timestamp of job completion |
| error | Optional[Error] | Error details if the job failed |
| trained_tokens | Optional[int] | Number of tokens used in training |
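As an illustration of how these attributes might be consumed, the sketch below condenses a job into a plain dictionary for logging. summarize_job is a hypothetical helper, and the example object is a stand-in with made-up values rather than a real FineTuningJob; only the attribute names come from the table above.

```python
from datetime import datetime, timezone
from types import SimpleNamespace
from typing import Any, Dict


def summarize_job(job: Any) -> Dict[str, Any]:
    """Collect the most commonly inspected job attributes into a dict."""
    created = datetime.fromtimestamp(job.created_at, tz=timezone.utc)
    # Elapsed time from creation to completion (includes time spent queued);
    # None while the job is still in progress.
    elapsed = None if job.finished_at is None else job.finished_at - job.created_at
    return {
        "id": job.id,
        "status": job.status,
        "base_model": job.model,
        "fine_tuned_model": job.fine_tuned_model,
        "created_at_utc": created.isoformat(),
        "elapsed_seconds": elapsed,
        "trained_tokens": job.trained_tokens,
    }


# Stand-in object with invented values, used here instead of a live API call.
example = SimpleNamespace(
    id="ftjob-abc123",
    status="succeeded",
    model="gpt-3.5-turbo",
    fine_tuned_model="ft:example-model",
    created_at=1_700_000_000,
    finished_at=1_700_001_800,
    error=None,
    trained_tokens=12_345,
)
summary = summarize_job(example)
```

Because created_at and finished_at are Unix timestamps in seconds, their difference is a duration in seconds and needs no further conversion.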

Key Considerations

  • Job must be launched first: Calling get_current_job() before finetune() (and without providing start_job_id) raises a ValueError
  • Reconnecting to existing jobs: Use the start_job_id constructor parameter to monitor a job launched in a previous session
  • Rate limiting: Do not poll more often than necessary; intervals of 30-60 seconds are reasonable and help avoid hitting API rate limits
  • Error handling: Always check for failed status and inspect the error attribute for diagnostic information
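The considerations above can be combined into defensive wrappers around the engine. This is a sketch, not part of LlamaIndex: try_get_job and describe_outcome are hypothetical helpers, and the ValueError behavior and the start_job_id parameter are taken from the list above.

```python
from typing import Any, Optional


def try_get_job(engine: Any) -> Optional[Any]:
    """Return the current job, or None if no job was started on this engine."""
    try:
        return engine.get_current_job()
    except ValueError:
        # Raised when finetune() has not been called and no start_job_id was given.
        return None


def describe_outcome(job: Any) -> str:
    """Turn a job's status into a short human-readable message."""
    if job.status == "succeeded":
        return f"ready: {job.fine_tuned_model}"
    if job.status == "failed":
        # The error attribute carries the diagnostic details.
        return f"failed: {job.error}"
    if job.status == "cancelled":
        return "cancelled"
    return f"in progress: {job.status}"


# Reconnecting in a later session (illustrative; the job ID is a placeholder):
# engine = OpenAIFinetuneEngine(
#     base_model="gpt-3.5-turbo",
#     data_path="training_data.jsonl",
#     start_job_id="ftjob-abc123",
# )
# job = try_get_job(engine)
```

Returning None instead of propagating the ValueError is one possible design; callers that treat a missing job as a programming error may prefer to let the exception surface.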
