Principle:BerriAI Litellm Job Progress Monitoring

From Leeroopedia
Knowledge Sources: Asynchronous Job Management Patterns, Polling Strategies, Lifecycle State Machines
Domains: Distributed Systems, API Design, Workflow Orchestration
Last Updated: 2026-02-15

Overview

Job progress monitoring is the practice of tracking the state and lifecycle of asynchronous fine-tuning jobs through polling, listing, retrieval, and cancellation operations.

Description

Fine-tuning jobs are long-running asynchronous operations that may take minutes to hours to complete. Once a job is created, the caller does not receive immediate results; instead, the provider returns a job object with an initial status (typically "validating_files" or "queued") and a unique job identifier. Progress monitoring encompasses the set of operations needed to observe, inspect, and manage these jobs throughout their lifecycle.

The three core monitoring operations are:

  • List: Retrieve a paginated collection of all fine-tuning jobs for the organization, allowing discovery and bulk status checking.
  • Retrieve: Fetch the current state of a specific job by its identifier, providing detailed information including status, error messages, and the resulting fine-tuned model name upon completion.
  • Cancel: Request termination of a running or queued job, useful when training is taking too long, hyperparameters were misconfigured, or the training data was incorrect. The provider stops the job on a best-effort basis.

These operations must work across multiple providers, each with their own endpoint schemas and authentication requirements, while presenting a unified interface to the caller.

Usage

Job progress monitoring should be employed when:

  • A fine-tuning job has been created and the caller needs to know when it completes.
  • An automated pipeline needs to poll for job completion before proceeding to model deployment.
  • An administrator needs to review all active and historical fine-tuning jobs.
  • A running job needs to be cancelled due to errors, cost concerns, or changed requirements.
  • Debugging a failed job requires retrieving detailed error information.

Theoretical Basis

Job Lifecycle State Machine

Fine-tuning jobs progress through a well-defined set of states:

                    +--------------------+
                    |  validating_files  |
                    +--------------------+
                             |
                             v
                    +--------------------+
                    |      queued        |
                    +--------------------+
                             |
                    +--------+--------+
                    |                 |
                    v                 v
           +-----------+     +-----------+
           |  running  |     | cancelled |
           +-----------+     +-----------+
                    |
           +--------+--------+
           |                 |
           v                 v
     +-----------+     +-----------+
     | succeeded |     |  failed   |
     +-----------+     +-----------+

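Terminal states are succeeded, failed, and cancelled. A job in any non-terminal state (validating_files, queued, or running) may be cancelled, although the diagram draws the cancellation edge only from queued. Only a succeeded job produces a usable fine-tuned model identifier.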
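
The lifecycle above can be encoded as a small lookup table, which makes it easy to check whether a status is terminal or whether a transition is expected. The following is a minimal sketch, not part of any provider SDK: the names JobStatus, TRANSITIONS, is_terminal, and can_transition are hypothetical, and the table assumes the forward path from the diagram plus cancellation from every non-terminal state.

```python
from enum import Enum


class JobStatus(str, Enum):
    """Status strings used in the lifecycle diagram."""
    VALIDATING_FILES = "validating_files"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


# States from which no further transition is expected.
TERMINAL = frozenset(
    {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.CANCELLED}
)

# Assumed transition table: the forward path shown in the diagram, plus
# cancellation from every non-terminal state as described in the text.
TRANSITIONS = {
    JobStatus.VALIDATING_FILES: {JobStatus.QUEUED, JobStatus.CANCELLED},
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELLED},
    JobStatus.RUNNING: {
        JobStatus.SUCCEEDED,
        JobStatus.FAILED,
        JobStatus.CANCELLED,
    },
}


def is_terminal(status: str) -> bool:
    """Return True if the job will not change state again."""
    return JobStatus(status) in TERMINAL


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` is in the table."""
    return JobStatus(new) in TRANSITIONS.get(JobStatus(current), set())
```

A caller might use is_terminal to decide when to stop polling, and can_transition to flag unexpected status changes in logs. A given provider may allow transitions this table omits, so treat it as a starting point.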

Polling Strategy

Since providers do not typically offer webhook-based notifications for fine-tuning job completion, callers must implement a polling strategy:

FUNCTION poll_until_complete(job_id, provider, interval, max_attempts):
    attempts = 0
    WHILE attempts < max_attempts:
        job = retrieve_fine_tuning_job(job_id, provider)
        IF job.status IN ["succeeded", "failed", "cancelled"]:
            RETURN job
        WAIT interval seconds
        attempts = attempts + 1
    RAISE timeout error

RECOMMENDED:
    interval = 30 to 60 seconds
    max_attempts = based on expected training duration

Key polling considerations:

  • Backoff: Increase the polling interval over time to reduce API call overhead for long-running jobs.
  • Rate limits: Respect provider rate limits to avoid throttling.
  • Error handling: Distinguish between transient API errors (retry) and permanent job failures (surface error).
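
The considerations above can be combined into one polling helper. The following is a minimal sketch under stated assumptions: retrieve is any caller-supplied function that takes a job ID and returns an object with a status attribute, TransientAPIError is a hypothetical marker exception that the caller maps retryable errors onto (for example timeouts or HTTP 429 responses), and the default intervals are illustrative rather than provider recommendations.

```python
import time
from typing import Any, Callable

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


class TransientAPIError(Exception):
    """Hypothetical marker for retryable errors (timeouts, HTTP 429/5xx)."""


def poll_until_complete(
    retrieve: Callable[[str], Any],
    job_id: str,
    initial_interval: float = 30.0,
    max_interval: float = 300.0,
    backoff_factor: float = 1.5,
    timeout: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll `retrieve(job_id)` until the job reaches a terminal status.

    The wait between polls grows by `backoff_factor` up to `max_interval`.
    Transient API errors are retried on the next poll; a job that ends in
    "failed" is returned so the caller can surface its error details.
    """
    deadline = clock() + timeout
    interval = initial_interval
    while True:
        try:
            job = retrieve(job_id)
        except TransientAPIError:
            job = None  # treat as "no answer yet" and retry after waiting
        if job is not None and job.status in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
        interval = min(interval * backoff_factor, max_interval)
```

Injecting sleep and clock keeps the helper testable without real waiting. With LiteLLM, retrieve could plausibly be a lambda wrapping litellm.retrieve_fine_tuning_job with the provider bound; verify the exact keyword arguments against the installed version before relying on that.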

Pagination for Listing

The list operation supports pagination through two parameters:

  • after: A cursor (job ID) indicating the position after which to return results. Used for forward pagination.
  • limit: The maximum number of results to return per page (provider defaults typically around 20).

FUNCTION list_all_jobs(provider):
    all_jobs = []
    cursor = None
    LOOP:
        page = list_fine_tuning_jobs(after=cursor, limit=20, provider=provider)
        all_jobs.EXTEND(page.data)
        IF page has no more results:
            BREAK
        cursor = last job ID in page.data
    RETURN all_jobs
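
The same cursor loop can be written as a generator, so that callers can stop early without fetching every page. The following is a sketch under assumptions: list_page is a hypothetical caller-supplied function taking (after, limit) and returning a page object with a data list whose items have an id attribute, and the has_more flag follows the OpenAI-style list response shape, which other providers may not expose.

```python
from typing import Any, Callable, Iterator, Optional


def iter_all_jobs(
    list_page: Callable[[Optional[str], int], Any],
    limit: int = 20,
) -> Iterator[Any]:
    """Yield every job, fetching one page at a time via the `after` cursor."""
    cursor: Optional[str] = None
    while True:
        page = list_page(cursor, limit)
        yield from page.data
        # Assumption: the page exposes `has_more`; an empty page or a
        # missing flag is treated as the end of the collection.
        if not page.data or not getattr(page, "has_more", False):
            return
        # The cursor is the ID of the last job on the current page.
        cursor = page.data[-1].id
```

Because the generator is lazy, a call such as next(j for j in iter_all_jobs(list_page) if j.status == "running") stops requesting pages as soon as a match is found.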

Cancellation Semantics

Cancellation is an idempotent, best-effort operation. The provider will attempt to stop the job, but some work may have already been performed. The returned job object reflects the updated status (typically "cancelled"). Cancelling a job that has already reached a terminal state is typically a no-op or returns the current state.
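
On the client side, these semantics suggest a guard that skips the cancel call when the job is already finished. The following is a minimal sketch, assuming hypothetical retrieve and cancel callables that each take a job ID and return an object with a status attribute; it does not describe any specific provider's behavior.

```python
from typing import Any, Callable

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def cancel_if_active(
    retrieve: Callable[[str], Any],
    cancel: Callable[[str], Any],
    job_id: str,
) -> Any:
    """Cancel a job unless it has already reached a terminal state.

    Safe to call repeatedly: once the job is terminal, the current job
    object is returned and no further cancel request is sent.
    """
    job = retrieve(job_id)
    if job.status in TERMINAL_STATUSES:
        return job  # nothing left to cancel
    # The job may still finish between retrieve and cancel, so the status
    # on the returned object is authoritative, not the one checked above.
    return cancel(job_id)
```

The caller should inspect the returned status rather than assume "cancelled", since the job may have succeeded or failed before the request took effect.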
