Principle:BerriAI Litellm Job Progress Monitoring
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Asynchronous Job Management Patterns, Polling Strategies, Lifecycle State Machines | Distributed Systems, API Design, Workflow Orchestration | 2026-02-15 |
Overview
Job progress monitoring is the practice of tracking the state and lifecycle of asynchronous fine-tuning jobs through polling, listing, retrieval, and cancellation operations.
Description
Fine-tuning jobs are long-running asynchronous operations that may take minutes to hours to complete. Once a job is created, the caller does not receive immediate results; instead, the provider returns a job object with an initial status (typically "validating_files" or "queued") and a unique job identifier. Progress monitoring encompasses the set of operations needed to observe, inspect, and manage these jobs throughout their lifecycle.
The three core monitoring operations are:
- List: Retrieve a paginated collection of all fine-tuning jobs for the organization, allowing discovery and bulk status checking.
- Retrieve: Fetch the current state of a specific job by its identifier, providing detailed information including status, error messages, and the resulting fine-tuned model name upon completion.
- Cancel: Request termination of a running or queued job, useful when training is taking too long, hyperparameters were misconfigured, or the training data was incorrect.
These operations must work across multiple providers, each with their own endpoint schemas and authentication requirements, while presenting a unified interface to the caller.
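As a rough illustration of what a unified interface could look like, the sketch below defines a provider-neutral protocol with the three operations and an in-memory stand-in that implements it. The names `Job`, `JobMonitor`, and `InMemoryMonitor` are hypothetical and are not taken from LiteLLM or any provider SDK; a real adapter would translate these calls into provider-specific requests and authentication.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol

TERMINAL = {"succeeded", "failed", "cancelled"}


@dataclass
class Job:
    """Hypothetical provider-neutral view of a fine-tuning job."""
    id: str
    status: str
    fine_tuned_model: Optional[str] = None
    error: Optional[str] = None


class JobMonitor(Protocol):
    """Hypothetical unified interface; each provider would supply an adapter."""

    def list_jobs(self, after: Optional[str] = None, limit: int = 20) -> List[Job]: ...
    def retrieve_job(self, job_id: str) -> Job: ...
    def cancel_job(self, job_id: str) -> Job: ...


class InMemoryMonitor:
    """Stand-in adapter that exercises the interface without network calls."""

    def __init__(self, jobs: List[Job]) -> None:
        self._jobs: Dict[str, Job] = {job.id: job for job in jobs}

    def list_jobs(self, after: Optional[str] = None, limit: int = 20) -> List[Job]:
        ids = list(self._jobs)
        # Start just past the cursor, or at the beginning when no cursor is given.
        start = ids.index(after) + 1 if after is not None else 0
        return [self._jobs[i] for i in ids[start:start + limit]]

    def retrieve_job(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def cancel_job(self, job_id: str) -> Job:
        job = self._jobs[job_id]
        # Jobs already in a terminal state are returned unchanged.
        if job.status not in TERMINAL:
            job.status = "cancelled"
        return job
```

Code written against `JobMonitor` does not need to know which provider sits behind it, which is the property the unified interface is meant to provide.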
Usage
Job progress monitoring should be employed when:
- A fine-tuning job has been created and the caller needs to know when it completes.
- An automated pipeline needs to poll for job completion before proceeding to model deployment.
- An administrator needs to review all active and historical fine-tuning jobs.
- A running job needs to be cancelled due to errors, cost concerns, or changed requirements.
- Debugging a failed job requires retrieving detailed error information.
Theoretical Basis
Job Lifecycle State Machine
Fine-tuning jobs progress through a well-defined set of states:
                  +--------------------+
                  |  validating_files  |
                  +--------------------+
                            |
                            v
                  +--------------------+
                  |       queued       |
                  +--------------------+
                            |
                   +--------+--------+
                   |                 |
                   v                 v
             +-----------+     +-----------+
             |  running  |     | cancelled |
             +-----------+     +-----------+
                   |
          +--------+--------+
          |                 |
          v                 v
    +-----------+     +-----------+
    | succeeded |     |  failed   |
    +-----------+     +-----------+
Terminal states are succeeded, failed, and cancelled. A job in any non-terminal state may be cancelled; the diagram draws only the transition from queued to cancelled. Only a succeeded job produces a usable fine-tuned model identifier.
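The lifecycle can be encoded as a small transition table. The following is a minimal sketch that mirrors the diagram plus the rule that any non-terminal state may be cancelled; `JobStatus`, `is_terminal`, and `can_transition` are illustrative names, and a given provider may expose additional states or transitions not modeled here.

```python
from enum import Enum


class JobStatus(str, Enum):
    VALIDATING_FILES = "validating_files"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    CANCELLED = "cancelled"


TERMINAL = frozenset({JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.CANCELLED})

# Forward edges from the diagram; cancellation is handled separately below.
_FORWARD = {
    JobStatus.VALIDATING_FILES: {JobStatus.QUEUED},
    JobStatus.QUEUED: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.SUCCEEDED, JobStatus.FAILED},
}


def is_terminal(status: str) -> bool:
    """Return True when no further transitions are possible."""
    return JobStatus(status) in TERMINAL


def can_transition(current: str, target: str) -> bool:
    """Check a transition against the diagram and the cancellation rule."""
    src, dst = JobStatus(current), JobStatus(target)
    if src in TERMINAL:
        return False  # terminal states have no outgoing edges
    if dst is JobStatus.CANCELLED:
        return True  # any non-terminal job may be cancelled
    return dst in _FORWARD[src]
```

A monitoring loop can use `is_terminal` as its stop condition, and `can_transition` can flag unexpected status changes when consecutive poll results are compared.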
Polling Strategy
Because a provider may not offer push notifications (such as webhooks) for fine-tuning job completion, callers generally need to implement a polling strategy:
    FUNCTION poll_until_complete(job_id, provider, interval, max_attempts):
        attempts = 0
        WHILE attempts < max_attempts:
            job = retrieve_fine_tuning_job(job_id, provider)
            IF job.status IN ["succeeded", "failed", "cancelled"]:
                RETURN job
            WAIT interval seconds
            attempts = attempts + 1
        RAISE timeout error

    RECOMMENDED:
        interval = 30 to 60 seconds
        max_attempts = based on expected training duration
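The pseudocode above could be rendered in Python roughly as follows. This is a sketch under assumptions: `retrieve` is a hypothetical callable that wraps whatever retrieval call the provider or SDK exposes and returns a mapping with a `"status"` key, and `sleep` is injectable so the loop can be tested without waiting.

```python
import time
from typing import Any, Callable, Mapping

TERMINAL_STATUSES = frozenset({"succeeded", "failed", "cancelled"})


def poll_until_complete(
    retrieve: Callable[[str], Mapping[str, Any]],  # hypothetical retrieval wrapper
    job_id: str,
    interval: float = 30.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll a job at a fixed interval until it reaches a terminal state."""
    for _ in range(max_attempts):
        job = retrieve(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(interval)
    raise TimeoutError(
        f"job {job_id} did not reach a terminal state after {max_attempts} attempts"
    )
```

With the defaults shown, the loop gives up after roughly one hour (120 attempts at 30 seconds); both values are placeholders to be sized against the expected training duration.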
Key polling considerations:
- Backoff: Increase the polling interval over time to reduce API call overhead for long-running jobs.
- Rate limits: Respect provider rate limits to avoid throttling.
- Error handling: Distinguish between transient API errors (retry) and permanent job failures (surface error).
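The considerations above can be combined in one loop. The sketch below assumes a hypothetical `TransientAPIError` exception that the retrieval wrapper raises for retryable failures (timeouts, throttling responses); a real client would map its own exception types onto it. The growth factor and cap are illustrative values, not provider recommendations.

```python
import time
from typing import Any, Callable, Iterator, Mapping

TERMINAL_STATUSES = frozenset({"succeeded", "failed", "cancelled"})


class TransientAPIError(Exception):
    """Hypothetical marker for retryable API failures."""


def backoff_intervals(initial: float = 30.0, factor: float = 1.5,
                      maximum: float = 300.0) -> Iterator[float]:
    """Yield polling delays that grow geometrically up to a cap."""
    interval = initial
    while True:
        yield min(interval, maximum)
        interval = min(interval * factor, maximum)


def poll_with_backoff(
    retrieve: Callable[[str], Mapping[str, Any]],
    job_id: str,
    max_attempts: int = 60,
    max_transient_errors: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    delays = backoff_intervals()
    consecutive_errors = 0
    for _ in range(max_attempts):
        try:
            job = retrieve(job_id)
        except TransientAPIError:
            # Transient API error: retry, but give up after too many in a row.
            consecutive_errors += 1
            if consecutive_errors > max_transient_errors:
                raise
        else:
            consecutive_errors = 0
            # A failed job is returned, not retried: the caller surfaces its error.
            if job["status"] in TERMINAL_STATUSES:
                return job
        sleep(next(delays))
    raise TimeoutError(f"job {job_id} still not finished after {max_attempts} attempts")
```

Capping the delay keeps the worst-case detection latency bounded, while the growing interval reduces the number of calls counted against the provider's rate limit on long jobs.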
Pagination for Listing
The list operation supports pagination through two parameters:
- after: A cursor (job ID) indicating the position after which to return results. Used for forward pagination.
- limit: The maximum number of results to return per page (provider defaults typically around 20).
    FUNCTION list_all_jobs(provider):
        all_jobs = []
        cursor = None
        LOOP:
            page = list_fine_tuning_jobs(after=cursor, limit=20, provider=provider)
            all_jobs.EXTEND(page.data)
            IF page has no more results:
                BREAK
            cursor = last job ID in page.data
        RETURN all_jobs
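A Python version of the pagination loop might look like the generator below. It assumes a hypothetical `list_page` callable that returns a mapping with a `"data"` list and a `"has_more"` flag; that response shape is modeled on OpenAI-style list responses and may differ for other providers.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_all_jobs(
    list_page: Callable[..., Dict[str, Any]],  # hypothetical paged list wrapper
    limit: int = 20,
) -> Iterator[Dict[str, Any]]:
    """Yield every job by following the `after` cursor page by page."""
    cursor: Optional[str] = None
    while True:
        page = list_page(after=cursor, limit=limit)
        jobs = page["data"]
        yield from jobs
        # Stop when the provider reports no further pages, or on an empty page
        # (which guards against looping forever on a misbehaving backend).
        if not page.get("has_more") or not jobs:
            return
        cursor = jobs[-1]["id"]
```

A generator lets the caller stop early (for example, once a particular job is found) without fetching the remaining pages, whereas the pseudocode version collects everything before returning.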
Cancellation Semantics
Cancellation is an idempotent, best-effort operation. The provider attempts to stop the job, but some work may already have been performed. The returned job object reflects the updated status (typically "cancelled"). Cancelling a job that has already reached a terminal state is typically a no-op or returns the current state, though the exact response may vary by provider.
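One way to make cancellation safe to repeat on the client side is to check the job's state first and re-read it afterwards. The helper below is a sketch; `retrieve` and `cancel` are hypothetical callables wrapping the provider's retrieve and cancel operations, each taking a job ID.

```python
from typing import Any, Callable, Mapping

TERMINAL_STATUSES = frozenset({"succeeded", "failed", "cancelled"})


def cancel_if_active(
    retrieve: Callable[[str], Mapping[str, Any]],
    cancel: Callable[[str], Any],
    job_id: str,
) -> Mapping[str, Any]:
    """Cancel a job only if it is still active, and return its latest state."""
    job = retrieve(job_id)
    if job["status"] in TERMINAL_STATUSES:
        return job  # already finished: nothing to cancel
    cancel(job_id)
    # Re-read because cancellation is best-effort: the job may have
    # succeeded or failed before the request took effect.
    return retrieve(job_id)
```

Callers should inspect the returned status instead of assuming "cancelled", since a job that finished just before the request arrived may still report succeeded or failed.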