Principle:Togethercomputer Together python Batch Job Monitoring
| Attribute | Value |
|---|---|
| Type | Principle |
| Domains | Batch_Processing, Inference, API_Client |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Pattern for polling batch job status and managing running batch inference jobs.
Description
Batch job monitoring provides APIs to check the status of an individual job, list all batch jobs, and cancel running jobs. The BatchJob object tracks status transitions and progress, and exposes the output_file_id once the job completes.
Three monitoring operations are available:
- Get a specific batch job -- Retrieve the current state of a single batch job by its ID.
- List all batch jobs -- Retrieve all batch jobs associated with the account.
- Cancel a batch job -- Request cancellation of a running or queued batch job.
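The list and cancel operations can be combined into small helpers. The following is a minimal sketch, not the library's own code: `batches` stands for any object exposing the `get_batch`, `list_batches`, and `cancel_batch` methods named on this page, and `count_by_status`, `cancel_if_active`, and `ACTIVE_STATUSES` are hypothetical names introduced for illustration. Treating `VALIDATING` and `IN_PROGRESS` as the cancellable states is an assumption.

```python
from collections import Counter
from typing import Any

# Assumption: these are the states from which a cancel request makes sense.
ACTIVE_STATUSES = {"VALIDATING", "IN_PROGRESS"}


def _status_name(status: Any) -> str:
    """Normalize a status that may be a plain string or an Enum member."""
    return str(getattr(status, "value", status))


def count_by_status(batches: Any) -> Counter:
    """Tally jobs per status using batches.list_batches()."""
    return Counter(_status_name(job.status) for job in batches.list_batches())


def cancel_if_active(batches: Any, batch_job_id: str) -> bool:
    """Send a cancel request only if the job has not already finished.

    Returns True when a cancel request was sent, False otherwise.
    """
    job = batches.get_batch(batch_job_id)
    if _status_name(job.status) not in ACTIVE_STATUSES:
        return False
    batches.cancel_batch(batch_job_id)
    return True
```

Checking the status first avoids sending a cancel request for a job that is already in a terminal state, at the cost of one extra lookup.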
The status field on the BatchJob object follows a defined state machine:
VALIDATING -> IN_PROGRESS -> COMPLETED | FAILED | EXPIRED
Cancellation introduces an intermediate state: CANCELING -> CANCELLED.
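The state machine above can be written down as a transition table, which makes it easy to check whether a status is terminal. This is an illustrative sketch: `BatchStatus`, `TRANSITIONS`, `is_terminal`, and `can_transition` are hypothetical names, not the library's own types, and the assumption that cancellation may start from either `VALIDATING` or `IN_PROGRESS` is not stated in the source.

```python
from enum import Enum


class BatchStatus(str, Enum):
    """Status values named on this page (illustrative enum)."""
    VALIDATING = "VALIDATING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    EXPIRED = "EXPIRED"
    CANCELING = "CANCELING"
    CANCELLED = "CANCELLED"


# Transitions read off the state machine above.
# Assumption: a cancel request can arrive while validating or running.
TRANSITIONS = {
    BatchStatus.VALIDATING: {BatchStatus.IN_PROGRESS, BatchStatus.CANCELING},
    BatchStatus.IN_PROGRESS: {
        BatchStatus.COMPLETED,
        BatchStatus.FAILED,
        BatchStatus.EXPIRED,
        BatchStatus.CANCELING,
    },
    BatchStatus.CANCELING: {BatchStatus.CANCELLED},
}


def is_terminal(status: str) -> bool:
    """A status is terminal when it has no outgoing transitions."""
    return BatchStatus(status) not in TRANSITIONS


def can_transition(src: str, dst: str) -> bool:
    """Return True if the table allows moving from src to dst."""
    return BatchStatus(dst) in TRANSITIONS.get(BatchStatus(src), set())
```

Deriving terminality from the table keeps a single source of truth: `COMPLETED`, `FAILED`, `EXPIRED`, and `CANCELLED` are terminal because nothing leads out of them.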
Usage
Use this principle after creating a batch job to poll for completion. The typical monitoring workflow is:
- Call `Batches.get_batch(batch_job_id)` periodically to check the `status` field.
- Monitor the `progress` float (0.0 to 1.0) for incremental progress tracking.
- When `status` reaches `COMPLETED`, extract the `output_file_id` to download results.
- If `status` is `FAILED`, check the `error` field and optionally the `error_file_id` for details.
- Use `Batches.list_batches()` to get an overview of all batch jobs and their states.
- Use `Batches.cancel_batch(batch_job_id)` to abort a job that is no longer needed.
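The polling part of this workflow can be sketched as a small loop. This is a hedged example rather than the library's implementation: `wait_for_batch` and `result_file_id` are hypothetical helpers, `get_batch` is any callable that returns an object with the `status`, `output_file_id`, `error`, and `error_file_id` fields described above, and the default interval and timeout are arbitrary choices.

```python
import time
from typing import Any, Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}


def _status_name(status: Any) -> str:
    """Normalize a status that may be a plain string or an Enum member."""
    return str(getattr(status, "value", status))


def wait_for_batch(
    get_batch: Callable[[str], Any],
    batch_job_id: str,
    poll_interval: float = 30.0,      # assumption: seconds between polls
    timeout: float = 24 * 3600.0,     # assumption: give up after a day
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll get_batch(batch_job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_batch(batch_job_id)
        if _status_name(job.status) in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_job_id} did not finish in time")
        sleep(poll_interval)


def result_file_id(job: Any) -> str:
    """Return output_file_id for a completed job, else raise with error details."""
    status = _status_name(job.status)
    if status == "COMPLETED":
        return job.output_file_id
    raise RuntimeError(
        f"batch ended as {status}: error={getattr(job, 'error', None)!r}, "
        f"error_file_id={getattr(job, 'error_file_id', None)!r}"
    )
```

In practice the `get_batch` argument would be the `Batches.get_batch` method of a configured client. Passing it in as a callable, along with `sleep`, keeps the loop easy to test without network access.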
Theoretical Basis
Batch job monitoring implements the polling pattern for asynchronous job tracking. Since batch jobs may take hours to complete, the client must periodically query for status updates rather than maintaining a persistent connection.
Key monitoring considerations:
- Poll interval: Clients should poll at a modest interval (e.g., every 30-60 seconds), optionally backing off over time, to avoid excessive API calls.
- Terminal states: The states `COMPLETED`, `FAILED`, `EXPIRED`, and `CANCELLED` are terminal -- once reached, the job state will not change.
- Progress tracking: The `progress` field (0.0 to 1.0) provides a coarse estimate of completion percentage, useful for user-facing progress indicators.
- Output availability: The `output_file_id` is only populated when the job reaches `COMPLETED` status. Attempting to retrieve results before completion will yield no output file ID.
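Two of these considerations, the poll interval and progress display, lend themselves to small helpers. The sketch below is one possible approach: `poll_delays` and `format_progress` are hypothetical names, and the growth factor of 1.5 is an arbitrary assumption chosen so that delays stay within the 30-60 second range mentioned above.

```python
import itertools
from typing import Iterator


def poll_delays(
    initial: float = 30.0,
    factor: float = 1.5,     # assumption: growth factor between polls
    maximum: float = 60.0,
) -> Iterator[float]:
    """Yield an endless sequence of delays that grows geometrically up to a cap."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def format_progress(progress: float) -> str:
    """Render a 0.0-1.0 progress value as a whole-number percentage string."""
    clamped = max(0.0, min(1.0, progress))
    return f"{clamped:.0%}"


# First few delays with the defaults: 30s, 45s, then capped at 60s.
first_delays = list(itertools.islice(poll_delays(), 4))
```

Capping the delay bounds how stale the observed status can be, while the early shorter delays let quick jobs be noticed sooner.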