Principle:Togethercomputer Together python Batch Job Monitoring

From Leeroopedia
Attribute      Value
Type           Principle
Domains        Batch_Processing, Inference, API_Client
Repository     togethercomputer/together-python
Last Updated   2026-02-15 16:00 GMT

Overview

Pattern for polling batch job status and managing running batch inference jobs.

Description

Batch job monitoring provides APIs to check the status of an individual job, list all batch jobs, and cancel running jobs. The BatchJob object records the job's current status and progress, and exposes the output_file_id once the job completes.

Three monitoring operations are available:

  • Get a specific batch job -- Retrieve the current state of a single batch job by its ID.
  • List all batch jobs -- Retrieve all batch jobs associated with the account.
  • Cancel a batch job -- Request cancellation of a running or queued batch job.

The status field on the BatchJob object follows a defined state machine:

VALIDATING -> IN_PROGRESS -> COMPLETED | FAILED | EXPIRED

Cancellation introduces an intermediate state: CANCELING -> CANCELLED.
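
The state machine above can be written down as a small transition table. This is a minimal sketch, not the library's own type: the Status enum, the ALLOWED map, and the can_transition helper are hypothetical names. Only the transitions drawn in the diagram are encoded, and the assumption that cancellation can start from either VALIDATING or IN_PROGRESS is inferred from "running or queued" rather than confirmed by the source.

```python
from enum import Enum


class Status(str, Enum):
    """Status values named on this page (hypothetical enum, not the library's)."""
    VALIDATING = "VALIDATING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    EXPIRED = "EXPIRED"
    CANCELING = "CANCELING"
    CANCELLED = "CANCELLED"


# Transitions taken from the diagram. The CANCELING edges out of VALIDATING
# and IN_PROGRESS are an assumption; the real service may allow other edges.
ALLOWED = {
    Status.VALIDATING: {Status.IN_PROGRESS, Status.CANCELING},
    Status.IN_PROGRESS: {
        Status.COMPLETED,
        Status.FAILED,
        Status.EXPIRED,
        Status.CANCELING,
    },
    Status.CANCELING: {Status.CANCELLED},
}

# A state with no outgoing transition is terminal.
TERMINAL = frozenset(s for s in Status if s not in ALLOWED)


def can_transition(current: Status, nxt: Status) -> bool:
    """Return True if the table above permits moving from current to nxt."""
    return nxt in ALLOWED.get(current, set())
```

Deriving TERMINAL from the table, rather than listing it separately, keeps the two from drifting apart if a state is added later.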

Usage

Use this principle after creating a batch job to poll for completion. The typical monitoring workflow is:

  • Call Batches.get_batch(batch_job_id) periodically to check the status field.
  • Monitor the progress float (0.0 to 1.0) for incremental progress tracking.
  • When status reaches COMPLETED, extract the output_file_id to download results.
  • If status is FAILED, check the error field and optionally the error_file_id for details.
  • Use Batches.list_batches() to get an overview of all batch jobs and their states.
  • Use Batches.cancel_batch(batch_job_id) to abort a job that is no longer needed.
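
The polling part of this workflow can be sketched as a single loop. The sketch below is illustrative and makes several assumptions: wait_for_batch, status_name, and the on_progress callback are hypothetical names, and the function takes get_batch as a plain callable instead of importing the library, so it relies only on the fields named above (status, progress, output_file_id, error, error_file_id). It polls at a fixed interval for simplicity.

```python
import time

# Statuses after which polling should stop (taken from the states on this page).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}


def status_name(job):
    """Return the job status as a plain string, whether it is an Enum or a str."""
    status = job.status
    return getattr(status, "value", status)


def wait_for_batch(get_batch, batch_job_id, interval=30.0,
                   sleep=time.sleep, on_progress=None):
    """Poll get_batch(batch_job_id) until the job reaches a terminal status.

    Returns output_file_id on COMPLETED; raises RuntimeError on any other
    terminal status, including the error details if the job carries them.
    """
    while True:
        job = get_batch(batch_job_id)
        status = status_name(job)
        if on_progress is not None:
            on_progress(status, getattr(job, "progress", None))
        if status == "COMPLETED":
            return job.output_file_id
        if status in TERMINAL_STATUSES:
            raise RuntimeError(
                f"batch {batch_job_id} ended as {status}: "
                f"error={getattr(job, 'error', None)!r}, "
                f"error_file_id={getattr(job, 'error_file_id', None)!r}"
            )
        sleep(interval)
```

In real use, get_batch would be the bound Batches.get_batch method named above. The sleep parameter is injectable so the loop can be exercised in tests without actually waiting.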

Theoretical Basis

Batch job monitoring implements the polling pattern for asynchronous job tracking. Since batch jobs may take hours to complete, the client must periodically query for status updates rather than maintaining a persistent connection.

Key monitoring considerations:

  • Poll interval: Clients should space out status checks (e.g., polling every 30-60 seconds, optionally lengthening the interval for long-running jobs) to avoid excessive API calls.
  • Terminal states: The states COMPLETED, FAILED, EXPIRED, and CANCELLED are terminal -- once reached, the job state will not change.
  • Progress tracking: The progress field (0.0 to 1.0) provides a coarse estimate of how much of the job is complete, useful for user-facing progress indicators.
  • Output availability: The output_file_id is only populated when the job reaches COMPLETED status. Attempting to retrieve results before completion will yield no output file ID.
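
The poll-interval and terminal-state considerations can be combined into a backoff schedule with an overall time budget. This is a sketch under stated assumptions: backoff_delays and poll_until_terminal are hypothetical helpers, the growth factor of 1.5 and the four-hour budget are arbitrary illustrative values, and fetch_status stands in for whatever call returns the job's current status string.

```python
import time

# Terminal states as listed above: once reached, the status will not change.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "EXPIRED", "CANCELLED"}


def backoff_delays(initial=30.0, factor=1.5, maximum=60.0):
    """Yield delays in seconds that grow geometrically, capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def poll_until_terminal(fetch_status, delays, max_wait=4 * 3600.0,
                        sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status.

    Sleeps for the next value from delays between calls and raises
    TimeoutError once the next sleep would exceed the max_wait budget.
    """
    waited = 0.0
    for delay in delays:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited + delay > max_wait:
            raise TimeoutError(f"no terminal status after {waited:.0f}s")
        sleep(delay)
        waited += delay
    # Only reached when a finite delay sequence runs out.
    raise TimeoutError(f"delay schedule exhausted after {waited:.0f}s")
```

With the defaults, the schedule stays inside the 30-60 second range suggested above: it starts at 30 seconds and settles at 60. A time budget matters because a job stuck in a non-terminal state would otherwise keep the client polling indefinitely.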
