Implementation:Togethercomputer Together python FineTuning Monitoring

From Leeroopedia
| Attribute | Value |
| --- | --- |
| Implementation Name | FineTuning_Monitoring |
| Type | API Methods (multiple) |
| Source | src/together/resources/finetune.py:L640-775 |
| Domain | MLOps, Fine_Tuning |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |

API Signatures

FineTuning.retrieve

def retrieve(self, id: str) -> FinetuneResponse:

Retrieves fine-tune job details.

FineTuning.list_events

def list_events(self, id: str) -> FinetuneListEvents:

Lists training events (metrics per step) for a fine-tune job.

FineTuning.cancel

def cancel(self, id: str) -> FinetuneResponse:

Cancels a running fine-tuning job.

FineTuning.list_checkpoints

def list_checkpoints(self, id: str) -> List[FinetuneCheckpoint]:

Lists available checkpoints for a fine-tuning job.

FineTuning.list

def list(self) -> FinetuneList:

Lists the history of all fine-tuning jobs.

FineTuning.delete

def delete(self, id: str, force: bool = False) -> FinetuneDeleteResponse:

Deletes a fine-tuning job record.

Import

from together import Together

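# With no arguments, the client reads the API key from the
# TOGETHER_API_KEY environment variable.
client = Together()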

# Retrieve job details
job = client.fine_tuning.retrieve(id="ft-...")

# List training events
events = client.fine_tuning.list_events(id="ft-...")

# Cancel a job
cancelled_job = client.fine_tuning.cancel(id="ft-...")

# List checkpoints
checkpoints = client.fine_tuning.list_checkpoints(id="ft-...")

# List all jobs
all_jobs = client.fine_tuning.list()

# Delete a job
client.fine_tuning.delete(id="ft-...", force=False)

I/O Contract

FineTuning.retrieve

| Parameter | Type | Description |
| --- | --- | --- |
| id | str | Fine-tune job ID (starts with "ft-"). |

Returns: FinetuneResponse -- Contains job configuration and status fields including id, status, model, training_file, output_name, training_type, and hyperparameter details.

FineTuning.list_events

| Parameter | Type | Description |
| --- | --- | --- |
| id | str | Fine-tune job ID (starts with "ft-"). |

Returns: FinetuneListEvents -- Contains a list of FinetuneEvent objects. Each event typically includes step number, training loss, learning rate, and other per-step metrics.

FineTuning.cancel

| Parameter | Type | Description |
| --- | --- | --- |
| id | str | Fine-tune job ID (starts with "ft-"). |

Returns: FinetuneResponse -- The updated job details reflecting the cancelled status.

FineTuning.list_checkpoints

| Parameter | Type | Description |
| --- | --- | --- |
| id | str | Fine-tune job ID. |

Returns: List[FinetuneCheckpoint] -- Each checkpoint has:

  • type (str) -- Checkpoint type (e.g., "Intermediate", "Final").
  • timestamp (str) -- Creation timestamp.
  • name (str) -- Checkpoint identifier. For intermediate checkpoints: "ft-id:step". For the final checkpoint: "ft-id".

Checkpoints are sorted by timestamp in descending order (most recent first).

FineTuning.delete

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| id | str | (required) | Fine-tune job ID. |
| force | bool | False | Force deletion. |

Returns: FinetuneDeleteResponse -- Deletion confirmation message.

Code Reference

retrieve (L640-665)

def retrieve(self, id: str) -> FinetuneResponse:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneResponse(**response.data)

list_events (L725-749)

def list_events(self, id: str) -> FinetuneListEvents:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}/events"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneListEvents(**response.data)

cancel (L667-692)

def cancel(self, id: str) -> FinetuneResponse:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="POST", url=f"fine-tunes/{id}/cancel"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneResponse(**response.data)

list_checkpoints (L751-775)

def list_checkpoints(self, id: str) -> List[FinetuneCheckpoint]:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}/checkpoints"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    raw_checkpoints = response.data["data"]
    return _parse_raw_checkpoints(raw_checkpoints, id)

The _parse_raw_checkpoints() helper (L297-328) processes raw checkpoint metadata, formatting the name field as "ft-id:step" for intermediate checkpoints and "ft-id" for the final checkpoint, and sorts by timestamp in descending order.
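The naming and sorting rules described above can be sketched as follows. This is a minimal illustration, not the library's source: the `Checkpoint` dataclass is a stand-in for FinetuneCheckpoint, and the raw keys `checkpoint_type`, `step`, and `created_at` are assumptions about the API payload.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Checkpoint:
    """Stand-in for FinetuneCheckpoint with the three documented fields."""
    type: str
    timestamp: str
    name: str


def parse_raw_checkpoints(raw: List[Dict[str, str]], job_id: str) -> List[Checkpoint]:
    """Build checkpoint names and sort the result most recent first."""
    parsed = []
    for item in raw:
        # Assumed raw keys: "checkpoint_type", "step", "created_at".
        ctype = item["checkpoint_type"]
        is_intermediate = "intermediate" in ctype.lower()
        # Intermediate checkpoints get "ft-id:step"; the final one is just "ft-id".
        name = f"{job_id}:{item['step']}" if is_intermediate else job_id
        parsed.append(Checkpoint(type=ctype, timestamp=item["created_at"], name=name))
    # ISO-8601 timestamps in one timezone sort chronologically as plain strings.
    parsed.sort(key=lambda cp: cp.timestamp, reverse=True)
    return parsed


# Made-up payload for illustration only.
raw = [
    {"checkpoint_type": "Intermediate", "step": "10", "created_at": "2025-01-01T10:00:00Z"},
    {"checkpoint_type": "Final", "step": "20", "created_at": "2025-01-01T11:00:00Z"},
]
checkpoints = parse_raw_checkpoints(raw, "ft-example")
print([cp.name for cp in checkpoints])
```

Because the list comes back already sorted, callers can take the first element as the newest checkpoint without sorting again.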

Usage Examples

Polling for Job Completion

import time
from together import Together

client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"

while True:
    job = client.fine_tuning.retrieve(job_id)
    print(f"Status: {job.status}")

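    # Stop on any terminal status. Depending on the SDK version, failures may
    # be reported as "error" or "user_error" rather than "failed", so check
    # all of them to avoid polling forever.
    if job.status in ("completed", "error", "user_error", "failed", "cancelled"):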
        break

    time.sleep(60)  # Poll every 60 seconds

if job.status == "completed":
    print(f"Model ready: {job.output_name}")
else:
    print(f"Job ended with status: {job.status}")
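The loop above never gives up if a job stalls. One possible refinement is a helper with an overall timeout, sketched below. The name `wait_for_job`, its parameters, and the `TERMINAL_STATUSES` tuple are hypothetical and not part of the SDK; the demo uses a stand-in for `client.fine_tuning.retrieve` so it runs without network access.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Iterable

# Assumed terminal status strings; verify them against your SDK version.
TERMINAL_STATUSES = ("completed", "error", "user_error", "failed", "cancelled")


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 60 * 60,
    terminal: Iterable[str] = TERMINAL_STATUSES,
) -> Any:
    """Poll retrieve(job_id) until the status is terminal or the timeout passes."""
    terminal = tuple(terminal)
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = retrieve(job_id)
        # Works for plain strings and for enum-like statuses exposing .value.
        status = getattr(job.status, "value", job.status)
        if status in terminal:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{job_id} still '{status}' after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Demo with a fake retrieve function that walks through three statuses.
_statuses = iter(["pending", "running", "completed"])


def fake_retrieve(job_id: str) -> SimpleNamespace:
    return SimpleNamespace(id=job_id, status=next(_statuses))


finished = wait_for_job(fake_retrieve, "ft-example", poll_seconds=0)
print(finished.status)
```

With a real client, the call would look like `wait_for_job(client.fine_tuning.retrieve, job_id)`. Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock changes.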

Monitoring Training Loss

from together import Together

client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"

events = client.fine_tuning.list_events(job_id)
for event in events.data or []:  # guard against a missing event list
    print(event)
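Printing whole events is a good way to discover which fields they carry. Once the field names are known, a per-step series can be pulled out for plotting or threshold checks. The sketch below assumes each metric-bearing event exposes `step` and `loss` attributes; both names are assumptions, so inspect the printed events and adjust `field` accordingly. The demo uses stand-in objects rather than real API responses.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple


def metric_series(events: Iterable[Any], field: str = "loss") -> List[Tuple[int, float]]:
    """Collect (step, value) pairs from events that carry both attributes."""
    series = []
    for event in events:
        step = getattr(event, "step", None)
        value = getattr(event, field, None)
        if step is None or value is None:
            continue  # skip events without a metric, such as log-only messages
        series.append((int(step), float(value)))
    return sorted(series)  # ordered by step


# Stand-in events; a real call would pass events.data instead.
demo_events = [
    SimpleNamespace(step=2, loss=1.5),
    SimpleNamespace(step=1, loss=2.0),
    SimpleNamespace(message="Job started"),
]
series = metric_series(demo_events)
print(series)
```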

Listing Checkpoints and Choosing One

from together import Together

client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"

checkpoints = client.fine_tuning.list_checkpoints(job_id)
for cp in checkpoints:
    print(f"  Type: {cp.type}, Name: {cp.name}, Time: {cp.timestamp}")

# Use a specific checkpoint for download
if checkpoints:
    selected = checkpoints[0]  # Most recent
    print(f"Selected checkpoint: {selected.name}")
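Taking `checkpoints[0]` always yields the most recent checkpoint, which is usually the final one once a job completes. To pick, say, the latest intermediate checkpoint and recover its step number, something like the following could be used. Both helper names are hypothetical, and the demo objects are stand-ins with the documented `type`, `timestamp`, and `name` fields.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence, Tuple


def latest_of_type(checkpoints: Sequence[Any], wanted: str) -> Optional[Any]:
    """Return the first checkpoint whose type contains `wanted`, ignoring case.

    Relies on the list already being sorted most recent first.
    """
    for cp in checkpoints:
        if wanted.lower() in cp.type.lower():
            return cp
    return None


def split_name(name: str) -> Tuple[str, Optional[int]]:
    """Split "ft-id:step" into (job_id, step); a final checkpoint has no step."""
    job_id, sep, step = name.partition(":")
    return job_id, (int(step) if sep else None)


# Stand-in list, ordered most recent first as list_checkpoints returns it.
demo_checkpoints = [
    SimpleNamespace(type="Final", timestamp="2025-01-01T12:00:00Z", name="ft-example"),
    SimpleNamespace(type="Intermediate", timestamp="2025-01-01T11:00:00Z", name="ft-example:200"),
    SimpleNamespace(type="Intermediate", timestamp="2025-01-01T10:00:00Z", name="ft-example:100"),
]
picked = latest_of_type(demo_checkpoints, "intermediate")
print(picked.name, split_name(picked.name))
```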

Cancelling a Diverging Job

from together import Together

client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"

# Cancel the job
result = client.fine_tuning.cancel(job_id)
print(f"Cancelled job: {result.id}, status: {result.status}")
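The example above cancels unconditionally; deciding that a job is diverging is left to the caller. One simple heuristic, sketched here under the assumption that a list of recent training losses is already available, compares the mean of the latest window with the window before it. The function name, window size, and factor are illustrative choices, not SDK features.

```python
import math
from statistics import mean
from typing import Sequence


def is_diverging(losses: Sequence[float], window: int = 5, factor: float = 1.5) -> bool:
    """Flag divergence when recent losses are non-finite or have jumped.

    A jump means the mean of the last `window` losses exceeds `factor`
    times the mean of the `window` losses before them.
    """
    if any(not math.isfinite(x) for x in losses[-window:]):
        return True  # NaN or inf loss is treated as divergence
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = mean(losses[-window:])
    previous = mean(losses[-2 * window:-window])
    return recent > factor * previous


healthy = [2.0, 1.8, 1.6, 1.5, 1.4, 1.3]
blown_up = [1.0, 1.0, 1.0, 3.0, 4.0, 5.0]
print(is_diverging(healthy, window=3), is_diverging(blown_up, window=3))
```

A caller could then run `client.fine_tuning.cancel(job_id)` only when `is_diverging(...)` returns True.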

Listing All Jobs

from together import Together

client = Together()

all_jobs = client.fine_tuning.list()
for job in all_jobs.data or []:  # guard against a missing job list
    print(f"Job: {job.id}, Model: {job.model}, Status: {job.status}")
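When the job history is long, a per-status summary is often more useful than one line per job. A minimal sketch, using stand-in job objects in place of `all_jobs.data` and a hypothetical helper name:

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def status_counts(jobs: Iterable[Any]) -> Counter:
    """Count jobs per status; handles plain strings and enum-like statuses."""
    return Counter(getattr(job.status, "value", job.status) for job in jobs)


# Stand-in jobs; a real call would pass all_jobs.data or [] instead.
demo_jobs = [
    SimpleNamespace(id="ft-a", status="completed"),
    SimpleNamespace(id="ft-b", status="running"),
    SimpleNamespace(id="ft-c", status="completed"),
]
counts = status_counts(demo_jobs)
for status, n in counts.most_common():
    print(f"{status}: {n}")
```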
