Implementation:Togethercomputer Together python FineTuning Monitoring
| Attribute | Value |
|---|---|
| Implementation Name | FineTuning_Monitoring |
| Type | API Methods (multiple) |
| Source | src/together/resources/finetune.py:L640-775 |
| Domain | MLOps, Fine_Tuning |
| Repository | togethercomputer/together-python |
| Last Updated | 2026-02-15 16:00 GMT |
API Signatures
FineTuning.retrieve
def retrieve(self, id: str) -> FinetuneResponse:
Retrieves fine-tune job details.
FineTuning.list_events
def list_events(self, id: str) -> FinetuneListEvents:
Lists training events (metrics per step) for a fine-tune job.
FineTuning.cancel
def cancel(self, id: str) -> FinetuneResponse:
Cancels a running fine-tuning job.
FineTuning.list_checkpoints
def list_checkpoints(self, id: str) -> List[FinetuneCheckpoint]:
Lists available checkpoints for a fine-tuning job.
FineTuning.list
def list(self) -> FinetuneList:
Lists the history of all fine-tuning jobs.
FineTuning.delete
def delete(self, id: str, force: bool = False) -> FinetuneDeleteResponse:
Deletes a fine-tuning job record.
Import
from together import Together
client = Together()
# Retrieve job details
job = client.fine_tuning.retrieve(id="ft-...")
# List training events
events = client.fine_tuning.list_events(id="ft-...")
# Cancel a job
cancelled_job = client.fine_tuning.cancel(id="ft-...")
# List checkpoints
checkpoints = client.fine_tuning.list_checkpoints(id="ft-...")
# List all jobs
all_jobs = client.fine_tuning.list()
# Delete a job
client.fine_tuning.delete(id="ft-...", force=False)
I/O Contract
FineTuning.retrieve
| Parameter | Type | Description |
|---|---|---|
| id | str | Fine-tune job ID (starts with "ft-"). |
Returns: FinetuneResponse -- Contains job configuration and status fields including id, status, model, training_file, output_name, training_type, and hyperparameter details.
FineTuning.list_events
| Parameter | Type | Description |
|---|---|---|
| id | str | Fine-tune job ID (starts with "ft-"). |
Returns: FinetuneListEvents -- Contains a list of FinetuneEvent objects. Each event typically includes step number, training loss, learning rate, and other per-step metrics.
FineTuning.cancel
| Parameter | Type | Description |
|---|---|---|
| id | str | Fine-tune job ID (starts with "ft-"). |
Returns: FinetuneResponse -- The updated job details reflecting the cancelled status.
FineTuning.list_checkpoints
| Parameter | Type | Description |
|---|---|---|
| id | str | Fine-tune job ID. |
Returns: List[FinetuneCheckpoint] -- Each checkpoint has:
- type (str) -- Checkpoint type (e.g., "Intermediate", "Final").
- timestamp (str) -- Creation timestamp.
- name (str) -- Checkpoint identifier. For intermediate checkpoints: "ft-id:step". For the final checkpoint: "ft-id".
Checkpoints are sorted by timestamp in descending order (most recent first).
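The naming convention above can be reversed to recover the training step from a checkpoint name. The following is a minimal sketch, not part of the library: `CheckpointRef` and `parse_checkpoint_name` are hypothetical names introduced here, and the sketch assumes the step after the colon is always an integer.

```python
from typing import NamedTuple, Optional


class CheckpointRef(NamedTuple):
    """Hypothetical container for the parts of a checkpoint name."""
    job_id: str
    step: Optional[int]  # None means the final checkpoint


def parse_checkpoint_name(name: str) -> CheckpointRef:
    """Split "ft-id:step" into its parts; a bare "ft-id" is the final checkpoint."""
    job_id, sep, step = name.partition(":")
    if not sep:
        # No colon: the name is just the job ID, i.e. the final checkpoint.
        return CheckpointRef(job_id, None)
    # Assumption: the suffix after the colon is an integer step number.
    return CheckpointRef(job_id, int(step))


print(parse_checkpoint_name("ft-example:20"))
print(parse_checkpoint_name("ft-example"))
```

A caller could use the `step` field to pick, for example, the intermediate checkpoint closest to a known good training step.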
FineTuning.delete
| Parameter | Type | Default | Description |
|---|---|---|---|
| id | str | (required) | Fine-tune job ID. |
| force | bool | False | Force deletion. |
Returns: FinetuneDeleteResponse -- Deletion confirmation message.
Code Reference
retrieve (L640-665)
def retrieve(self, id: str) -> FinetuneResponse:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneResponse(**response.data)
list_events (L725-749)
def list_events(self, id: str) -> FinetuneListEvents:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}/events"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneListEvents(**response.data)
cancel (L667-692)
def cancel(self, id: str) -> FinetuneResponse:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="POST", url=f"fine-tunes/{id}/cancel"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    return FinetuneResponse(**response.data)
list_checkpoints (L751-775)
def list_checkpoints(self, id: str) -> List[FinetuneCheckpoint]:
    requestor = api_requestor.APIRequestor(client=self._client)
    response, _, _ = requestor.request(
        options=TogetherRequest(method="GET", url=f"fine-tunes/{id}/checkpoints"),
        stream=False,
    )
    assert isinstance(response, TogetherResponse)
    raw_checkpoints = response.data["data"]
    return _parse_raw_checkpoints(raw_checkpoints, id)
The _parse_raw_checkpoints() helper (L297-328) processes raw checkpoint metadata, formatting the name field as "ft-id:step" for intermediate checkpoints and "ft-id" for the final checkpoint, and sorts by timestamp in descending order.
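The behaviour described for the helper can be illustrated with a standalone sketch. This is not the library's implementation: the raw dictionary keys ("type", "timestamp", "step"), the `Checkpoint` dataclass, and the function name are all assumptions made for illustration, and the sort relies on timestamps being ISO 8601 strings in a single timezone so that string order matches chronological order.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Checkpoint:
    """Stand-in for FinetuneCheckpoint with the three documented fields."""
    type: str
    timestamp: str
    name: str


def parse_checkpoints_sketch(raw: List[Dict[str, Any]], job_id: str) -> List[Checkpoint]:
    parsed = []
    for item in raw:
        # Hypothetical raw keys; the real payload may be shaped differently.
        is_intermediate = item["type"].lower() == "intermediate"
        # Intermediate checkpoints get "ft-id:step"; the final one is just "ft-id".
        name = f"{job_id}:{item['step']}" if is_intermediate else job_id
        parsed.append(Checkpoint(item["type"], item["timestamp"], name))
    # Most recent first, assuming same-timezone ISO 8601 timestamp strings.
    parsed.sort(key=lambda c: c.timestamp, reverse=True)
    return parsed


raw = [
    {"type": "Intermediate", "timestamp": "2025-01-01T10:00:00Z", "step": 10},
    {"type": "Final", "timestamp": "2025-01-01T12:00:00Z", "step": 30},
    {"type": "Intermediate", "timestamp": "2025-01-01T11:00:00Z", "step": 20},
]
for cp in parse_checkpoints_sketch(raw, "ft-example"):
    print(cp.name, cp.timestamp)
```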
Usage Examples
Polling for Job Completion
import time
from together import Together
client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"
while True:
    job = client.fine_tuning.retrieve(job_id)
    print(f"Status: {job.status}")
    # Stop on any terminal state; failures are reported as "error" or "user_error"
    if job.status in ("completed", "error", "user_error", "cancelled"):
        break
    time.sleep(60)  # Poll every 60 seconds
if job.status == "completed":
    print(f"Model ready: {job.output_name}")
else:
    print(f"Job ended with status: {job.status}")
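An unbounded polling loop can hang indefinitely if a job never reaches a terminal state. The sketch below wraps the same polling idea with a timeout. `wait_for_job` is a hypothetical helper, not part of the together library; it accepts any callable shaped like `client.fine_tuning.retrieve`, and the set of terminal statuses is left to the caller. The demonstration uses a stand-in object so it runs without network access.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Collection


def wait_for_job(
    retrieve: Callable[[str], Any],
    job_id: str,
    terminal: Collection[str],
    interval: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll retrieve(job_id) until job.status is in `terminal` or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = retrieve(job_id)
        if job.status in terminal:
            return job
        # Give up if the next poll would land past the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"{job_id} still {job.status!r} after {timeout} seconds")
        sleep(interval)


# Offline demonstration with a stand-in for client.fine_tuning.retrieve.
_statuses = iter(["running", "running", "completed"])


def fake_retrieve(_job_id: str) -> SimpleNamespace:
    return SimpleNamespace(status=next(_statuses))


final = wait_for_job(
    fake_retrieve, "ft-example", terminal={"completed"}, interval=0, sleep=lambda s: None
)
print(final.status)
```

With a real client, `client.fine_tuning.retrieve` would be passed as the first argument, together with whichever terminal status values apply.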
Monitoring Training Loss
from together import Together
client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"
events = client.fine_tuning.list_events(job_id)
for event in events.data:
    print(event)
Listing Checkpoints and Choosing One
from together import Together
client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"
checkpoints = client.fine_tuning.list_checkpoints(job_id)
for cp in checkpoints:
    print(f" Type: {cp.type}, Name: {cp.name}, Time: {cp.timestamp}")
# Use a specific checkpoint for download
if checkpoints:
    selected = checkpoints[0]  # Most recent
    print(f"Selected checkpoint: {selected.name}")
Cancelling a Diverging Job
from together import Together
client = Together()
job_id = "ft-12345678-abcd-1234-efgh-123456789012"
# Cancel the job
result = client.fine_tuning.cancel(job_id)
print(f"Cancelled job: {result.id}, status: {result.status}")
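Deciding that a job is diverging requires some rule applied to its loss history. One simple heuristic, offered here only as an illustration, compares the average of the most recent losses with the average of the window before it. `is_diverging` and its `window` and `factor` parameters are hypothetical, and the sketch takes a plain list of floats because the source does not show how loss values are extracted from events.

```python
import math
from statistics import mean
from typing import Sequence


def is_diverging(losses: Sequence[float], window: int = 5, factor: float = 1.5) -> bool:
    """Heuristic: True if the last loss is NaN, or if the mean of the last
    `window` losses exceeds `factor` times the mean of the preceding window."""
    if losses and math.isnan(losses[-1]):
        return True  # a NaN loss is treated as divergence outright
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = mean(losses[-window:])
    previous = mean(losses[-2 * window:-window])
    return recent > factor * previous


print(is_diverging([1.0] * 5 + [2.0] * 5))
print(is_diverging([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.9, 0.8, 0.7, 0.6]))
```

When such a check returns True, the `cancel` call shown above could be issued. The thresholds are arbitrary and would need tuning per model and dataset.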
Listing All Jobs
from together import Together
client = Together()
all_jobs = client.fine_tuning.list()
for job in all_jobs.data:
    print(f"Job: {job.id}, Model: {job.model}, Status: {job.status}")
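When the job history is long, a per-status tally is often more useful than a full listing. The sketch below counts jobs by status. `count_by_status` is a hypothetical helper, and the demonstration uses stand-in objects with only a `status` attribute rather than real FinetuneResponse instances.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_by_status(jobs: Iterable[Any]) -> Counter:
    """Tally jobs by their status attribute."""
    return Counter(job.status for job in jobs)


# Stand-ins for the items in all_jobs.data.
sample_jobs = [
    SimpleNamespace(id="ft-a", status="completed"),
    SimpleNamespace(id="ft-b", status="running"),
    SimpleNamespace(id="ft-c", status="completed"),
]
summary = count_by_status(sample_jobs)
for status, count in summary.most_common():
    print(f"{status}: {count}")
```

With a real client, the argument would be `client.fine_tuning.list().data`.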