Workflow:Togethercomputer Together python Fine Tuning

Knowledge Sources	Together Python Together API Docs Fine-Tuning Guide
Domains	LLMs, Fine_Tuning, Data_Engineering
Last Updated	2026-02-15 16:00 GMT

Overview

End-to-end process for fine-tuning language models on Together AI, from dataset validation and upload through job creation, monitoring, and model artifact download.

Description

This workflow covers the complete fine-tuning lifecycle using the Together Python SDK. It begins with local dataset validation to ensure data format correctness before uploading, proceeds through file upload (with automatic single vs. multipart routing for large files), job creation with extensive hyperparameter configuration (including LoRA, learning rate scheduling, and training method selection), job monitoring via event polling, and finally downloading the resulting model weights or checkpoints. The SDK supports SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RPO training methods.

Usage

Execute this workflow when you have a domain-specific dataset in JSONL, Parquet, or CSV format and need to adapt a base model hosted on Together AI. This applies to instruction tuning, preference optimization, domain adaptation, or any scenario requiring custom model behavior beyond what prompting achieves.

Execution Steps

Step 1: Dataset Preparation

Prepare training data in one of the supported formats: JSONL (general, conversation, instruction, or preference style), Parquet, or CSV. Each format has specific schema requirements that the SDK validates. For pre-tokenized data, the separate tokenize_data example script can convert HuggingFace datasets to Parquet with configurable sequence packing.

Key considerations:

JSONL supports four sub-formats: general (text field), conversation (messages array), instruction (prompt/completion), and preference (chosen/rejected pairs for DPO)
The conversation format follows the OpenAI messages schema with role and content fields
Parquet files can contain pre-tokenized input_ids for maximum control
Multimodal datasets with image URLs are validated for accessibility

Step 2: Dataset Validation

Run the SDK's built-in file validator against the training data before uploading. The validator performs deep content checks including schema validation, field presence, conversation structure, special token detection, and optionally image URL reachability for multimodal datasets.

Key considerations:

Validation runs locally before any upload to catch errors early
The check_file utility inspects up to the first N samples for format compliance
Warnings are issued for potential problems (e.g., missing system messages) without blocking
Validation can be bypassed with check=False on upload, but this is not recommended

Step 3: File Upload

Upload the validated dataset file to Together's storage. The SDK automatically routes to single-request upload for smaller files or multipart concurrent upload for files exceeding the size threshold (default 5GB). The upload returns a file ID used in subsequent job creation.

Key considerations:

Files under the threshold use a single pre-signed URL upload
Large files are split into parts and uploaded concurrently via MultipartUploadManager
The purpose parameter defaults to "fine-tune" but can be set to "batch-api" for batch inference
Upload progress is tracked internally with file locking to prevent concurrent access issues

Step 4: Job Creation

Create a fine-tuning job by specifying the uploaded file ID, base model, and training hyperparameters. The SDK builds a comprehensive request model covering epochs, batch size, learning rate, LoRA configuration, scheduler settings, and training method (SFT, DPO, RPO).

Key considerations:

The model parameter selects the base model to fine-tune
LoRA is enabled by default with configurable rank, alpha, and dropout
batch_size="max" lets the platform auto-select the largest feasible batch size
Training method can be SFT (default), DPO (with beta parameter), RPO (with alpha), or SimPO (with gamma)
WandB integration is available for experiment tracking
from_checkpoint allows resuming from a previous fine-tuning run
from_hf_model supports starting from a HuggingFace model

Step 5: Job Monitoring

Poll the fine-tuning job status and inspect training events. The SDK provides methods to retrieve job metadata, list training events (including loss values and checkpoints), and cancel running jobs if needed.

Key considerations:

Use retrieve() to check overall job status (pending, running, completed, failed, cancelled)
Use list_events() to get detailed training progress including loss curves
Jobs can be cancelled mid-training with cancel()
n_checkpoints controls how many intermediate checkpoints are saved during training

Step 6: Model Download

Download the fine-tuned model weights or a specific checkpoint to local disk. The SDK handles streaming the compressed model artifact and writing it to the specified output path.

Key considerations:

The default download retrieves the final model weights
Specific checkpoints can be downloaded by step number
checkpoint_type controls whether to download the default, merged, or adapter-only weights
The downloaded artifact is a compressed file that needs to be extracted for use

Execution Diagram

GitHub URL

Workflow Repository