
Principle: togethercomputer/together-python Fine-Tuning Job Creation

Principle Name: Fine_Tuning_Job_Creation
Overview: Mechanism for initiating a fine-tuning job on Together AI's infrastructure using uploaded training data.
Domain: MLOps, Fine_Tuning
Repository: togethercomputer/together-python
Last Updated: 2026-02-15 16:00 GMT

Description

Fine-tuning job creation submits a training configuration to Together AI's servers, instructing the platform to train a model on the specified dataset. The configuration encompasses multiple dimensions:

Base Model and Starting Point

A job must specify either a base model (via model) or a previous checkpoint (via from_checkpoint) to continue training from, but not both. Additionally, starting weights can be loaded from a Hugging Face Hub model (via from_hf_model), which requires that a compatible base model also be provided via model.
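
A minimal sketch of the two mutually exclusive starting points, using the together Python SDK; the file and checkpoint IDs below are hypothetical placeholders:

    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # Option 1: start from a base model hosted by Together AI.
    job = client.fine_tuning.create(
        training_file="file-abc123",  # ID returned by client.files.upload()
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
    )

    # Option 2: continue from an earlier fine-tuning checkpoint
    # (mutually exclusive with `model`).
    job = client.fine_tuning.create(
        training_file="file-abc123",
        from_checkpoint="ft-1234abcd",  # hypothetical checkpoint ID
    )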

Training Method

Two training methods are supported:

  • SFT (Supervised Fine-Tuning, default) -- Standard supervised learning on input-output pairs. Supports the train_on_inputs parameter, which controls whether user messages/prompts are masked during loss computation. When set to "auto", masking is determined by the data format: conversational and instruction formats mask inputs, while general text format trains on all tokens.
  • DPO (Direct Preference Optimization) -- Preference-based training using preferred and non-preferred outputs. Supports additional parameters: dpo_beta (regularization strength), dpo_normalize_logratios_by_length (length normalization), rpo_alpha (NLL loss inclusion), and simpo_gamma (SimPO variant, which enables reference-free training and length normalization).
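
Both methods go through the same create() call. The sketch below assumes a training_method selector for switching to DPO, since this page names only the methods themselves; the parameter values are illustrative:

    # SFT (default): input masking is decided by the data format.
    job = client.fine_tuning.create(
        training_file="file-abc123",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        train_on_inputs="auto",
    )

    # DPO on a preference-formatted dataset; dpo_beta controls the
    # regularization strength against the reference model.
    job = client.fine_tuning.create(
        training_file="file-prefs456",  # hypothetical preference dataset ID
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        training_method="dpo",  # assumed selector; not named on this page
        dpo_beta=0.1,
    )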

Hyperparameters

The job configuration includes:

  • Training schedule: n_epochs, batch_size (integer or "max"), learning_rate, warmup_ratio.
  • Learning rate scheduler: lr_scheduler_type ("cosine" or "linear"), min_lr_ratio, scheduler_num_cycles (for cosine).
  • Regularization: max_grad_norm, weight_decay.
  • Checkpointing: n_checkpoints.
  • Validation: validation_file, n_evals (number of evaluations on the validation set).
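
A sketch combining the schedule, scheduler, regularization, checkpointing, and validation options above; the values are illustrative, not recommendations:

    job = client.fine_tuning.create(
        training_file="file-abc123",
        validation_file="file-val789",  # hypothetical validation file ID
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        n_epochs=3,
        batch_size="max",          # or an integer within the model's limits
        learning_rate=1e-5,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        min_lr_ratio=0.1,
        scheduler_num_cycles=0.5,  # cosine scheduler only
        max_grad_norm=1.0,
        weight_decay=0.01,
        n_checkpoints=1,
        n_evals=10,
    )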

LoRA Configuration

LoRA (Low-Rank Adaptation) is enabled by default (lora=True). When enabled:

  • lora_r -- Rank of LoRA adapters (defaults to model's maximum rank).
  • lora_alpha -- Scaling factor (defaults to 2 * lora_r).
  • lora_dropout -- Dropout rate (must be in [0, 1) range).
  • lora_trainable_modules -- Which modules to apply LoRA to (default: "all-linear").

Full-parameter training is used when lora=False, but not all models support it.
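
A sketch of both modes, with the LoRA values overriding the defaults described above:

    # LoRA fine-tuning (the default).
    job = client.fine_tuning.create(
        training_file="file-abc123",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        lora=True,
        lora_r=16,
        lora_alpha=32,        # defaults to 2 * lora_r
        lora_dropout=0.05,    # must be in [0, 1)
        lora_trainable_modules="all-linear",
    )

    # Full-parameter training; check that the chosen model supports it.
    job = client.fine_tuning.create(
        training_file="file-abc123",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        lora=False,
    )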

Integration Options

  • Weights & Biases: wandb_api_key, wandb_base_url, wandb_project_name, wandb_name for logging training metrics.
  • Hugging Face Hub: from_hf_model, hf_model_revision, hf_api_token for loading starting weights; hf_output_repo_name for pushing the fine-tuned model.
  • Multimodal: train_vision for training the vision encoder in multimodal models.
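
An illustrative call wiring in the logging and Hub options; the project, run, and repository names are hypothetical:

    import os

    job = client.fine_tuning.create(
        training_file="file-abc123",
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        # Weights & Biases metric logging
        wandb_api_key=os.environ["WANDB_API_KEY"],
        wandb_project_name="together-finetunes",    # hypothetical project
        wandb_name="llama-8b-sft-run-1",            # hypothetical run name
        # Push the fine-tuned model to the Hugging Face Hub
        hf_api_token=os.environ["HF_TOKEN"],
        hf_output_repo_name="my-org/llama-8b-sft",  # hypothetical repo
    )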

Price Estimation

Before submitting the job, the SDK automatically estimates the training cost and warns if the estimated price significantly exceeds the user's available credits. This estimation is skipped when continuing from a checkpoint or using a Hugging Face model.

Usage

Use this after uploading a training file. The typical workflow is:

  1. Upload the training file via client.files.upload() to obtain a file ID.
  2. Call client.fine_tuning.create(training_file=file_id, model="model-name", ...).
  3. Use the returned job ID to monitor progress via client.fine_tuning.retrieve() and client.fine_tuning.list_events().
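
Putting the three steps together, a minimal end-to-end sketch; response attribute names such as .id, .status, .data, and .message follow the SDK's response objects and should be treated as assumptions if your version differs:

    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    # 1. Upload the training data and capture the file ID.
    train_file = client.files.upload(file="train.jsonl")

    # 2. Submit the fine-tuning job.
    job = client.fine_tuning.create(
        training_file=train_file.id,
        model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
        n_epochs=3,
    )
    print("job id:", job.id)

    # 3. Monitor progress.
    status = client.fine_tuning.retrieve(job.id)
    print("status:", status.status)
    for event in client.fine_tuning.list_events(id=job.id).data:
        print(event.message)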

The method automatically fetches model training limits from the API to validate hyperparameters against model-specific constraints (batch size limits, LoRA support, vision support).

Theoretical Basis

Fine-tuning adapts a pre-trained language model to a specific task by continuing the training process on a targeted dataset. The two supported methods represent different optimization objectives:

  • SFT minimizes the standard cross-entropy loss between the model's predictions and the target tokens, optionally masking input tokens to focus the loss on output generation.
  • DPO directly optimizes a preference model from pairs of preferred and non-preferred outputs, without requiring a separate reward model. The dpo_beta parameter controls the strength of the KL divergence constraint from the reference model.
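
For reference, the canonical DPO objective that dpo_beta parameterizes can be written as follows; this is the standard formulation from the DPO literature, and the service's exact loss (including the length-normalization, RPO, and SimPO variants above) may differ:

    \mathcal{L}_{\mathrm{DPO}}(\theta) =
      -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
        \log \sigma\!\left(
          \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
      \right]

where y_w and y_l are the preferred and non-preferred outputs and π_ref is the frozen reference model.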

LoRA reduces the number of trainable parameters by decomposing weight updates into low-rank matrices, enabling fine-tuning of large models with significantly less GPU memory and compute. The rank (lora_r) determines the expressiveness of the adaptation, while lora_alpha scales the magnitude of the updates.
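
In the standard LoRA parameterization, a frozen weight matrix W_0 is augmented with a low-rank update; lora_r corresponds to r and lora_alpha to α below:

    W' = W_0 + \frac{\alpha}{r} B A,
    \quad B \in \mathbb{R}^{d \times r},\;
    A \in \mathbb{R}^{r \times k},\;
    r \ll \min(d, k)

Only A and B are trained, so the trainable parameter count for this matrix drops from d * k to r * (d + k).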
