Heuristic: Togethercomputer Together Python Fine-Tuning Parameter Validation
| Knowledge Sources | |
|---|---|
| Domains | Fine_Tuning, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Parameter validation rules and default behaviors for Together AI fine-tuning jobs, including batch size calculation for DPO and auto-detection of train-on-inputs.
Description
The SDK performs extensive client-side validation of fine-tuning hyperparameters before submitting jobs to the API. Key behaviors include: DPO batch sizes are derived automatically from the SFT limits (halved, then rounded down to a multiple of 8), `train_on_inputs` defaults to `"auto"` for SFT when not set, and LoRA parameters have specific range constraints. Understanding these defaults and limits prevents expensive failed training jobs.
Usage
Apply this heuristic when configuring fine-tuning jobs via `client.fine_tuning.create()`; a hedged example call follows the list below. It is especially important when:
- Switching between SFT and DPO training methods
- Using LoRA adapters
- Tuning learning rate schedules
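A minimal SFT sketch, assuming a LoRA run. The file ID and model name are placeholders, and any parameter not cited elsewhere in this heuristic should be checked against the SDK reference before use:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Hypothetical SFT job: batch_size="max" asks the SDK to pick the largest
# supported batch size; train_on_inputs is left unset, so it resolves to "auto".
job = client.fine_tuning.create(
    training_file="file-xxxxxxxxxxxx",  # placeholder file ID
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder model name
    n_epochs=3,
    batch_size="max",
    learning_rate=1e-5,
    lora=True,
    lora_r=16,          # lora_alpha left unset, so it defaults to lora_r * 2 = 32
    lora_dropout=0.05,  # must be in [0, 1)
    warmup_ratio=0.05,  # must be in [0, 1]
)
print(job.id)
```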
The Insight (Rule of Thumb)
Batch Size:
- Action: Use `batch_size="max"` to auto-select the largest batch size the model and training method support.
- Value: DPO max batch size = `max(min_batch_size, round_down_to_8(max_batch_size / 2))` (see the worked sketch after this list).
- Trade-off: DPO needs roughly half the SFT batch size because it processes preference pairs. Rounding down to a multiple of 8 keeps tensor shapes aligned for efficient GPU operations.
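A self-contained restatement of that formula with hypothetical limits, matching the SDK code quoted under Reasoning below:

```python
def dpo_max_batch_size(max_batch_size: int, min_batch_size: int) -> int:
    # Halve the SFT maximum, round down to a multiple of 8, and never go below the minimum.
    half_max = max_batch_size // 2
    rounded_half_max = (half_max // 8) * 8
    return max(min_batch_size, rounded_half_max)

# Hypothetical limits: an SFT max of 96 gives a DPO max of 48,
# while an SFT max of 44 gives 16 (22 rounded down to the nearest multiple of 8).
assert dpo_max_batch_size(96, 8) == 48
assert dpo_max_batch_size(44, 8) == 16
```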
Train on Inputs:
- Action: For SFT, if `train_on_inputs` is not specified, it defaults to `"auto"`.
- Value: `"auto"` lets the system decide based on dataset format whether to mask input tokens.
- Trade-off: Explicit `True`/`False` gives full control; `"auto"` is safer for beginners.
LoRA Defaults:
- Action: If `lora_alpha` is not specified, it defaults to `lora_r * 2`.
- Value: `lora_alpha = 2 * lora_r` is a standard scaling choice.
- Trade-off: Higher alpha amplifies LoRA updates; lower alpha is more conservative.
Parameter Constraints:
- `warmup_ratio`: must be in [0, 1]
- `min_lr_ratio`: must be in [0, 1]
- `max_grad_norm`: must be >= 0
- `weight_decay`: must be >= 0
- `lora_dropout`: must be in [0, 1)
- `scheduler_num_cycles`: must be > 0 (for cosine scheduler)
- `rpo_alpha`, `simpo_gamma`: must be >= 0
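A minimal sketch of these range checks in plain Python; it only illustrates the rules listed above and is not the SDK's actual validation code:

```python
def check_hyperparameter_ranges(
    warmup_ratio: float,
    min_lr_ratio: float,
    max_grad_norm: float,
    weight_decay: float,
    lora_dropout: float,
    scheduler_num_cycles: float,
) -> None:
    # Raise ValueError for any value outside the documented ranges.
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be in [0, 1]")
    if not 0.0 <= min_lr_ratio <= 1.0:
        raise ValueError("min_lr_ratio must be in [0, 1]")
    if max_grad_norm < 0.0:
        raise ValueError("max_grad_norm must be >= 0")
    if weight_decay < 0.0:
        raise ValueError("weight_decay must be >= 0")
    if not 0.0 <= lora_dropout < 1.0:
        raise ValueError("lora_dropout must be in [0, 1)")
    if scheduler_num_cycles <= 0.0:
        raise ValueError("scheduler_num_cycles must be > 0 for the cosine scheduler")
```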
Method-Specific Parameters (a hedged DPO call sketch follows this list):
- `train_on_inputs` only works with SFT (not DPO)
- `dpo_beta`, `dpo_normalize_logratios_by_length`, `rpo_alpha`, `simpo_gamma` only work with DPO
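A hedged DPO sketch: the file ID and model are placeholders, and the exact accepted value for `training_method` should be confirmed against the SDK reference (the validation code quoted below compares it to the string `"sft"`, so `"dpo"` is assumed here):

```python
from together import Together

client = Together()

# Hypothetical DPO job: train_on_inputs is deliberately omitted (SFT-only),
# while dpo_beta is a DPO-only knob. batch_size="max" resolves to the halved,
# rounded-down DPO limit described above.
dpo_job = client.fine_tuning.create(
    training_file="file-yyyyyyyyyyyy",  # placeholder: preference-formatted dataset
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder model name
    training_method="dpo",  # assumed value; see note above
    dpo_beta=0.1,
    batch_size="max",
    n_epochs=1,
)
print(dpo_job.id)
```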
Reasoning
DPO batch halving: DPO (Direct Preference Optimization) processes preference pairs, so each training step effectively uses 2x the memory per sample compared to SFT. Halving the max batch size prevents OOM errors. Rounding down to the nearest multiple of 8 ensures efficient GPU tensor operations.
Auto train_on_inputs: Masking input tokens during loss computation prevents the model from being trained to reproduce the user prompt, focusing it on generating completions. The `"auto"` mode detects whether this is appropriate based on the dataset format.
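For intuition, a generic illustration of what masking input tokens means for the training labels; the ignore index of -100 is PyTorch's cross-entropy convention, not something taken from the Together SDK:

```python
# Placeholder token IDs for a prompt and its completion.
prompt_ids = [101, 2023, 2003, 1996, 6436]
completion_ids = [3437, 2003, 2182, 102]

input_ids = prompt_ids + completion_ids

# train_on_inputs=False: prompt positions get the ignore label (-100),
# so the loss is computed only on completion tokens.
labels_masked = [-100] * len(prompt_ids) + completion_ids

# train_on_inputs=True: the loss covers every token, prompt included.
labels_unmasked = list(input_ids)
```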
Code evidence from `src/together/types/finetune.py:404-409`:
```python
def __init__(self, **data: Any) -> None:
    super().__init__(**data)
    if self.max_batch_size_dpo == -1:
        half_max = self.max_batch_size // 2
        rounded_half_max = (half_max // 8) * 8
        self.max_batch_size_dpo = max(self.min_batch_size, rounded_half_max)
```
Code evidence from `src/together/resources/finetune.py:193-197`:
```python
if train_on_inputs is None and training_method == "sft":
    log_warn_once(
        "train_on_inputs is not set for SFT training, it will be set to 'auto'"
    )
    train_on_inputs = "auto"
```
LoRA alpha default from `src/together/resources/finetune.py:130-131`:
```python
lora_r = lora_r if lora_r is not None else model_limits.lora_training.max_rank
lora_alpha = lora_alpha if lora_alpha is not None else lora_r * 2
```
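Tracing those two lines with a hypothetical model limit of `max_rank = 64` and both LoRA values left unset:

```python
# Hypothetical: model_limits.lora_training.max_rank == 64 and the user passed neither value.
max_rank = 64
lora_r = None
lora_alpha = None

lora_r = lora_r if lora_r is not None else max_rank                 # resolves to 64
lora_alpha = lora_alpha if lora_alpha is not None else lora_r * 2   # resolves to 128
```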