Heuristic: Togethercomputer Together Python Fine-Tuning Parameter Validation
| Knowledge Sources | |
|---|---|
| Domains | Fine_Tuning, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
Overview
Parameter validation rules and default behaviors for Together AI fine-tuning jobs, including batch size calculation for DPO and auto-detection of train-on-inputs.
Description
The SDK performs extensive client-side validation of fine-tuning hyperparameters before submitting jobs to the API. Key behaviors include: DPO batch sizes are derived automatically from the SFT limits (halved, then rounded down to a multiple of 8), `train_on_inputs` defaults to `"auto"` for SFT when not set, and LoRA parameters have specific range constraints. Understanding these defaults and limits prevents expensive failed training jobs.
Usage
Apply this heuristic when configuring fine-tuning jobs via `client.fine_tuning.create()`; a hedged example call follows the list below. It is especially important when:
- Switching between SFT and DPO training methods
- Using LoRA adapters
- Tuning learning rate schedules
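A minimal SFT sketch, assuming a LoRA run. The file ID and model name are placeholders, and any parameter not cited elsewhere in this heuristic should be checked against the SDK reference before use:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Hypothetical SFT job: batch_size="max" asks the SDK to pick the largest
# supported batch size; train_on_inputs is left unset, so it resolves to "auto".
job = client.fine_tuning.create(
    training_file="file-xxxxxxxxxxxx",  # placeholder file ID
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder model name
    n_epochs=3,
    batch_size="max",
    learning_rate=1e-5,
    lora=True,
    lora_r=16,          # lora_alpha left unset, so it defaults to lora_r * 2 = 32
    lora_dropout=0.05,  # must be in [0, 1)
    warmup_ratio=0.05,  # must be in [0, 1]
)
print(job.id)
```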
The Insight (Rule of Thumb)
Batch Size:
- Action: Use `batch_size="max"` to auto-select the largest batch size the model and training method support.
- Value: DPO max batch size = `max(min_batch_size, round_down_to_8(max_batch_size / 2))` (see the worked sketch after this list).
- Trade-off: DPO needs roughly half the SFT batch size because it processes preference pairs. Rounding down to a multiple of 8 keeps tensor shapes aligned for efficient GPU operations.
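A self-contained restatement of that formula with hypothetical limits, matching the SDK code quoted under Reasoning below:

```python
def dpo_max_batch_size(max_batch_size: int, min_batch_size: int) -> int:
    # Halve the SFT maximum, round down to a multiple of 8, and never go below the minimum.
    half_max = max_batch_size // 2
    rounded_half_max = (half_max // 8) * 8
    return max(min_batch_size, rounded_half_max)

# Hypothetical limits: an SFT max of 96 gives a DPO max of 48,
# while an SFT max of 44 gives 16 (22 rounded down to the nearest multiple of 8).
assert dpo_max_batch_size(96, 8) == 48
assert dpo_max_batch_size(44, 8) == 16
```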
Train on Inputs:
- Action: For SFT, if `train_on_inputs` is not specified, it defaults to `"auto"`.
- Value: `"auto"` lets the system decide based on dataset format whether to mask input tokens.
- Trade-off: Explicit `True`/`False` gives full control; `"auto"` is safer for beginners.
LoRA Defaults:
- Action: If `lora_alpha` is not specified, it defaults to `lora_r * 2`.
- Value: `lora_alpha = 2 * lora_r` is a standard scaling choice.
- Trade-off: Higher alpha amplifies LoRA updates; lower alpha is more conservative.
Parameter Constraints:
- `warmup_ratio`: must be in [0, 1]
- `min_lr_ratio`: must be in [0, 1]
- `max_grad_norm`: must be >= 0
- `weight_decay`: must be >= 0
- `lora_dropout`: must be in [0, 1)
- `scheduler_num_cycles`: must be > 0 (for cosine scheduler)
- `rpo_alpha`, `simpo_gamma`: must be >= 0
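A minimal sketch of these range checks in plain Python; it only illustrates the rules listed above and is not the SDK's actual validation code:

```python
def check_hyperparameter_ranges(
    warmup_ratio: float,
    min_lr_ratio: float,
    max_grad_norm: float,
    weight_decay: float,
    lora_dropout: float,
    scheduler_num_cycles: float,
) -> None:
    # Raise ValueError for any value outside the documented ranges.
    if not 0.0 <= warmup_ratio <= 1.0:
        raise ValueError("warmup_ratio must be in [0, 1]")
    if not 0.0 <= min_lr_ratio <= 1.0:
        raise ValueError("min_lr_ratio must be in [0, 1]")
    if max_grad_norm < 0.0:
        raise ValueError("max_grad_norm must be >= 0")
    if weight_decay < 0.0:
        raise ValueError("weight_decay must be >= 0")
    if not 0.0 <= lora_dropout < 1.0:
        raise ValueError("lora_dropout must be in [0, 1)")
    if scheduler_num_cycles <= 0.0:
        raise ValueError("scheduler_num_cycles must be > 0 for the cosine scheduler")
```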
Method-Specific Parameters (a hedged DPO call sketch follows this list):
- `train_on_inputs` only works with SFT (not DPO)
- `dpo_beta`, `dpo_normalize_logratios_by_length`, `rpo_alpha`, `simpo_gamma` only work with DPO
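A hedged DPO sketch: the file ID and model are placeholders, and the exact accepted value for `training_method` should be confirmed against the SDK reference (the validation code quoted below compares it to the string `"sft"`, so `"dpo"` is assumed here):

```python
from together import Together

client = Together()

# Hypothetical DPO job: train_on_inputs is deliberately omitted (SFT-only),
# while dpo_beta is a DPO-only knob. batch_size="max" resolves to the halved,
# rounded-down DPO limit described above.
dpo_job = client.fine_tuning.create(
    training_file="file-yyyyyyyyyyyy",  # placeholder: preference-formatted dataset
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # placeholder model name
    training_method="dpo",  # assumed value; see note above
    dpo_beta=0.1,
    batch_size="max",
    n_epochs=1,
)
print(dpo_job.id)
```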
Reasoning
DPO batch halving: DPO (Direct Preference Optimization) processes preference pairs, so each training step effectively uses 2x the memory per sample compared to SFT. Halving the max batch size prevents OOM errors. Rounding down to the nearest multiple of 8 ensures efficient GPU tensor operations.
Auto train_on_inputs: Masking input tokens during loss computation prevents the model from being trained to reproduce the user prompt, focusing it on generating completions. The `"auto"` mode detects whether this is appropriate based on the dataset format.
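For intuition, a generic illustration of what masking input tokens means for the training labels; the ignore index of -100 is PyTorch's cross-entropy convention, not something taken from the Together SDK:

```python
# Placeholder token IDs for a prompt and its completion.
prompt_ids = [101, 2023, 2003, 1996, 6436]
completion_ids = [3437, 2003, 2182, 102]

input_ids = prompt_ids + completion_ids

# train_on_inputs=False: prompt positions get the ignore label (-100),
# so the loss is computed only on completion tokens.
labels_masked = [-100] * len(prompt_ids) + completion_ids

# train_on_inputs=True: the loss covers every token, prompt included.
labels_unmasked = list(input_ids)
```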
Code evidence from `src/together/types/finetune.py:404-409`:
```python
def __init__(self, **data: Any) -> None:
    super().__init__(**data)
    if self.max_batch_size_dpo == -1:
        half_max = self.max_batch_size // 2
        rounded_half_max = (half_max // 8) * 8
        self.max_batch_size_dpo = max(self.min_batch_size, rounded_half_max)
```
Code evidence from `src/together/resources/finetune.py:193-197`:
```python
if train_on_inputs is None and training_method == "sft":
    log_warn_once(
        "train_on_inputs is not set for SFT training, it will be set to 'auto'"
    )
    train_on_inputs = "auto"
```
LoRA alpha default from `src/together/resources/finetune.py:130-131`:
```python
lora_r = lora_r if lora_r is not None else model_limits.lora_training.max_rank
lora_alpha = lora_alpha if lora_alpha is not None else lora_r * 2
```
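Tracing those two lines with a hypothetical model limit of `max_rank = 64` and both LoRA values left unset:

```python
# Hypothetical: model_limits.lora_training.max_rank == 64 and the user passed neither value.
max_rank = 64
lora_r = None
lora_alpha = None

lora_r = lora_r if lora_r is not None else max_rank                 # resolves to 64
lora_alpha = lora_alpha if lora_alpha is not None else lora_r * 2   # resolves to 128
```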