Heuristic: LMSYS FastChat Vicuna SFT Training Hyperparameters
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
Reference training hyperparameters for Vicuna SFT and LoRA fine-tuning, including per-model-size GPU configs, batch sizes, learning rates, and FSDP/DeepSpeed settings.
Description
FastChat provides three reference training scripts with tested hyperparameters for reproducing Vicuna models. These configurations represent battle-tested defaults that balance training quality, memory efficiency, and training speed. The key insight is that effective batch size (micro-batch * gradient accumulation * GPU count) should be 128 for SFT, and that larger models require CPU offloading and different FSDP configurations.
Usage
Use this heuristic when setting up new fine-tuning runs or debugging training instability. These values are the baseline for Vicuna and should be adjusted based on dataset size and available hardware.
The Insight (Rule of Thumb)
Vicuna 7B SFT (4x GPU)
- Effective batch size: 128 (2 micro-batch * 16 gradient accumulation * 4 GPUs)
- Learning rate: 2e-5 with cosine schedule and 4% warmup
- Precision: bfloat16 + TF32
- FSDP: `"full_shard auto_wrap"` wrapping `LlamaDecoderLayer`
- Sequence length: 2048 tokens
- Epochs: 3
- Gradient checkpointing: Enabled
- Key: Uses Flash Attention via `train_mem.py`
Vicuna 13B SFT (8x GPU)
- Effective batch size: 128 (4 micro-batch * 4 gradient accumulation * 8 GPUs)
- Learning rate: 2e-5 with cosine schedule and 4% warmup
- FSDP: `"full_shard auto_wrap offload"` — adds CPU offload for larger model
- Key difference from 7B: Larger per-device batch (4 vs 2), fewer gradient accumulation steps (4 vs 16), CPU offload enabled
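The 13B script itself is not quoted under Code Evidence below, so the following is only a minimal sketch of the overrides implied by the list above, expressed against the 7B arguments; treat the exact flag spellings as assumptions rather than a verbatim excerpt.

```python
# Sketch only: overrides implied by the 13B settings listed above,
# relative to the 7B torchrun command shown under Code Evidence.
overrides_13b = {
    "nproc_per_node": 8,                     # 8 GPUs instead of 4
    "per_device_train_batch_size": 4,        # larger micro-batch (4 vs 2)
    "gradient_accumulation_steps": 4,        # fewer accumulation steps (4 vs 16)
    "fsdp": "full_shard auto_wrap offload",  # adds CPU offload for the larger model
}
```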
LoRA Fine-tuning
- LoRA rank: 8
- LoRA alpha: 16 (alpha/rank ratio = 2)
- LoRA dropout: 0.05
- Target modules: `["q_proj", "v_proj"]`
- Learning rate: 2e-5
- Warmup ratio: 3% (vs 4% for SFT)
- DeepSpeed: Stage 2 or 3 with CPU optimizer offload
- Precision: fp16 (not bfloat16 like SFT)
- Key: Use `--q_lora True` for 4-bit quantized training
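FastChat passes these arguments through to peft. As an illustration of what the defaults above correspond to, a `LoraConfig` built directly with peft would look roughly like this; it is a sketch of equivalent settings, not FastChat's exact code, and the 4-bit `--q_lora` path additionally requires a bitsandbytes quantization config.

```python
from peft import LoraConfig, get_peft_model  # assumes the peft package is installed

# Minimal sketch matching the defaults listed above.
lora_config = LoraConfig(
    r=8,                                  # lora_r
    lora_alpha=16,                        # alpha/rank ratio = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections only
    bias="none",                          # lora_bias
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model: a loaded causal LM
```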
Reasoning
The hyperparameters reflect several trade-offs:
Effective batch size = 128: This is a common sweet spot for instruction tuning. Too small causes noisy gradients; too large wastes compute on redundant updates. The 7B and 13B scripts achieve the same effective batch size through different combinations of micro-batch and gradient accumulation.
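As a quick sanity check (plain arithmetic, no framework code), both reference configurations multiply out to the same effective batch size:

```python
def effective_batch_size(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    # effective batch size = micro-batch * gradient accumulation * GPU count
    return micro_batch * grad_accum * num_gpus

assert effective_batch_size(2, 16, 4) == 128  # Vicuna 7B, 4 GPUs
assert effective_batch_size(4, 4, 8) == 128   # Vicuna 13B, 8 GPUs
```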
Learning rate = 2e-5: Conservative LR for fine-tuning pre-trained models. Higher rates risk catastrophic forgetting; lower rates under-learn.
Cosine schedule + 4% warmup: Gradual warmup prevents early instability; cosine decay avoids the abrupt drops of step-based schedules.
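The HF Trainer constructs this schedule from `--lr_scheduler_type cosine` and `--warmup_ratio 0.04`; built by hand for illustration (step counts and the model are placeholders), the equivalent scheduler looks like this:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000                       # placeholder; the Trainer derives this from data size
model = torch.nn.Linear(8, 8)              # stand-in module for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.04 * total_steps),  # 4% linear warmup
    num_training_steps=total_steps,            # cosine decay afterwards
)
```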
FSDP wrapping at decoder layer: Wrapping at `LlamaDecoderLayer` granularity balances memory savings with communication overhead. Finer granularity saves more memory but increases all-gather frequency.
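The Trainer translates `--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'` into a PyTorch auto-wrap policy internally; a rough sketch of the raw FSDP equivalent, assuming an already-initialized process group and a loaded LLaMA model, is:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Sketch: what "full_shard auto_wrap" + LlamaDecoderLayer wrapping amounts to
# in raw PyTorch FSDP. `model` stands in for the loaded LLaMA model, and a
# distributed process group must already be initialized.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},  # shard at decoder-layer granularity
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```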
13B CPU offload: At 13B parameters with 8 GPUs, the optimizer states exceed available GPU memory, requiring CPU offloading. The 7B model fits without offloading.
LoRA alpha/rank = 2: This ratio (16/8) provides a reasonable learning rate scaling for the low-rank updates. Higher ratios make LoRA updates more aggressive.
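In LoRA, the low-rank update is scaled by alpha/rank before being added to the frozen weight, so the ratio directly controls how strongly the adapter perturbs the base model. A minimal numeric illustration (generic LoRA math, not FastChat-specific code):

```python
import torch

r, lora_alpha = 8, 16
scaling = lora_alpha / r             # = 2.0; raising alpha makes the update more aggressive

d = 32                               # toy hidden size for illustration
W = torch.zeros(d, d)                # frozen base weight (stand-in)
A = torch.randn(r, d) * 0.01         # LoRA "down" projection
B = torch.zeros(d, r)                # LoRA "up" projection (zero-initialized)
W_effective = W + scaling * (B @ A)  # effective weight seen at inference
```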
Code Evidence
7B training script from `scripts/train_vicuna_7b.sh:1-25`:
```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path ~/model_weights/llama-7b \
    --bf16 True \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True
```
LoRA default hyperparameters from `fastchat/train/train_lora.py:56-65`:
```python
@dataclass
class LoraArguments:
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: typing.List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"]
    )
    lora_bias: str = "none"
    q_lora: bool = False
```
Default optimizer and sequence length from `fastchat/train/train.py:62-70`:
```python
@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512, ...)
```