Heuristic: LMSYS FastChat Vicuna SFT Training Hyperparameters
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
Reference training hyperparameters for Vicuna SFT and LoRA fine-tuning, including per-model-size GPU configs, batch sizes, learning rates, and FSDP/DeepSpeed settings.
Description
FastChat provides three reference training scripts with tested hyperparameters for reproducing Vicuna models. These configurations represent battle-tested defaults that balance training quality, memory efficiency, and training speed. The key insight is that effective batch size (micro-batch * gradient accumulation * GPU count) should be 128 for SFT, and that larger models require CPU offloading and different FSDP configurations.
Usage
Use this heuristic when setting up new fine-tuning runs or debugging training instability. These values are the baseline for Vicuna and should be adjusted based on dataset size and available hardware.
The Insight (Rule of Thumb)
Vicuna 7B SFT (4x GPU)
- Effective batch size: 128 (2 micro-batch * 16 gradient accumulation * 4 GPUs)
- Learning rate: 2e-5 with cosine schedule and 4% warmup
- Precision: bfloat16 + TF32
- FSDP: `"full_shard auto_wrap"` wrapping `LlamaDecoderLayer`
- Sequence length: 2048 tokens
- Epochs: 3
- Gradient checkpointing: Enabled
- Key: Uses Flash Attention via `train_mem.py`
Vicuna 13B SFT (8x GPU)
- Effective batch size: 128 (4 micro-batch * 4 gradient accumulation * 8 GPUs)
- Learning rate: 2e-5 with cosine schedule and 4% warmup
- FSDP: `"full_shard auto_wrap offload"` — adds CPU offload for larger model
- Key difference from 7B: Larger per-device batch (4 vs 2), fewer gradient accumulation steps (4 vs 16), CPU offload enabled
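The 13B script itself is not quoted under Code Evidence below, so the following is only a minimal sketch of the overrides implied by the list above, expressed against the 7B arguments; treat the exact flag spellings as assumptions rather than a verbatim excerpt.

```python
# Sketch only: overrides implied by the 13B settings listed above,
# relative to the 7B torchrun command shown under Code Evidence.
overrides_13b = {
    "nproc_per_node": 8,                     # 8 GPUs instead of 4
    "per_device_train_batch_size": 4,        # larger micro-batch (4 vs 2)
    "gradient_accumulation_steps": 4,        # fewer accumulation steps (4 vs 16)
    "fsdp": "full_shard auto_wrap offload",  # adds CPU offload for the larger model
}
```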
LoRA Fine-tuning
- LoRA rank: 8
- LoRA alpha: 16 (alpha/rank ratio = 2)
- LoRA dropout: 0.05
- Target modules: `["q_proj", "v_proj"]`
- Learning rate: 2e-5
- Warmup ratio: 3% (vs 4% for SFT)
- DeepSpeed: Stage 2 or 3 with CPU optimizer offload
- Precision: fp16 (not bfloat16 like SFT)
- Key: Use `--q_lora True` for 4-bit quantized training
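FastChat passes these arguments through to peft. As an illustration of what the defaults above correspond to, a `LoraConfig` built directly with peft would look roughly like this; it is a sketch of equivalent settings, not FastChat's exact code, and the 4-bit `--q_lora` path additionally requires a bitsandbytes quantization config.

```python
from peft import LoraConfig, get_peft_model  # assumes the peft package is installed

# Minimal sketch matching the defaults listed above.
lora_config = LoraConfig(
    r=8,                                  # lora_r
    lora_alpha=16,                        # alpha/rank ratio = 2
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections only
    bias="none",                          # lora_bias
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # base_model: a loaded causal LM
```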
Reasoning
The hyperparameters reflect several trade-offs:
Effective batch size = 128: This is a common sweet spot for instruction tuning. Too small causes noisy gradients; too large wastes compute on redundant updates. The 7B and 13B scripts achieve the same effective batch size through different combinations of micro-batch and gradient accumulation.
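As a quick sanity check (plain arithmetic, no framework code), both reference configurations multiply out to the same effective batch size:

```python
def effective_batch_size(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    # effective batch size = micro-batch * gradient accumulation * GPU count
    return micro_batch * grad_accum * num_gpus

assert effective_batch_size(2, 16, 4) == 128  # Vicuna 7B, 4 GPUs
assert effective_batch_size(4, 4, 8) == 128   # Vicuna 13B, 8 GPUs
```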
Learning rate = 2e-5: Conservative LR for fine-tuning pre-trained models. Higher rates risk catastrophic forgetting; lower rates under-learn.
Cosine schedule + 4% warmup: Gradual warmup prevents early instability; cosine decay avoids the abrupt drops of step-based schedules.
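The HF Trainer constructs this schedule from `--lr_scheduler_type cosine` and `--warmup_ratio 0.04`; built by hand for illustration (step counts and the model are placeholders), the equivalent scheduler looks like this:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

total_steps = 10_000                       # placeholder; the Trainer derives this from data size
model = torch.nn.Linear(8, 8)              # stand-in module for illustration only
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.04 * total_steps),  # 4% linear warmup
    num_training_steps=total_steps,            # cosine decay afterwards
)
```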
FSDP wrapping at decoder layer: Wrapping at `LlamaDecoderLayer` granularity balances memory savings with communication overhead. Finer granularity saves more memory but increases all-gather frequency.
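The Trainer translates `--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'` into a PyTorch auto-wrap policy internally; a rough sketch of the raw FSDP equivalent, assuming an already-initialized process group and a loaded LLaMA model, is:

```python
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Sketch: what "full_shard auto_wrap" + LlamaDecoderLayer wrapping amounts to
# in raw PyTorch FSDP. `model` stands in for the loaded LLaMA model, and a
# distributed process group must already be initialized.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},  # shard at decoder-layer granularity
)
model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
```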
13B CPU offload: At 13B parameters with 8 GPUs, the optimizer states exceed available GPU memory, requiring CPU offloading. The 7B model fits without offloading.
LoRA alpha/rank = 2: This ratio (16/8) provides a reasonable learning rate scaling for the low-rank updates. Higher ratios make LoRA updates more aggressive.
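In LoRA, the low-rank update is scaled by alpha/rank before being added to the frozen weight, so the ratio directly controls how strongly the adapter perturbs the base model. A minimal numeric illustration (generic LoRA math, not FastChat-specific code):

```python
import torch

r, lora_alpha = 8, 16
scaling = lora_alpha / r             # = 2.0; raising alpha makes the update more aggressive

d = 32                               # toy hidden size for illustration
W = torch.zeros(d, d)                # frozen base weight (stand-in)
A = torch.randn(r, d) * 0.01         # LoRA "down" projection
B = torch.zeros(d, r)                # LoRA "up" projection (zero-initialized)
W_effective = W + scaling * (B @ A)  # effective weight seen at inference
```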
Code Evidence
7B training script from `scripts/train_vicuna_7b.sh:1-25`:
```bash
torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path ~/model_weights/llama-7b \
    --bf16 True \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True
```
LoRA default hyperparameters from `fastchat/train/train_lora.py:56-65`:
```python
@dataclass
class LoraArguments:
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    lora_target_modules: typing.List[str] = field(
        default_factory=lambda: ["q_proj", "v_proj"]
    )
    lora_bias: str = "none"
    q_lora: bool = False
```
Default optimizer and sequence length from `fastchat/train/train.py:62-70`:
```python
@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512, ...)
```