
Implementation:Lm sys FastChat HF Trainer Train FSDP

From Leeroopedia


Page Type: Implementation (Wrapper Doc)
Title: HF Trainer Train FSDP
Repository: lm-sys/FastChat
Workflow: Vicuna SFT Finetuning
Domains: Distributed Training, FSDP, Hugging Face Trainer
Knowledge Sources: fastchat/train/train.py; Hugging Face Transformers Trainer documentation
Last Updated: 2026-02-07 14:00 GMT

Overview

This implementation documents how the Hugging Face Trainer class is instantiated and invoked within the Vicuna SFT training script to execute distributed training via FSDP. The training loop, including automatic checkpoint resumption, is handled by the Trainer.train() method, with FSDP configuration provided through TrainingArguments.

Description

The core training execution in fastchat/train/train.py consists of three steps:

Step 1: Trainer Instantiation

The Trainer is constructed with the model, tokenizer, training arguments, and the data module (training and evaluation datasets):

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    **data_module,
)

The data_module dictionary is unpacked to provide train_dataset and eval_dataset keyword arguments.
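The contract that `**data_module` relies on can be sketched as follows; `make_data_module` is a hypothetical stand-in for illustration (the actual script builds this dict in its data-module helper):

```python
# Illustrative sketch of the dict contract unpacked into Trainer(**data_module).
# make_data_module is a hypothetical stand-in, not FastChat's actual function.
def make_data_module(train_dataset, eval_dataset=None):
    """Bundle datasets into the keyword arguments Trainer expects."""
    return dict(train_dataset=train_dataset, eval_dataset=eval_dataset)

data_module = make_data_module(train_dataset=["example_1", "example_2"])
# Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
# is then equivalent to passing train_dataset=... and eval_dataset=... directly.
```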

Step 2: Training with Checkpoint Resumption

Training is launched with automatic checkpoint detection:

if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

If any checkpoint-* directories exist in the output directory, training resumes from the most recent checkpoint. Otherwise, training starts from scratch.
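The detection logic can be exercised in isolation; `has_checkpoint` below is an illustrative helper, not part of the script:

```python
import pathlib
import tempfile

def has_checkpoint(output_dir: str) -> bool:
    # Mirrors the script's glob: any checkpoint-* entry triggers resumption.
    return bool(list(pathlib.Path(output_dir).glob("checkpoint-*")))

with tempfile.TemporaryDirectory() as d:
    fresh = has_checkpoint(d)                 # no checkpoints yet
    (pathlib.Path(d) / "checkpoint-500").mkdir()
    resumed = has_checkpoint(d)               # checkpoint-500 now present
```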

Step 3: Post-Training

After training completes, the model's KV cache (use_cache, disabled during training) is re-enabled and the final model is saved:

model.config.use_cache = True
trainer.save_state()
if trainer.is_deepspeed_enabled:
    trainer.save_model()
else:
    trainer_save_model_safe(trainer)
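The helper trainer_save_model_safe is defined in the same file; a sketch consistent with its purpose under FSDP (gathering the full, unsharded state dict on rank 0 before saving) looks roughly like this. Treat it as an approximation, not the verbatim source:

```python
def trainer_save_model_safe(trainer):
    # Imports deferred so the sketch can be defined without torch installed.
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import StateDictType, FullStateDictConfig

    # Gather the full (unsharded) state dict, offloaded to CPU on rank 0 only,
    # then delegate to the standard Trainer save path.
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(
        trainer.model, StateDictType.FULL_STATE_DICT, save_policy
    ):
        trainer.save_model()
```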

Custom TrainingArguments

The FastChat training script extends transformers.TrainingArguments with a custom TrainingArguments dataclass:

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=512,
        metadata={
            "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
        },
    )

Key defaults:

  • optim: "adamw_torch" (PyTorch native AdamW)
  • model_max_length: 512 (can be overridden, e.g., 2048 for Vicuna)
  • cache_dir: None (optional cache for downloaded artifacts)
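A minimal stand-in (no transformers dependency; TrainingArgumentsSketch is hypothetical) shows how the defaults behave and how they are overridden:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingArgumentsSketch:
    # Mirrors only the three documented fields, with the documented defaults.
    cache_dir: Optional[str] = None
    optim: str = "adamw_torch"
    model_max_length: int = 512

default_args = TrainingArgumentsSketch()
vicuna_args = TrainingArgumentsSketch(model_max_length=2048)  # Vicuna override
```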

FSDP is configured via the standard fsdp and fsdp_config fields inherited from transformers.TrainingArguments.
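As one possible configuration (key names follow the transformers fsdp_config schema; the values are illustrative, not FastChat defaults), a JSON file passed via --fsdp_config might look like:

```json
{
  "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
  "backward_prefetch": "backward_pre",
  "forward_prefetch": false
}
```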

Training Variants

The FastChat repository provides multiple training script variants optimized for different attention implementations:

  • fastchat/train/train.py: standard training script (default attention)
  • fastchat/train/train_mem.py: Flash Attention 2 variant for memory-efficient training
  • fastchat/train/train_xformers.py: xFormers memory-efficient attention variant

All variants share the same Trainer-based training loop; they differ in how attention is computed during the forward pass.
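The substitution the variants perform can be illustrated with a generic monkey-patch pattern (all names below are invented for illustration; the variants patch the attention implementation before the shared training entry point builds the model):

```python
import types

# Stand-in for a model module whose attention function gets patched.
def default_attention(query):
    return ("default", query)

model_module = types.SimpleNamespace(attention=default_attention)

def flash_attention(query):
    # Drop-in replacement: same interface, different (memory-efficient) kernel.
    return ("flash", query)

def replace_attn_with_flash_attn():
    # The patch must happen before the training entry point builds the model.
    model_module.attention = flash_attention

replace_attn_with_flash_attn()
backend, _ = model_module.attention("q")
```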

Usage

Code Reference

Source Location

fastchat/train/train.py:L300-306 (Trainer instantiation and training loop)

fastchat/train/train.py:L62-70 (TrainingArguments dataclass)

Signature

# Trainer construction
trainer = Trainer(
    model: transformers.PreTrainedModel,
    tokenizer: transformers.PreTrainedTokenizer,
    args: TrainingArguments,
    train_dataset: Dataset,
    eval_dataset: Optional[Dataset],
)

# Training invocation
trainer.train(resume_from_checkpoint: Optional[bool] = None)

Import

from transformers import Trainer

I/O Contract

Inputs

  • model (transformers.PreTrainedModel): the loaded causal LM with use_cache=False.
  • tokenizer (transformers.PreTrainedTokenizer): the configured tokenizer, used for data collation and saving.
  • args (TrainingArguments): extended training arguments including the FSDP config, optimizer, batch size, learning rate, etc.
  • train_dataset (SupervisedDataset or LazySupervisedDataset): training dataset returning dicts with input_ids, labels, and attention_mask.
  • eval_dataset (SupervisedDataset, LazySupervisedDataset, or None): optional evaluation dataset.
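The per-example dict contract can be sketched concretely (token ids are invented; masking prompt positions in labels with -100 follows the common SFT convention):

```python
# Invented example item matching the documented keys; ids are not real tokens.
example = {
    "input_ids": [1, 319, 13563, 2],
    "labels": [-100, -100, 13563, 2],   # prompt positions masked out of the loss
    "attention_mask": [1, 1, 1, 1],
}
keys = set(example)
```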

Outputs

  • Trained model parameters: model weights updated via SFT.
  • Checkpoints: periodic checkpoints saved to output_dir/checkpoint-{step}.
  • Training state: optimizer state, scheduler state, and training metadata saved via trainer.save_state().
  • Final model: complete model saved in Hugging Face format via trainer_save_model_safe() or trainer.save_model().

Usage Examples

Launching Vicuna SFT training with FSDP:

torchrun --nproc_per_node=8 --nnodes=1 \
    -m fastchat.train.train \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --data_path data/sharegpt_clean.json \
    --output_dir output/vicuna-7b-sft \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --bf16 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --lazy_preprocess True \
    --save_strategy steps \
    --save_steps 500

Using the Flash Attention variant for reduced memory:

torchrun --nproc_per_node=8 \
    -m fastchat.train.train_mem \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --data_path data/sharegpt_clean.json \
    --output_dir output/vicuna-7b-sft-flash \
    --bf16 True \
    --fsdp "full_shard auto_wrap"

External References

Related Pages
