Workflow:Lm sys FastChat Vicuna SFT Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Supervised_Learning |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
End-to-end process for full-parameter supervised fine-tuning (SFT) of LLaMA-based models on multi-turn conversation data to produce Vicuna chat models.
Description
This workflow covers the complete procedure for training Vicuna models from base LLaMA weights using ShareGPT-style conversation data. The training uses the HuggingFace Transformers Trainer with Fully Sharded Data Parallel (FSDP) or DeepSpeed for distributed training across multiple GPUs. The process handles conversation template formatting, target masking (only computing loss on assistant responses), RoPE scaling for extended context, and checkpoint management. The output is a fully fine-tuned causal language model capable of multi-turn instruction following.
Usage
Execute this workflow when you have cleaned conversation data in ShareGPT format (JSON with a "conversations" field containing alternating human/gpt turns) and need to produce a full-parameter fine-tuned chat model. It is appropriate when you have access to multiple high-end GPUs (e.g., 4x A100 40GB for 7B models) and want maximum model quality without the constraints of parameter-efficient methods.
Execution Steps
Step 1: Environment Setup
Install FastChat from source with the train extra, which adds the training-specific dependencies, including Flash Attention support, on top of the core PyTorch and Transformers stack. DeepSpeed may need to be installed separately if it is not already present in the environment.
Key considerations:
- Install from source with the train extra: pip3 install -e ".[train]"
- Flash Attention requires a GPU architecture it supports (e.g., A100, H100)
- For V100 GPUs, use the xformers attention variant instead
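The hardware guidance above can be expressed as a small selection helper. This is a minimal sketch: the function name `pick_training_script` is hypothetical, and the compute-capability threshold of 8 (Ampere and newer) is an assumption drawn from Flash Attention's published hardware requirements rather than from this workflow. On a real machine the capability tuple could be obtained from `torch.cuda.get_device_capability()`.

```python
def pick_training_script(compute_capability):
    """Return the training script variant suited to a GPU.

    compute_capability is a (major, minor) tuple, e.g. (8, 0) for A100,
    (9, 0) for H100, (7, 0) for V100.
    """
    major, _minor = compute_capability
    if major >= 8:
        # Assumed threshold: Flash Attention targets Ampere or newer GPUs.
        return "train_mem.py"
    # Older GPUs such as V100 fall back to the xformers variant.
    return "train_xformers.py"


print(pick_training_script((8, 0)))  # A100
print(pick_training_script((7, 0)))  # V100
```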
Step 2: Data Preparation
Prepare conversation data in the ShareGPT JSON format. Each example must contain a "conversations" array with alternating "human" and "gpt" turns. The data should already be cleaned (HTML removed, language filtered, long conversations split) before reaching this step.
Key considerations:
- Data format: JSON list of objects, each with a "conversations" key
- Each conversation is a list of {"from": "human"|"gpt", "value": "..."} entries
- A sample dataset is provided at data/dummy_conversation.json for testing
- The data cleaning pipeline (separate workflow) should be run first for production data
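The format rules above can be checked before training with a short validator. The sketch below is illustrative only: `validate_sharegpt` is a hypothetical helper, not part of FastChat, the `"id"` field in the sample is an optional extra, and the check assumes each conversation starts with a human turn.

```python
import json

VALID_ROLES = ("human", "gpt")


def validate_sharegpt(records):
    """Return a list of (index, problem) pairs; an empty list means valid."""
    problems = []
    for i, rec in enumerate(records):
        turns = rec.get("conversations") if isinstance(rec, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append((i, "missing or empty 'conversations' list"))
            continue
        for j, turn in enumerate(turns):
            expected = VALID_ROLES[j % 2]  # assumed: human first, then alternate
            if not isinstance(turn, dict) or turn.get("from") != expected:
                problems.append((i, f"turn {j}: expected role '{expected}'"))
                break
            if not isinstance(turn.get("value"), str):
                problems.append((i, f"turn {j}: 'value' must be a string"))
                break
    return problems


sample = json.loads("""
[
  {"id": "demo_0", "conversations": [
    {"from": "human", "value": "What is FSDP?"},
    {"from": "gpt", "value": "A way to shard model state across GPUs."}
  ]}
]
""")
print(validate_sharegpt(sample))  # prints []
```

Running a check like this on data/dummy_conversation.json first is a cheap way to confirm the loader and the validator agree on the format before pointing it at production data.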
Step 3: Model and Tokenizer Loading
Load the base LLaMA model and tokenizer from HuggingFace Hub or a local path. The loader configures RoPE scaling if the requested max sequence length exceeds the model's native context window, and disables KV cache for training efficiency.
Key considerations:
- RoPE linear scaling is auto-applied when model_max_length exceeds the pretrained context length
- The pad token is set to the unknown token, because the LLaMA tokenizer does not define a pad token by default
- trust_remote_code can be enabled for custom model architectures
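The RoPE scaling rule can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the helper name `linear_rope_scaling` is hypothetical, rounding the factor up to a whole number is one reasonable choice, and the `{"type": "linear", "factor": ...}` shape follows the `rope_scaling` setting on the Transformers LLaMA config, whose exact keys can differ between library versions.

```python
import math


def linear_rope_scaling(model_max_length, pretrained_ctx_len):
    """Return a rope_scaling dict, or None when no scaling is needed."""
    if pretrained_ctx_len and model_max_length > pretrained_ctx_len:
        # Round up so the scaled window always covers the requested length.
        factor = float(math.ceil(model_max_length / pretrained_ctx_len))
        return {"type": "linear", "factor": factor}
    return None


print(linear_rope_scaling(2048, 2048))  # None: fits the native window
print(linear_rope_scaling(8192, 2048))  # {'type': 'linear', 'factor': 4.0}
print(linear_rope_scaling(5000, 2048))  # {'type': 'linear', 'factor': 3.0}
```

In a training script, the returned dict would be assigned to the model config before the weights are loaded, alongside setting `use_cache` to False.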
Step 4: Conversation Preprocessing
Apply the Vicuna conversation template to raw conversation data and tokenize. The preprocessing maps human/gpt roles to the template's role format, applies prompt formatting with separators, and creates target masks that exclude user turns from the loss computation.
Key considerations:
- Only assistant (gpt) responses contribute to training loss
- User instructions are masked with IGNORE_TOKEN_ID in the target labels
- Lazy preprocessing is available to defer tokenization to training time, reducing upfront memory usage
- Tokenization mismatches between the formatted template text and the actual token boundaries trigger a warning, and the affected example is fully masked so it does not contribute to the loss
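The masking idea can be shown on a toy example. The sketch below is not the FastChat preprocessing code: it uses a whitespace "tokenizer" with an incrementally built vocabulary and bare `USER:` / `ASSISTANT:` prefixes instead of the full Vicuna template with its system prompt and separators. The value -100 for `IGNORE_TOKEN_ID` matches the default ignore index of PyTorch's cross-entropy loss. `build_example` is a hypothetical name.

```python
IGNORE_TOKEN_ID = -100  # label value skipped by the loss


def build_example(turns, vocab):
    """Return (input_ids, labels) with only assistant words left unmasked."""
    input_ids, labels = [], []
    for turn in turns:
        is_assistant = turn["from"] == "gpt"
        prefix = "ASSISTANT:" if is_assistant else "USER:"
        words = [prefix] + turn["value"].split()
        # Toy tokenizer: assign each new word the next free integer id.
        ids = [vocab.setdefault(w, len(vocab)) for w in words]
        input_ids.extend(ids)
        if is_assistant:
            # Mask the role prefix, keep the response tokens as targets.
            labels.extend([IGNORE_TOKEN_ID] + ids[1:])
        else:
            # User turns never contribute to the loss.
            labels.extend([IGNORE_TOKEN_ID] * len(ids))
    return input_ids, labels


turns = [
    {"from": "human", "value": "Say hi"},
    {"from": "gpt", "value": "Hi there"},
]
input_ids, labels = build_example(turns, vocab={})
print(input_ids)  # [0, 1, 2, 3, 4, 5]
print(labels)     # [-100, -100, -100, -100, 4, 5]
```

The labels line up one-to-one with the inputs, so the model still sees the user turn as context while only the two response tokens produce gradient.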
Step 5: Distributed Training
Launch the HuggingFace Trainer with FSDP or DeepSpeed for multi-GPU training. The training loop uses AdamW optimizer with cosine learning rate scheduling and optional gradient checkpointing to reduce memory usage.
Key considerations:
- FSDP wraps at the transformer layer level (e.g., LlamaDecoderLayer)
- Standard hyperparameters: batch size 128 (global), lr 2e-5, 3 epochs, warmup ratio 0.03
- Gradient checkpointing trades compute for memory, enabling longer sequences
- Training can resume from checkpoints automatically if they exist in the output directory
- Flash Attention (train_mem.py) or xformers (train_xformers.py) variants reduce memory
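The global batch size of 128 is a product of three settings, and the warmup ratio translates into a step count once the dataset size is known. The arithmetic below is a rough sketch: the split into 2 per device, 16 accumulation steps, and 4 GPUs is one illustrative way to reach 128, the dataset size of 70,000 is a made-up number, and the Trainer's own step accounting may differ slightly at epoch boundaries.

```python
import math


def global_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Examples consumed per optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * num_gpus


def schedule_steps(num_examples, epochs, global_batch, warmup_ratio):
    """Return (total optimizer steps, warmup steps), both approximate."""
    steps_per_epoch = math.ceil(num_examples / global_batch)
    total = steps_per_epoch * epochs
    return total, math.ceil(total * warmup_ratio)


batch = global_batch_size(per_device_batch=2, grad_accum_steps=16, num_gpus=4)
total, warmup = schedule_steps(70_000, epochs=3, global_batch=batch,
                               warmup_ratio=0.03)
print(batch)          # 128
print(total, warmup)  # 1641 50
```

When the GPU count changes, adjusting the gradient accumulation steps so the product stays at 128 keeps the other hyperparameters comparable.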
Step 6: Model Saving
Save the trained model weights, tokenizer, and training state. The saving procedure handles FSDP state dict consolidation (gathering sharded parameters to rank 0) and DeepSpeed engine state management.
Key considerations:
- FSDP requires special handling via FullStateDictConfig with offload_to_cpu and rank0_only
- With DeepSpeed enabled, saving goes through the standard save_model call, which delegates to the DeepSpeed engine
- The use_cache flag is re-enabled after training for inference compatibility
- Training state (optimizer, scheduler) is also saved for potential resumption
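Saved checkpoints are what make automatic resumption possible on the next launch. A minimal sketch of the detection step follows, assuming checkpoints are written as `checkpoint-*` entries inside the output directory (the Trainer's usual naming); `has_checkpoint` is a hypothetical helper, and the temporary directory stands in for a real output path.

```python
import pathlib
import tempfile


def has_checkpoint(output_dir):
    """True if output_dir contains at least one checkpoint-* entry."""
    return any(pathlib.Path(output_dir).glob("checkpoint-*"))


with tempfile.TemporaryDirectory() as out:
    before = has_checkpoint(out)                    # fresh run
    (pathlib.Path(out) / "checkpoint-500").mkdir()  # simulate a saved step
    after = has_checkpoint(out)                     # resumable run

print(before, after)  # False True
```

A launcher could use the result to choose between `trainer.train(resume_from_checkpoint=True)` and a plain `trainer.train()`.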