

Workflow:Lm sys FastChat Vicuna SFT Finetuning

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (auto-imported from workflows/Lm_sys_FastChat_Vicuna_SFT_Finetuning.md)


Knowledge Sources
Domains: LLMs, Fine_Tuning, Supervised_Learning
Last Updated: 2026-02-07 04:00 GMT

Overview

End-to-end process for full-parameter supervised fine-tuning (SFT) of LLaMA-based models on multi-turn conversation data to produce Vicuna chat models.

Description

This workflow covers the complete procedure for training Vicuna models from base LLaMA weights using ShareGPT-style conversation data. Training uses the Hugging Face Transformers Trainer with either Fully Sharded Data Parallel (FSDP) or DeepSpeed to distribute work across multiple GPUs. The process handles conversation template formatting, target masking (loss is computed only on assistant responses), RoPE scaling for extended context, and checkpoint management. The output is a fully fine-tuned causal language model capable of multi-turn instruction following.

Usage

Use this workflow when you have cleaned conversation data in ShareGPT format (a JSON list in which each record has a "conversations" field containing alternating human/gpt turns) and need to produce a full-parameter fine-tuned chat model. It is appropriate when you have access to multiple high-end GPUs (e.g., 4x A100 40GB for 7B models) and want maximum model quality rather than the lower memory cost of parameter-efficient methods.

Execution Steps

Step 1: Environment Setup

Install FastChat from source with its training dependencies. Together with the core PyTorch and Transformers stack, the train extra provides what distributed training needs, including DeepSpeed and Flash Attention support.

Key considerations:

  • Install from source with the train extra: pip3 install -e ".[train]"
  • Flash Attention requires a supported GPU (for example A100 or H100)
  • For V100 GPUs, which Flash Attention does not support, use the xformers attention variant (train_xformers.py) instead
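
As a quick sanity check after installation, the sketch below reports which of the packages mentioned in this step can be imported. The import names (torch, transformers, deepspeed, flash_attn, xformers) are the usual ones for these projects, and the helper check_training_packages is hypothetical; it is not part of FastChat.

```python
import importlib.util

# Import names assumed for the packages mentioned in this step.
TRAINING_PACKAGES = ("torch", "transformers", "deepspeed", "flash_attn", "xformers")


def check_training_packages(names=TRAINING_PACKAGES):
    """Return {module_name: bool} telling whether each module can be located."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages.
            status[name] = False
    return status


for module, found in check_training_packages().items():
    print(f"{module:<14} {'ok' if found else 'missing'}")
```

If flash_attn is reported missing on a V100 machine, that is expected; xformers is the package to look for in that case.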

Step 2: Data Preparation

Prepare conversation data in the ShareGPT JSON format. Each example must contain a "conversations" array with alternating "human" and "gpt" turns. The data should already be cleaned (HTML removed, language filtered, long conversations split) before reaching this step.

Key considerations:

  • Data format: JSON list of objects, each with a "conversations" key
  • Each conversation is a list of {"from": "human"|"gpt", "value": "..."} entries
  • A sample dataset is provided at data/dummy_conversation.json for testing
  • The data cleaning pipeline (separate workflow) should be run first for production data
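
The format described above can be checked before training with a small validator. This is a minimal sketch using only the standard library; validate_sharegpt is a hypothetical helper, and it assumes every conversation starts with a human turn, which is stricter than the training code may require.

```python
import json

VALID_ROLES = ("human", "gpt")


def validate_sharegpt(records):
    """Return a list of problems found; an empty list means the data looks valid."""
    if not isinstance(records, list):
        return ["top level must be a JSON list"]
    problems = []
    for i, record in enumerate(records):
        turns = record.get("conversations") if isinstance(record, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {i}: missing or empty 'conversations' list")
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f"record {i}, turn {j}: turn must be an object")
                continue
            expected = VALID_ROLES[j % 2]  # assumption: human speaks first
            if turn.get("from") != expected:
                problems.append(
                    f"record {i}, turn {j}: expected 'from' == {expected!r}, "
                    f"got {turn.get('from')!r}"
                )
            if not isinstance(turn.get("value"), str):
                problems.append(f"record {i}, turn {j}: 'value' must be a string")
    return problems


sample = json.loads("""
[
  {"conversations": [
    {"from": "human", "value": "What is FSDP?"},
    {"from": "gpt", "value": "It shards model parameters across GPUs."}
  ]}
]
""")
print(validate_sharegpt(sample))  # an empty list means no problems were found
```

For a real file, load it with json.load and pass the result to the same function.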

Step 3: Model and Tokenizer Loading

Load the base LLaMA model and tokenizer from HuggingFace Hub or a local path. The loader configures RoPE scaling if the requested max sequence length exceeds the model's native context window, and disables KV cache for training efficiency.

Key considerations:

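  • RoPE linear scaling is applied automatically when model_max_length exceeds the pretrained context length, with a scaling factor derived from the ratio of the two
  • The pad token is set to the tokenizer's unknown token (unk_token), because the LLaMA tokenizer does not define a dedicated pad token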
  • trust_remote_code can be enabled for custom model architectures
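
The RoPE decision can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the helper linear_rope_scaling is hypothetical, the factor is assumed to be the ratio rounded up to a whole number, and the dictionary keys follow the "type"/"factor" convention used for LLaMA rope_scaling in some Transformers releases (newer releases may expect different keys).

```python
import math


def linear_rope_scaling(model_max_length, pretrained_ctx_len):
    """Return a rope_scaling dict, or None when the native context is long enough."""
    if pretrained_ctx_len and model_max_length > pretrained_ctx_len:
        # Assumption: round the ratio up so the scaled context covers the request.
        factor = float(math.ceil(model_max_length / pretrained_ctx_len))
        return {"type": "linear", "factor": factor}
    return None


print(linear_rope_scaling(2048, 2048))  # fits the native window, no scaling
print(linear_rope_scaling(8192, 2048))  # four times the native window
print(linear_rope_scaling(5000, 2048))  # ratio of about 2.44 is rounded up
```

In a real loader the returned dictionary would be assigned to the model config before the weights are loaded, alongside setting use_cache to False for training.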

Step 4: Conversation Preprocessing

Apply the Vicuna conversation template to raw conversation data and tokenize. The preprocessing maps human/gpt roles to the template's role format, applies prompt formatting with separators, and creates target masks that exclude user turns from the loss computation.

Key considerations:

  • Only assistant (gpt) responses contribute to training loss
  • User instructions are masked with IGNORE_TOKEN_ID (-100, the index that the PyTorch cross-entropy loss ignores) in the target labels
  • Lazy preprocessing is available to defer tokenization to training time, reducing upfront memory usage
  • Tokenization mismatches between the template formatting and the actual token boundaries trigger a warning, and the affected example is masked out so it does not contribute to the loss
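
Target masking can be illustrated with a toy example. The sketch below is not the FastChat preprocessing code: it uses a whitespace tokenizer and simplified "USER:" / "ASSISTANT:" prefixes as stand-ins for the real LLaMA tokenizer and Vicuna template, and all function names are hypothetical. It only shows how labels are copied for assistant tokens and set to the ignore index everywhere else.

```python
IGNORE_TOKEN_ID = -100  # index ignored by the PyTorch cross-entropy loss


def toy_tokenize(text, vocab):
    """Whitespace tokenizer standing in for the real LLaMA tokenizer."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_inputs_and_labels(turns, vocab):
    """Concatenate all turns; keep labels only for assistant response tokens."""
    input_ids, labels = [], []
    for turn in turns:
        prefix = "USER:" if turn["from"] == "human" else "ASSISTANT:"
        prefix_ids = toy_tokenize(prefix, vocab)
        value_ids = toy_tokenize(turn["value"], vocab)
        input_ids += prefix_ids + value_ids
        if turn["from"] == "gpt":
            # The role prefix is prompt text, so only the response is supervised.
            labels += [IGNORE_TOKEN_ID] * len(prefix_ids) + value_ids
        else:
            labels += [IGNORE_TOKEN_ID] * (len(prefix_ids) + len(value_ids))
    return input_ids, labels


turns = [
    {"from": "human", "value": "Say hi"},
    {"from": "gpt", "value": "Hi there"},
]
vocab = {}
ids, labels = build_inputs_and_labels(turns, vocab)
print(ids)
print(labels)
```

The two lists always have the same length, and every position that belongs to the user turn or to a role prefix carries IGNORE_TOKEN_ID in the labels.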

Step 5: Distributed Training

Launch the Hugging Face Trainer with FSDP or DeepSpeed for multi-GPU training. The training loop uses the AdamW optimizer with a cosine learning-rate schedule and, optionally, gradient checkpointing to reduce memory usage.
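
The shape of a cosine schedule with warmup can be sketched as follows. The function is written from the standard definition (linear warmup, then half-cosine decay to zero) rather than copied from Transformers; cosine_lr is a hypothetical name, the defaults of 2e-5 and 0.03 are the values quoted in this workflow, and the total of 1000 steps is purely illustrative.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total = 1000  # illustrative number of optimizer steps
for s in (0, 15, 30, 515, 1000):
    print(s, f"{cosine_lr(s, total):.2e}")
```

The rate starts at zero, peaks at the base learning rate when warmup ends, and returns to zero at the final step.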

Key considerations:

  • FSDP wraps at the transformer layer level (e.g., LlamaDecoderLayer)
  • Standard hyperparameters: global batch size 128, learning rate 2e-5, 3 epochs, warmup ratio 0.03
  • Gradient checkpointing trades compute for memory, enabling longer sequences
  • Training can resume from checkpoints automatically if they exist in the output directory
  • The Flash Attention (train_mem.py) and xformers (train_xformers.py) entry points reduce the memory used by attention
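
Two of the considerations above reduce to simple checks. The sketch below shows how a global batch size of 128 decomposes into per-device batch size, GPU count, and gradient accumulation steps, and how existing checkpoints can be detected in an output directory. Both helpers are hypothetical, the per-device batch size of 2 is only an illustrative value, and the "checkpoint-*" pattern relies on the Trainer's checkpoint-<step> directory naming.

```python
import pathlib
import tempfile


def gradient_accumulation_steps(global_batch, per_device_batch, num_gpus):
    """Steps such that per_device_batch * num_gpus * steps == global_batch."""
    per_step = per_device_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by per_device_batch * num_gpus")
    return global_batch // per_step


def has_checkpoint(output_dir):
    """True if output_dir contains at least one 'checkpoint-*' entry."""
    return any(pathlib.Path(output_dir).glob("checkpoint-*"))


# 4 GPUs with 2 sequences each need 16 accumulation steps to reach 128.
print(gradient_accumulation_steps(128, per_device_batch=2, num_gpus=4))

with tempfile.TemporaryDirectory() as out:
    print(has_checkpoint(out))  # fresh directory, nothing to resume from
    (pathlib.Path(out) / "checkpoint-500").mkdir()
    print(has_checkpoint(out))  # a checkpoint directory now exists
```

A training script can pass the result of has_checkpoint as the resume_from_checkpoint argument of Trainer.train, so a restarted job continues from the latest checkpoint instead of starting over.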

Step 6: Model Saving

Save the trained model weights, tokenizer, and training state. The saving procedure handles FSDP state dict consolidation (gathering sharded parameters to rank 0) and DeepSpeed engine state management.

Key considerations:

  • FSDP requires special handling via FullStateDictConfig with offload_to_cpu and rank0_only
  • With DeepSpeed, the Trainer's regular save_model method is called directly, without the FSDP state dict handling
  • The use_cache flag is re-enabled after training for inference compatibility
  • Training state (optimizer, scheduler) is also saved for potential resumption
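
The saving logic described above can be sketched as two small functions. This is an illustration under assumptions, not the workflow's actual code: trainer is assumed to be a transformers.Trainer (with an FSDP-wrapped model in the non-DeepSpeed case), the function names are hypothetical, and the PyTorch imports are placed inside the function so the file can be imported on machines without PyTorch.

```python
def save_fsdp_model(trainer):
    """Gather the full state dict on rank 0, offloaded to CPU, and save it."""
    from torch.distributed.fsdp import (
        FullStateDictConfig,
        FullyShardedDataParallel as FSDP,
        StateDictType,
    )

    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(
        trainer.model, StateDictType.FULL_STATE_DICT, save_policy
    ):
        trainer.save_model()


def finish_training(trainer, use_deepspeed):
    """Re-enable the KV cache, write the trainer state, then save the weights."""
    trainer.model.config.use_cache = True  # was disabled for training
    trainer.save_state()
    if use_deepspeed:
        trainer.save_model()  # no FSDP state dict handling is needed here
    else:
        save_fsdp_model(trainer)
```

Offloading to CPU and restricting the gather to rank 0 keeps the consolidated weights from being materialized on every GPU at once, which matters when a single device cannot hold the full model.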
