Workflow:Hpcaitech ColossalAI Supervised Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Distributed_Training |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
End-to-end process for supervised fine-tuning (SFT) of large language models on instruction-following datasets using ColossalAI's distributed training framework.
Description
This workflow covers the complete supervised fine-tuning pipeline for adapting pretrained LLMs to follow instructions. It uses ColossalAI's Booster abstraction with pluggable parallelism strategies (ZeRO-2, Gemini, 3D Parallelism, DDP) to train models across multiple GPUs efficiently. The pipeline supports optional LoRA adaptation for parameter-efficient training, gradient checkpointing for memory optimization, and Flash Attention for compute efficiency. Training data must be in tokenized Arrow format with instruction-response pairs.
Usage
Execute this workflow when you have a tokenized instruction-following dataset (in Arrow format) and need to fine-tune a pretrained causal language model (e.g., LLaMA, Qwen, ChatGLM) to follow instructions. This is the foundational training stage before applying alignment techniques such as DPO or RLHF.
Execution Steps
Step 1: Data Preparation
Transform raw instruction-response pairs into tokenized Arrow format using the dataset preparation scripts. The prepare_dataset.py utility handles multiple dataset formats (SFT, preference, prompt, KTO) and applies the appropriate conversation template for the target model family.
Key considerations:
- Dataset must be in tokenized Arrow format before training
- Apply the correct conversation template matching the target model (LLaMA-2, Qwen, ChatGLM, etc.)
- Split dataset into multiple shards for parallel loading
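The preparation step can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the behaviour of prepare_dataset.py: the `<user>`/`<assistant>` template, the `toy_tokenize` stand-in tokenizer, and the round-robin sharding are hypothetical, and writing the result to Arrow files is omitted. It shows one common SFT convention: prompt tokens are masked out of the loss with the ignore index -100, so only response tokens are supervised.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss


def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): one integer id per whitespace-separated word."""
    return [sum(word.encode()) % 1000 for word in text.split()]


def build_example(instruction, response):
    """Apply a hypothetical chat template and mask the prompt out of the labels."""
    prompt = f"<user> {instruction} <assistant>"  # assumed template, not a real model's
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    return {
        "input_ids": prompt_ids + response_ids,
        # Only response tokens carry a label; prompt positions are ignored by the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
    }


def split_into_shards(examples, num_shards):
    """Distribute examples round-robin so shards have near-equal size."""
    return [examples[i::num_shards] for i in range(num_shards)]


pairs = [("Say hi", "Hello there"), ("Add 2 and 3", "The answer is 5")]
examples = [build_example(q, a) for q, a in pairs]
shards = split_into_shards(examples, num_shards=2)
```

In a real run, the tokenizer and template come from the target model family, and each shard would be written out as a separate Arrow file.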
Step 2: Environment Initialization
Initialize the distributed training environment using ColossalAI's launcher. The launcher supports torchrun, SLURM, and OpenMPI backends and configures process groups, NCCL communication, and device assignment automatically.
What happens:
- Call colossalai.launch_from_torch() to set up distributed training
- Create a DistCoordinator for rank-aware operations
- Select least-utilized GPUs based on memory availability
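When the job is started with torchrun, each process receives its identity through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE, which colossalai.launch_from_torch() reads. The sketch below only illustrates that contract with the standard library; `read_torchrun_env` and `is_master` are hypothetical helpers, not ColossalAI functions, and the real launcher additionally sets up process groups and devices.

```python
import os


def read_torchrun_env(env=None):
    """Read the process identity that torchrun exports (defaults model a single process)."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node, usually the GPU id
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


def is_master(info):
    """Rank 0 conventionally handles logging and other rank-aware side effects."""
    return info["rank"] == 0


info = read_torchrun_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
if is_master(info):
    print("only the master rank would log here")
```

A coordinator object such as DistCoordinator plays the role of `is_master` in the actual pipeline, so that logging and similar side effects happen once rather than once per process.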
Step 3: Model Loading
Load the pretrained causal language model from HuggingFace or a local path. Optionally configure Flash Attention 2 for faster attention computation and set the mixed-precision data type (bf16 or fp16).
Key considerations:
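- Flash Attention 2 requires a compatible GPU architecture (Ampere or later) and half-precision (fp16 or bf16) inputs
- Mixed precision reduces memory usage; bf16 keeps the exponent range of fp32 and is usually the more stable choice, while fp16 typically relies on loss scaling to avoid underflow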
- LoRA adaptation can be applied at this stage to reduce trainable parameters
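To make the LoRA saving concrete, the arithmetic below counts trainable parameters for a single linear layer. It assumes the standard LoRA factorization, where a frozen weight of shape d_out x d_in is augmented with two low-rank matrices A (rank x d_in) and B (d_out x rank). The layer size 4096 and rank 8 are illustrative values, not defaults taken from this workflow.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the adapter: A has rank * d_in entries, B has d_out * rank."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight that LoRA keeps frozen."""
    return d_in * d_out


d = 4096  # illustrative hidden size
lora = lora_trainable_params(d, d, rank=8)
full = full_params(d, d)
fraction = lora / full

print(lora, full, fraction)  # 65536 16777216 0.00390625
```

For this layer the adapter trains about 0.4% of the parameters that full fine-tuning would update, which is where the reduction in gradient and optimizer-state memory comes from.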
Step 4: Plugin and Booster Configuration
Select and configure the parallelism strategy (plugin) that wraps the model, optimizer, and dataloader for distributed training. The Booster applies the chosen strategy transparently.
Available strategies:
- ZeRO Stage 2: Shards optimizer states and gradients across GPUs
- Gemini: Heterogeneous memory management across CPU and GPU
- 3D Parallelism: Combines tensor, pipeline, and data parallelism
- DDP: Standard distributed data parallelism
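A rough memory model helps when choosing between DDP and ZeRO Stage 2. The sketch below assumes the usual mixed-precision Adam accounting from the ZeRO literature: 2 bytes per parameter for half-precision weights, 2 bytes for gradients, and 12 bytes for optimizer state (fp32 master weights, momentum, and variance). It ignores activations, buffers, and fragmentation, and actual usage under ColossalAI's plugins may differ.

```python
BYTES_PARAM = 2   # fp16/bf16 working weights
BYTES_GRAD = 2    # fp16/bf16 gradients
BYTES_OPTIM = 12  # fp32 master weights + Adam momentum + Adam variance (4 + 4 + 4)


def ddp_bytes_per_gpu(num_params):
    """DDP replicates weights, gradients, and optimizer state on every GPU."""
    return num_params * (BYTES_PARAM + BYTES_GRAD + BYTES_OPTIM)


def zero2_bytes_per_gpu(num_params, num_gpus):
    """ZeRO-2 replicates weights but shards gradients and optimizer state."""
    sharded = num_params * (BYTES_GRAD + BYTES_OPTIM) / num_gpus
    return num_params * BYTES_PARAM + sharded


n = 7_000_000_000  # illustrative 7B-parameter model
print(ddp_bytes_per_gpu(n) / 1e9)       # 112.0 (GB of model state per GPU)
print(zero2_bytes_per_gpu(n, 8) / 1e9)  # 26.25
```

Under these assumptions, a 7B model that does not fit on a single 80 GB device with DDP fits comfortably with ZeRO-2 across 8 GPUs, leaving room for activations.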
Step 5: Optimizer and Scheduler Setup
Configure the HybridAdam optimizer with a learning rate and weight decay, and the CosineAnnealingWarmupLR scheduler with a linear warmup phase (by default 2.5% of total steps).
What happens:
- Create HybridAdam optimizer with configurable lr, weight_decay, and betas
- Calculate total training steps from the number of epochs, the number of batches per epoch, and the gradient accumulation steps
- Initialize cosine annealing scheduler with linear warmup
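The step arithmetic and schedule shape can be sketched in plain Python. This is an approximation under stated assumptions, not the CosineAnnealingWarmupLR implementation: the function names are hypothetical, warmup is taken as a linear ramp to the base learning rate, and decay follows a half cosine down to `eta_min`. The 2.5% warmup ratio comes from the default mentioned above.

```python
import math


def total_and_warmup_steps(num_batches, accumulation_steps, num_epochs, warmup_ratio=0.025):
    """One optimizer update happens every `accumulation_steps` batches."""
    updates_per_epoch = num_batches // accumulation_steps
    total = num_epochs * updates_per_epoch
    warmup = int(total * warmup_ratio)
    return total, warmup


def lr_at(step, base_lr, total, warmup, eta_min=0.0):
    """Linear warmup to base_lr, then cosine decay towards eta_min (assumed shape)."""
    if warmup > 0 and step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


total, warmup = total_and_warmup_steps(num_batches=1000, accumulation_steps=8, num_epochs=3)
print(total, warmup)  # 375 9
schedule = [lr_at(s, 2e-5, total, warmup) for s in range(total)]
```

With 1000 batches per epoch, 8 accumulation steps, and 3 epochs, the optimizer performs 375 updates, of which the first 9 are warmup.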
Step 6: Training Execution
Run the SFT training loop via the SFTTrainer class. The trainer handles forward/backward passes, gradient accumulation, checkpoint saving at intervals, and optional evaluation after each epoch.
What happens:
- For standard parallelism: iterate through batches, compute the cross-entropy loss against the labels, and run the backward pass through the booster
- For pipeline parallelism: use booster.execute_pipeline() for staged execution
- Accumulate gradients over a configurable number of steps before each optimizer update
- Log loss and learning rate to TensorBoard and optionally Weights & Biases
- Save checkpoints at configurable intervals with full training state
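The gradient accumulation logic can be illustrated without any deep learning framework. The toy below is not SFTTrainer code: it fits a one-parameter model y = w * x with a hand-written mean-squared-error gradient, and the names `grad_mse` and `train` are hypothetical. It shows the key detail that each micro-batch gradient is divided by the number of accumulation steps, so the accumulated update matches a single step on the larger effective batch.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, micro_batch_size, accumulation_steps, lr, w=0.0):
    """One pass over `data`, updating w only every `accumulation_steps` micro-batches."""
    micro_batches = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    accumulated = 0.0
    updates = 0
    for i, micro_batch in enumerate(micro_batches):
        # Scale each contribution so the sum is an average over the effective batch.
        accumulated += grad_mse(w, micro_batch) / accumulation_steps
        if (i + 1) % accumulation_steps == 0:
            w -= lr * accumulated  # the "optimizer step"
            accumulated = 0.0      # the "zero_grad"
            updates += 1
    return w, updates


data = [(float(x), 2.0 * x) for x in range(1, 9)]  # 8 samples of y = 2x
w_accumulated, n_updates = train(data, micro_batch_size=2, accumulation_steps=4, lr=0.001)
w_full_batch, _ = train(data, micro_batch_size=8, accumulation_steps=1, lr=0.001)
```

Here four micro-batches of two samples produce one update that matches a single step on all eight samples, which is why accumulation lets a small per-GPU batch emulate a larger one at the cost of more forward and backward passes per update.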
Step 7: Model Saving
Save the final fine-tuned model checkpoint with sharded weights. For LoRA training, the model is set to eval mode to merge adapter weights before saving.
Key considerations:
- The model is saved as sharded checkpoint files (shard_size=10) so that individual files stay manageable
- LoRA weights are merged into the base model during eval mode before saving
- Checkpoint includes model weights, optimizer state, scheduler state, and training metadata
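What merging adapter weights means can be shown with small matrices. The sketch assumes the standard LoRA formulation, in which the merged weight is W + (alpha / rank) * B A; the helper names are hypothetical and the scaling used by ColossalAI's LoRA implementation is not confirmed here. The point is that the merged weight yields the same output as the base weight plus the adapter path, so the saved model needs no adapter at inference time.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold the low-rank update into the dense weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def apply(weight, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
lora_a = [[1.0, 2.0]]              # A: rank x d_in  (1 x 2)
lora_b = [[0.5], [1.0]]            # B: d_out x rank (2 x 1)

merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)                     # [[2.0, 2.0], [2.0, 5.0]]
print(apply(merged, [1.0, 1.0]))  # [4.0, 7.0]
```

Running the unmerged path by hand gives the same result: the base output [1, 1] plus the scaled adapter output 2 * B(Ax) = [3, 6] equals [4, 7].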