Workflow:Hpcaitech ColossalAI Supervised Finetuning

Knowledge Sources	ColossalAI ColossalAI Docs ColossalChat Examples README
Domains	LLMs, Fine_Tuning, Distributed_Training
Last Updated	2026-02-09 03:00 GMT

Overview

End-to-end process for supervised fine-tuning (SFT) of large language models on instruction-following datasets using ColossalAI's distributed training framework.

Description

This workflow covers the complete supervised fine-tuning pipeline for adapting pretrained LLMs to follow instructions. It uses ColossalAI's Booster abstraction with pluggable parallelism strategies (ZeRO-2, Gemini, 3D Parallelism, DDP) to train models across multiple GPUs efficiently. The pipeline supports optional LoRA adaptation for parameter-efficient training, gradient checkpointing for memory optimization, and Flash Attention for compute efficiency. Training data must be in tokenized Arrow format with instruction-response pairs.

Usage

Execute this workflow when you have a tokenized instruction-following dataset (in Arrow format) and need to fine-tune a pretrained causal language model (e.g., LLaMA, Qwen, ChatGLM) to follow instructions. This is the foundational training stage before applying alignment techniques such as DPO or RLHF.

Execution Steps

Step 1: Data Preparation

Transform raw instruction-response pairs into tokenized Arrow format using the dataset preparation scripts. The prepare_dataset.py utility handles multiple dataset formats (SFT, preference, prompt, KTO) and applies the appropriate conversation template for the target model family.

Key considerations:

Dataset must be in tokenized Arrow format before training
Apply the correct conversation template matching the target model (LLaMA-2, Qwen, ChatGLM, etc.)
Split dataset into multiple shards for parallel loading
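
The sharding step can be pictured with a minimal sketch. It assumes the tokenized samples are already in memory as dictionaries. The helper name split_into_shards and the toy records are hypothetical and are not part of prepare_dataset.py, which writes Arrow files rather than Python lists.

```python
from typing import Dict, List


def split_into_shards(records: List[Dict], num_shards: int) -> List[List[Dict]]:
    """Distribute records round-robin so shard sizes differ by at most one."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[Dict]] = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards


# Toy records standing in for tokenized instruction-response pairs (hypothetical data).
records = [{"input_ids": [i, i + 1], "labels": [i, i + 1]} for i in range(10)]
shards = split_into_shards(records, num_shards=4)
shard_sizes = [len(shard) for shard in shards]
print(shard_sizes)
```

Round-robin assignment keeps the shards balanced, so parallel loaders finish at roughly the same time.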

Step 2: Environment Initialization

Initialize the distributed training environment using ColossalAI's launcher. The launcher supports torchrun, SLURM, and OpenMPI backends and configures process groups, NCCL communication, and device assignment automatically.

What happens:

Call colossalai.launch_from_torch() to set up distributed training
Create a DistCoordinator for rank-aware operations
Select least-utilized GPUs based on memory availability
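
As a rough illustration of what the launcher works from: torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker process, and a torch-based launcher reads them to build the process group. The sketch below only parses those variables. The helper names read_torchrun_env and is_master are hypothetical and merely mimic the kind of rank check a DistCoordinator provides.

```python
import os
from typing import Dict, Mapping


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the rank variables that torchrun exports for each worker process."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def is_master(info: Dict[str, int]) -> bool:
    """Rank 0 conventionally handles logging and checkpoint bookkeeping."""
    return info["rank"] == 0


# Simulated environment for the second worker of a 4-GPU single-node job.
fake_env = {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}
info = read_torchrun_env(fake_env)
print(info, is_master(info))
```

When the script is not started through torchrun, the defaults describe a single-process run.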

Step 3: Model Loading

Load the pretrained causal language model from HuggingFace or a local path. Optionally configure Flash Attention 2 for faster attention computation and set the mixed-precision data type (bf16 or fp16).

Key considerations:

Flash Attention 2 requires compatible GPU architecture (Ampere or later)
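Mixed precision (bf16 or fp16) reduces memory usage; bf16 is usually the more stable choice because it keeps the fp32 exponent range, whereas fp16 typically needs loss scaling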
LoRA adaptation can be applied at this stage to reduce trainable parameters
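
To make the LoRA saving concrete: a rank-r adapter on a weight matrix of shape d_out x d_in trains two small matrices with r * (d_in + d_out) parameters in total, instead of d_out * d_in. The layer size and rank below are illustrative assumptions, not values taken from this workflow.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameter count of a dense weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameter count of the two low-rank factors (d_out x rank and rank x d_in)."""
    return rank * (d_in + d_out)


# Assumed example: one 4096 x 4096 projection with LoRA rank 8.
dense = full_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
fraction = adapter / dense
print(dense, adapter, fraction)
```

In this example the adapter holds well under one percent of the dense layer's parameters.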

Step 4: Plugin and Booster Configuration

Select and configure the parallelism strategy (plugin) that wraps the model, optimizer, and dataloader for distributed training. The Booster applies the chosen strategy transparently.

Available strategies:

ZeRO Stage 2: Shards optimizer states and gradients across GPUs
Gemini: Heterogeneous memory management across CPU and GPU
3D Parallelism: Combines tensor, pipeline, and data parallelism
DDP: Standard distributed data parallelism
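
One possible way to wire the strategy choice is a small lookup table that maps a strategy key to a plugin class in colossalai.booster.plugin. This is a sketch under assumptions: the keys ("zero2", "gemini", "3d", "ddp"), the tp_size/pp_size values, and the helper names are illustrative, and constructor arguments may differ between ColossalAI versions.

```python
from typing import Any, Dict, Tuple


def plugin_spec(strategy: str, precision: str = "bf16") -> Tuple[str, Dict[str, Any]]:
    """Map a strategy key to a plugin class name and illustrative keyword arguments."""
    specs: Dict[str, Tuple[str, Dict[str, Any]]] = {
        "zero2": ("LowLevelZeroPlugin", {"stage": 2, "precision": precision}),
        "gemini": ("GeminiPlugin", {"precision": precision, "placement_policy": "static"}),
        # tp_size and pp_size are assumed values; choose them to match the GPU count.
        "3d": ("HybridParallelPlugin", {"tp_size": 2, "pp_size": 1, "precision": precision}),
        "ddp": ("TorchDDPPlugin", {}),
    }
    if strategy not in specs:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return specs[strategy]


def build_booster(strategy: str, precision: str = "bf16"):
    """Instantiate the plugin and wrap it in a Booster (requires ColossalAI)."""
    # Imported lazily so the spec table can be inspected without ColossalAI installed.
    import colossalai.booster.plugin as plugins
    from colossalai.booster import Booster

    class_name, kwargs = plugin_spec(strategy, precision)
    return Booster(plugin=getattr(plugins, class_name)(**kwargs))


zero2_class, zero2_kwargs = plugin_spec("zero2")
print(zero2_class, zero2_kwargs)
```

The resulting booster would then wrap the model, optimizer, dataloader, and scheduler through booster.boost(), inside a process started by the distributed launcher.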

Step 5: Optimizer and Scheduler Setup

Configure the HybridAdam optimizer with a learning rate and weight decay, and the CosineAnnealingWarmupLR scheduler with a warmup phase (by default 2.5% of the total training steps).

What happens:

Create HybridAdam optimizer with configurable lr, weight_decay, and betas
Calculate total training steps based on dataset size and accumulation steps
Initialize cosine annealing scheduler with linear warmup
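
The step arithmetic and the shape of the schedule can be sketched in plain Python. This is a generic linear-warmup plus cosine-annealing curve under assumed numbers (1000 batches per epoch, 8 accumulation steps, 3 epochs, base learning rate 2e-5). It is not the exact CosineAnnealingWarmupLR implementation, and all function names are hypothetical.

```python
import math


def total_update_steps(batches_per_epoch: int, accumulation_steps: int, max_epochs: int) -> int:
    """Optimizer updates over the whole run: one per accumulation window."""
    return max_epochs * (batches_per_epoch // accumulation_steps)


def warmup_steps_for(total_steps: int, warmup_ratio: float = 0.025) -> int:
    """Warmup length as a fraction of total steps (2.5% by default)."""
    return int(total_steps * warmup_ratio)


def lr_at(step: int, base_lr: float, total_steps: int, warmup_steps: int, eta_min: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay towards eta_min."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


total = total_update_steps(batches_per_epoch=1000, accumulation_steps=8, max_epochs=3)
warmup = warmup_steps_for(total)
base_lr = 2e-5
print(total, warmup, lr_at(0, base_lr, total, warmup), lr_at(total, base_lr, total, warmup))
```

The learning rate peaks at the end of warmup and decays smoothly to eta_min by the final step.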

Step 6: Training Execution

Run the SFT training loop via the SFTTrainer class. The trainer handles forward/backward passes, gradient accumulation, checkpoint saving at intervals, and optional evaluation after each epoch.

What happens:

For standard parallelism: iterate through batches, compute the cross-entropy loss against the labels, and run the backward pass via booster.backward()
For pipeline parallelism: use booster.execute_pipeline() for staged execution
Accumulate gradients over a configurable number of steps before each optimizer update
Log loss and learning rate to TensorBoard and optionally Weights & Biases
Save checkpoints at configurable intervals with full training state
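
The gradient-accumulation pattern in the loop can be shown with a toy one-parameter model in plain Python. This is a sketch of the general technique, not the SFTTrainer code. In the real workflow the backward pass would go through the booster and the update through the wrapped optimizer. Dividing each micro-batch gradient by the number of accumulation steps is a common convention, assumed here.

```python
from typing import List, Tuple


def train_with_accumulation(
    batches: List[Tuple[float, float]],
    accumulation_steps: int,
    lr: float = 0.1,
    weight: float = 0.0,
) -> Tuple[float, int]:
    """Fit y = weight * x with squared error, updating once per accumulation window."""
    grad_buffer = 0.0
    optimizer_steps = 0
    for index, (x, y) in enumerate(batches):
        # Forward pass of the toy model.
        prediction = weight * x
        # Backward pass: d/dw (prediction - y)^2, scaled so the buffer holds an average.
        grad_buffer += 2.0 * (prediction - y) * x / accumulation_steps
        if (index + 1) % accumulation_steps == 0:
            # Optimizer update, then reset the accumulated gradient.
            weight -= lr * grad_buffer
            grad_buffer = 0.0
            optimizer_steps += 1
    return weight, optimizer_steps


batches = [(1.0, 2.0)] * 8
final_weight, steps = train_with_accumulation(batches, accumulation_steps=4)
print(final_weight, steps)
```

Eight micro-batches with four accumulation steps yield two optimizer updates, which is why the total step count is computed from the number of batches divided by the accumulation steps.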

Step 7: Model Saving

Save the final fine-tuned model checkpoint with sharded weights. For LoRA training, the model is set to eval mode to merge adapter weights before saving.

Key considerations:

Model is saved with shard_size=10 for manageable checkpoint files
LoRA weights are merged into the base model during eval mode before saving
Checkpoint includes model weights, optimizer state, scheduler state, and training metadata

GitHub URL

Workflow Repository