Workflow:Hpcaitech ColossalAI Supervised Finetuning

From Leeroopedia


Knowledge Sources
Domains: LLMs, Fine_Tuning, Distributed_Training
Last Updated: 2026-02-09 03:00 GMT

Overview

End-to-end process for supervised fine-tuning (SFT) of large language models on instruction-following datasets using ColossalAI's distributed training framework.

Description

This workflow covers the complete supervised fine-tuning pipeline for adapting pretrained LLMs to follow instructions. It uses ColossalAI's Booster abstraction with pluggable parallelism strategies (ZeRO-2, Gemini, 3D Parallelism, DDP) to train models across multiple GPUs efficiently. The pipeline supports optional LoRA adaptation for parameter-efficient training, gradient checkpointing for memory optimization, and Flash Attention for compute efficiency. Training data must be in tokenized Arrow format with instruction-response pairs.

Usage

Execute this workflow when you have a tokenized instruction-following dataset (in Arrow format) and need to fine-tune a pretrained causal language model (e.g., LLaMA, Qwen, ChatGLM) to follow instructions. This is the foundational training stage before applying alignment techniques such as DPO or RLHF.

Execution Steps

Step 1: Data Preparation

Transform raw instruction-response pairs into tokenized Arrow format using the dataset preparation scripts. The prepare_dataset.py utility handles multiple dataset formats (SFT, preference, prompt, KTO) and applies the appropriate conversation template for the target model family.

Key considerations:

  • Dataset must be in tokenized Arrow format before training
  • Apply the correct conversation template matching the target model (LLaMA-2, Qwen, ChatGLM, etc.)
  • Split dataset into multiple shards for parallel loading
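
The considerations above can be sketched in plain Python. This is a minimal illustration, not the logic of prepare_dataset.py: `build_sft_sample` and `shard_round_robin` are hypothetical helpers, and masking prompt tokens with -100 is an assumption based on the common PyTorch cross-entropy convention.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy


def build_sft_sample(prompt_ids: List[int], response_ids: List[int],
                     max_length: int) -> Dict[str, List[int]]:
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}


def shard_round_robin(samples: List[dict], num_shards: int) -> List[List[dict]]:
    """Spread samples over shards so that several loaders can read in parallel."""
    shards: List[List[dict]] = [[] for _ in range(num_shards)]
    for index, sample in enumerate(samples):
        shards[index % num_shards].append(sample)
    return shards


# Token IDs here are made up for illustration.
sample = build_sft_sample(prompt_ids=[1, 15, 27], response_ids=[42, 43, 2], max_length=8)
shards = shard_round_robin([sample] * 5, num_shards=2)
```

Only the response positions keep real label IDs, so the loss is computed on the answer and not on the instruction text.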

Step 2: Environment Initialization

Initialize the distributed training environment using ColossalAI's launcher. The launcher supports torchrun, SLURM, and OpenMPI backends and configures process groups, NCCL communication, and device assignment automatically.

What happens:

  • Call colossalai.launch_from_torch() to set up distributed training
  • Create a DistCoordinator for rank-aware operations
  • Select least-utilized GPUs based on memory availability
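
When the script is started with torchrun, each process receives its coordinates through environment variables, and a launcher such as colossalai.launch_from_torch() reads them. The sketch below only illustrates that hand-off with the standard library; `DistEnv` and `read_torchrun_env` are hypothetical names and do not reproduce ColossalAI's implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Process coordinates exported by torchrun (hypothetical container)."""
    rank: int
    local_rank: int
    world_size: int

    @property
    def is_master(self) -> bool:
        # Rank 0 usually handles logging and checkpoint bookkeeping.
        return self.rank == 0

    @property
    def device(self) -> str:
        # One process per GPU: the local rank selects the device on this node.
        return f"cuda:{self.local_rank}"


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse RANK, LOCAL_RANK and WORLD_SIZE as set by torchrun."""
    try:
        return DistEnv(int(env["RANK"]), int(env["LOCAL_RANK"]), int(env["WORLD_SIZE"]))
    except KeyError as missing:
        raise RuntimeError(
            f"{missing.args[0]} is not set; start the script with torchrun"
        ) from missing


# Example values for the fourth process of an 8-process job.
demo_env = read_torchrun_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})
```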

Step 3: Model Loading

Load the pretrained causal language model from HuggingFace or a local path. Optionally configure Flash Attention 2 for faster attention computation and set the mixed-precision data type (bf16 or fp16).

Key considerations:

  • Flash Attention 2 requires compatible GPU architecture (Ampere or later)
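  • Mixed precision (bf16 or fp16) roughly halves the memory needed for weights and activations compared with fp32; bf16 keeps the fp32 exponent range, so it usually trains stably without the loss scaling that fp16 requires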
  • LoRA adaptation can be applied at this stage to reduce trainable parameters
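
To make the LoRA point concrete, the arithmetic below compares the trainable parameters of one fully fine-tuned linear layer with those of its LoRA adapter. It is a sketch under stated assumptions: the hidden size of 4096 and rank of 8 are illustrative values, not settings taken from this workflow, and the function names are hypothetical.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains A (rank x d_in) and B (d_out x rank); the base weight stays frozen."""
    return rank * (d_in + d_out)


hidden = 4096  # illustrative hidden size
rank = 8       # illustrative LoRA rank

full = full_trainable_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
ratio = lora / full  # fraction of the layer's parameters that LoRA trains
```

For this square layer the adapter trains 65,536 parameters instead of 16,777,216, which is 1/256 of the layer.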

Step 4: Plugin and Booster Configuration

Select and configure the parallelism strategy (plugin) that wraps the model, optimizer, and dataloader for distributed training. The Booster applies the chosen strategy transparently.

Available strategies:

  • ZeRO Stage 2: Shards optimizer states and gradients across GPUs
  • Gemini: Heterogeneous memory management across CPU and GPU
  • 3D Parallelism: Combines tensor, pipeline, and data parallelism
  • DDP: Standard distributed data parallelism
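
A rough memory estimate helps when choosing between DDP and ZeRO Stage 2. The sketch below assumes mixed-precision Adam accounting as commonly described for ZeRO: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer states (fp32 master weights, momentum, variance). It ignores activations and buffers, and `model_state_bytes` is a hypothetical helper, not a ColossalAI function.

```python
WEIGHT_BYTES = 2      # half-precision parameters
GRAD_BYTES = 2        # half-precision gradients
OPTIMIZER_BYTES = 12  # fp32 master weights + Adam momentum + Adam variance


def model_state_bytes(num_params: int, num_gpus: int, strategy: str) -> float:
    """Estimate per-GPU bytes for model states under 'ddp' or 'zero2'."""
    if strategy == "ddp":
        # Every GPU holds a full replica of weights, gradients and optimizer states.
        return float((WEIGHT_BYTES + GRAD_BYTES + OPTIMIZER_BYTES) * num_params)
    if strategy == "zero2":
        # Weights are replicated; gradients and optimizer states are sharded.
        sharded = (GRAD_BYTES + OPTIMIZER_BYTES) * num_params / num_gpus
        return WEIGHT_BYTES * num_params + sharded
    raise ValueError(f"unknown strategy: {strategy}")


params_7b = 7_000_000_000
ddp_gb = model_state_bytes(params_7b, 8, "ddp") / 1e9
zero2_gb = model_state_bytes(params_7b, 8, "zero2") / 1e9
```

Under these assumptions a 7B-parameter model needs about 112 GB of model states per GPU with DDP and about 26.25 GB per GPU with ZeRO Stage 2 on 8 GPUs, which is why plain DDP is usually reserved for models that fit comfortably on one device.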

Step 5: Optimizer and Scheduler Setup

Configure the HybridAdam optimizer with a learning rate and weight decay, and the CosineAnnealingWarmupLR scheduler with a warmup phase (by default 2.5% of total steps).

What happens:

  • Create HybridAdam optimizer with configurable lr, weight_decay, and betas
  • Calculate total training steps from the number of batches per epoch, the number of epochs, and the gradient accumulation steps
  • Initialize cosine annealing scheduler with linear warmup
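
The step count and schedule shape can be sketched without any framework. This is an illustration under assumptions: floor division for updates per epoch and the exact warmup and cosine formulas are guesses at a typical implementation, and may differ in detail from ColossalAI's CosineAnnealingWarmupLR. The function names are hypothetical.

```python
import math


def total_update_steps(batches_per_epoch: int, accumulation_steps: int,
                       num_epochs: int) -> int:
    """Optimizer updates over the whole run (assumes floor division per epoch)."""
    return num_epochs * (batches_per_epoch // accumulation_steps)


def lr_at(step: int, total_steps: int, base_lr: float,
          warmup_ratio: float = 0.025, eta_min: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay towards eta_min."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


steps = total_update_steps(batches_per_epoch=1000, accumulation_steps=8, num_epochs=3)
schedule = [lr_at(s, steps, base_lr=2e-5) for s in range(steps)]
```

With 1000 batches per epoch, 8 accumulation steps, and 3 epochs, the run has 375 updates, of which the first 9 are warmup.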

Step 6: Training Execution

Run the SFT training loop via the SFTTrainer class. The trainer handles forward/backward passes, gradient accumulation, checkpoint saving at intervals, and optional evaluation after each epoch.

What happens:

  • For standard parallelism: iterate through batches, compute the cross-entropy loss against the labels, and run the backward pass through the booster
  • For pipeline parallelism: use booster.execute_pipeline() for staged execution
  • Accumulate gradients over a configurable number of steps before each optimizer update
  • Log loss and learning rate to TensorBoard and optionally Weights & Biases
  • Save checkpoints at configurable intervals with full training state
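
Gradient accumulation, as listed above, can be illustrated with a one-parameter model in plain Python. This is a toy sketch, not the SFTTrainer loop: the squared-error loss, the data, and the helper names are assumptions chosen so the arithmetic is easy to check. It shows that scaling each micro-batch gradient by the number of accumulation steps reproduces the full-batch gradient when micro-batches have equal size.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def mse_grad(w: float, batch: Sequence[Example]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)


def accumulated_grad(w: float, micro_batches: List[Sequence[Example]]) -> float:
    """Sum micro-batch gradients, each scaled by 1 / accumulation_steps."""
    accumulation_steps = len(micro_batches)
    grad = 0.0
    for micro_batch in micro_batches:
        grad += mse_grad(w, micro_batch) / accumulation_steps
    return grad


data: List[Example] = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

full_batch = mse_grad(w, data)
accumulated = accumulated_grad(w, [data[:2], data[2:]])

# One optimizer update after the accumulation window (plain SGD for illustration).
learning_rate = 0.01
w_next = w - learning_rate * accumulated
```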

Step 7: Model Saving

Save the final fine-tuned model checkpoint with sharded weights. For LoRA training, the model is set to eval mode to merge adapter weights before saving.

Key considerations:

  • Model is saved with shard_size=10 for manageable checkpoint files
  • LoRA weights are merged into the base model during eval mode before saving
  • Checkpoint includes model weights, optimizer state, scheduler state, and training metadata
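
Merging LoRA weights amounts to folding the low-rank update into the base matrix. The sketch below shows only that arithmetic on small Python lists, assuming the standard LoRA formulation W' = W + (alpha / r) * B A; `matmul` and `merge_lora` are hypothetical helpers and do not reflect how ColossalAI performs the merge internally.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, with A of shape r x d_in and B of shape d_out x r."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2 x 2)
lora_a = [[1.0, 2.0]]            # A: rank 1 x d_in 2
lora_b = [[3.0], [4.0]]          # B: d_out 2 x rank 1

merged = merge_lora(base, lora_a, lora_b, alpha=2.0, rank=1)

# The merged weight gives the same output as base path plus adapter path.
x = [[1.0], [1.0]]
merged_out = matmul(merged, x)
```

After merging, inference needs only the single matrix, so the saved checkpoint can be loaded without the adapter code.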
