Implementation:Hpcaitech ColossalAI LoRA Finetune Script

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Fine-tuning, LoRA, Distributed Training, MoE
Last Updated	2026-02-09 00:00 GMT

Overview

lora_finetune.py is a training script for supervised fine-tuning (SFT) of large language models, including Mixture-of-Experts (MoE) models like DeepSeek V3/R1, with optional LoRA (Low-Rank Adaptation) support.

Description

This script implements a complete supervised fine-tuning pipeline using ColossalAI's Booster framework. It supports several parallelism strategies: DDP, Gemini, ZeRO-2, 3D parallelism (TP+PP+SP), and MoE hybrid parallelism. The script loads a pretrained causal language model, optionally applies LoRA adapters via PEFT, and configures distributed training with gradient checkpointing, mixed precision (fp16/bf16), gradient accumulation, and cosine annealing learning rate scheduling. It provides separate training loops for pipeline-parallel and non-pipeline-parallel runs, both with TensorBoard logging.
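To make the interaction between gradient accumulation and cosine annealing concrete, here is a minimal sketch in plain Python. It is not taken from the script: the function names are hypothetical, the schedule is the textbook cosine annealing formula without any warmup the script may apply, and rounding the number of optimizer steps down is an assumption.

```python
import math


def optimizer_steps(batches_per_epoch: int, num_epochs: int, accumulation_steps: int) -> int:
    """Number of optimizer updates when gradients are accumulated over several batches.

    Assumption: leftover batches that do not fill an accumulation window are dropped.
    """
    return num_epochs * (batches_per_epoch // accumulation_steps)


def cosine_annealing_lr(step: int, total_steps: int, base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Learning rate at optimizer step `step` under plain cosine annealing."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps map to the start or end of the schedule.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine
```

The schedule is indexed by optimizer steps rather than by batches, so raising --accumulation_steps shortens the schedule for a fixed dataset size.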

Usage

Use this script for supervised fine-tuning of causal language models on conversational datasets. It is particularly suited to fine-tuning large MoE models such as DeepSeek V3 with LoRA, which reduces the number of trainable parameters. Launch it with torchrun or the ColossalAI launcher for distributed execution.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/examples/training_scripts/lora_finetune.py
Lines: 1-455

Signature

def all_reduce_mean(loss: torch.Tensor, plugin: Plugin) -> torch.Tensor

def train(args) -> None
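The signature of all_reduce_mean suggests a helper that averages a per-rank loss across processes, presumably so that the logged value is the same on every rank. The sketch below only simulates that arithmetic in plain Python, with no torch.distributed involved. The function name is hypothetical, and how the real helper uses its plugin argument (for example, to pick a process group) is not modeled.

```python
from typing import List


def simulated_all_reduce_mean(per_rank_losses: List[float]) -> List[float]:
    """Return the value each rank would hold after a SUM all-reduce
    followed by division by the number of participating ranks."""
    world_size = len(per_rank_losses)
    if world_size == 0:
        raise ValueError("at least one rank is required")
    total = sum(per_rank_losses)   # what a SUM all-reduce leaves on every rank
    mean = total / world_size      # each rank then divides by the group size
    return [mean] * world_size
```

For example, four ranks holding losses 1.0, 2.0, 3.0, and 6.0 would all end up with 3.0.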

Import

# This is a standalone training script, typically run directly:
# torchrun --nproc_per_node=<N> lora_finetune.py -m <model_path> -d <dataset_path>

I/O Contract

Inputs

Name	Type	Required	Description
-m, --pretrained	str	Yes	Path to the pretrained model
-d, --dataset	str	Yes	Path to raw JSONL dataset for training
-p, --plugin	str	No	Plugin choice: gemini, gemini_auto, zero2, zero2_cpu, 3d, ddp, moe (default: zero2)
--save_dir	str	No	Checkpoint directory (default: checkpoint_dir)
--tensorboard_dir	str	No	TensorBoard log directory
-n, --num_epochs	int	No	Number of training epochs (default: 1)
--accumulation_steps	int	No	Gradient accumulation steps (default: 1)
--batch_size	int	No	Batch size per process (default: 2)
--lr	float	No	Learning rate (default: 3e-4)
--max_length	int	No	Model max sequence length (default: 8192)
--mixed_precision	str	No	Mixed precision mode: fp16 or bf16 (default: bf16)
--grad_clip	float	No	Gradient clipping value (default: 1.0)
--lora_rank	int	No	LoRA rank; 0 disables LoRA (default: 0)
--lora_alpha	int	No	LoRA alpha scaling (default: 8)
--tp	int	No	Tensor parallelism size (default: 1)
--pp	int	No	Pipeline parallelism size (default: 1)
--sp	int	No	Sequence parallelism size (default: 1)
--ep	int	No	Expert parallelism size for MoE (default: 1)
-g, --use_grad_checkpoint	flag	No	Enable gradient checkpointing
-f, --use_flash_attn	flag	No	Enable flash attention
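As an illustration of how the table above maps onto a command-line parser, here is a sketch covering a subset of the flags with argparse. The flag names, types, and defaults come from the table. The help strings, the use of `choices`, and the `build_parser` function name are assumptions and may differ from the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of part of the lora_finetune.py CLI")
    parser.add_argument("-m", "--pretrained", type=str, required=True, help="Path to the pretrained model")
    parser.add_argument("-d", "--dataset", type=str, required=True, help="Path to raw JSONL dataset")
    parser.add_argument(
        "-p", "--plugin", type=str, default="zero2",
        # Assumption: the real script may not restrict values via `choices`.
        choices=["gemini", "gemini_auto", "zero2", "zero2_cpu", "3d", "ddp", "moe"],
    )
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--mixed_precision", type=str, default="bf16", choices=["fp16", "bf16"])
    parser.add_argument("--lora_rank", type=int, default=0, help="0 disables LoRA")
    parser.add_argument("--lora_alpha", type=int, default=8)
    parser.add_argument("-g", "--use_grad_checkpoint", action="store_true")
    parser.add_argument("-f", "--use_flash_attn", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv, so the sketch runs anywhere.
example_args = build_parser().parse_args(
    ["-m", "deepseek-ai/DeepSeek-V3", "-d", "./train_data.jsonl", "--lora_rank", "16", "-g"]
)
```

Flags that are not passed keep their table defaults, so `example_args.plugin` is `zero2` and `example_args.use_flash_attn` is `False` here.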

Outputs

Name	Type	Description
checkpoint	directory	Model checkpoint saved to <save_dir>/modeling (full model) or <save_dir>/lora (LoRA adapters)
tensorboard_logs	directory	Training loss, learning rate, and gradient norm logs
config_file	JSON	Training configuration written to the path given by --config_file
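The output locations can be sketched as follows. This is a guess at the selection logic, not the script's code: it assumes the lora subdirectory is used whenever --lora_rank is greater than 0 and the modeling subdirectory otherwise, and that the configuration is serialized as a flat JSON object. Both helper names are hypothetical.

```python
import json
import os


def checkpoint_subdir(save_dir: str, lora_rank: int) -> str:
    """Pick the checkpoint subdirectory (assumed rule: LoRA enabled means 'lora')."""
    subdir = "lora" if lora_rank > 0 else "modeling"
    return os.path.join(save_dir, subdir)


def config_to_json(config: dict) -> str:
    """Serialize a training configuration to a JSON string (layout is an assumption)."""
    return json.dumps(config, indent=4, sort_keys=True)
```

With the default --save_dir, `checkpoint_subdir("checkpoint_dir", 16)` points at the lora subdirectory and `checkpoint_subdir("checkpoint_dir", 0)` at the modeling subdirectory.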

Usage Examples

# Fine-tune with LoRA on 4 GPUs using ZeRO-2:
# torchrun --nproc_per_node=4 lora_finetune.py \
#     -m deepseek-ai/DeepSeek-V3 \
#     -d ./train_data.jsonl \
#     -p zero2 \
#     --lora_rank 16 \
#     --lora_alpha 32 \
#     --lr 3e-4 \
#     --max_length 4096 \
#     --mixed_precision bf16 \
#     -g -f

# Fine-tune MoE model with expert parallelism:
# torchrun --nproc_per_node=8 lora_finetune.py \
#     -m deepseek-ai/DeepSeek-V3 \
#     -d ./train_data.jsonl \
#     -p moe \
#     --ep 4 \
#     --tp 2
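When choosing --batch_size, --accumulation_steps, and the parallelism sizes for commands like the ones above, it can help to estimate how many sequences contribute to each optimizer step. The sketch below uses the common rule that the data-parallel size is the world size divided by tp * pp. That rule is an assumption here: it ignores --sp and --ep, and the ColossalAI plugins may derive their process groups differently. The function names are hypothetical.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1) -> int:
    """Data-parallel replicas left after tensor and pipeline parallelism (assumed rule)."""
    model_parallel = tp * pp
    if model_parallel <= 0 or world_size % model_parallel != 0:
        raise ValueError("world_size must be a positive multiple of tp * pp")
    return world_size // model_parallel


def sequences_per_optimizer_step(
    batch_size: int, accumulation_steps: int, world_size: int, tp: int = 1, pp: int = 1
) -> int:
    """Per-process batch size times accumulation steps times data-parallel replicas."""
    return batch_size * accumulation_steps * data_parallel_size(world_size, tp, pp)
```

Under these assumptions and the default --batch_size 2 and --accumulation_steps 1, the 4-GPU ZeRO-2 command processes 8 sequences per optimizer step, and so does the 8-GPU command with --tp 2.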

Key Features

LoRA Support - Applies low-rank adapters via PEFT; for DeepSeek V3 targets gate_proj, up_proj, down_proj modules
MoE Parallelism - Supports expert parallelism via MoeHybridParallelPlugin for efficient MoE model training
Pipeline Parallelism - Handles PP training loop with booster.execute_pipeline
Model Loading - Uses from_config initialization followed by booster.load_model for compatibility with LoRA and lazy initialization
DeepSeek V3 Compatibility - Special handling for DeepSeek V3 MoE inference method unwrapping
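To show why LoRA reduces the number of trainable parameters, the sketch below counts the parameters of the two low-rank factors for a single linear layer and computes the usual alpha / rank scaling factor. The 4096 x 4096 layer shape in the usage note is a hypothetical example, not a DeepSeek V3 dimension, and the function names are illustrative.

```python
def lora_trainable_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the LoRA factors A (rank x in_features) and B (out_features x rank)."""
    if rank <= 0:
        return 0  # mirrors the documented convention that rank 0 disables LoRA
    return rank * (in_features + out_features)


def lora_scaling(lora_alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / rank."""
    return lora_alpha / rank


def trainable_fraction(in_features: int, out_features: int, rank: int) -> float:
    """LoRA parameters relative to the frozen dense weight of the same layer."""
    return lora_trainable_params(in_features, out_features, rank) / (in_features * out_features)
```

For a hypothetical 4096 x 4096 projection with --lora_rank 16, the adapter adds 131,072 trainable parameters against 16,777,216 frozen ones, which is under 1% of the layer. With --lora_alpha 32, the update is scaled by 2.0.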

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
