Implementation:Hpcaitech ColossalAI LoRA Finetune Script
| Knowledge Sources | |
|---|---|
| Domains | Fine-tuning, LoRA, Distributed Training, MoE |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
lora_finetune.py is a training script for supervised fine-tuning (SFT) of large language models, including Mixture-of-Experts (MoE) models such as DeepSeek V3/R1, with optional LoRA (Low-Rank Adaptation) support.
Description
This script implements a complete supervised fine-tuning pipeline using ColossalAI's Booster framework. It supports multiple parallelism strategies, including DDP, Gemini, ZeRO-2, 3D parallelism (TP+PP+SP), and MoE hybrid parallelism. The script loads a pretrained causal language model, optionally applies LoRA adapters via PEFT, and configures distributed training with gradient checkpointing, mixed precision (fp16/bf16), gradient accumulation, and cosine annealing learning rate scheduling. It provides separate training loops for pipeline-parallel and non-pipeline-parallel runs, both with TensorBoard logging.
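The cosine annealing schedule mentioned above can be illustrated with a minimal sketch. It is not the scheduler class the script uses: the function name `cosine_lr` and the `warmup_steps` and `eta_min` parameters are hypothetical, and the optional linear warmup is an assumption. Only the default learning rate of 3e-4 comes from the inputs table.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-4,
              warmup_steps: int = 0, eta_min: float = 0.0) -> float:
    """Learning rate at `step` under cosine annealing (illustrative only)."""
    # Assumed linear warmup from base_lr / warmup_steps up to base_lr.
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Standard cosine annealing from base_lr down to eta_min.
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# The rate starts at base_lr, passes through the midpoint, and ends at eta_min.
for s in (0, 50, 100):
    print(s, cosine_lr(s, total_steps=100))
```

The schedule decays slowly at first, fastest around the midpoint, and flattens again near the end of training.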
Usage
Use this script for supervised fine-tuning of causal language models on conversational datasets. It is particularly suited to fine-tuning large MoE models such as DeepSeek V3 with LoRA, which reduces the number of trainable parameters. Launch it with torchrun or the ColossalAI launcher for distributed execution.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/examples/training_scripts/lora_finetune.py
- Lines: 1-455
Signature
def all_reduce_mean(loss: torch.Tensor, plugin: Plugin) -> torch.Tensor
def train(args) -> None
Import
# This is a standalone training script, typically run directly:
# torchrun --nproc_per_node=<N> lora_finetune.py -m <model_path> -d <dataset_path>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m, --pretrained | str | Yes | Path to the pretrained model |
| -d, --dataset | str | Yes | Path to raw JSONL dataset for training |
| -p, --plugin | str | No | Plugin choice: gemini, gemini_auto, zero2, zero2_cpu, 3d, ddp, moe (default: zero2) |
| --save_dir | str | No | Checkpoint directory (default: checkpoint_dir) |
| --tensorboard_dir | str | No | TensorBoard log directory |
| -n, --num_epochs | int | No | Number of training epochs (default: 1) |
| --accumulation_steps | int | No | Gradient accumulation steps (default: 1) |
| --batch_size | int | No | Batch size per process (default: 2) |
| --lr | float | No | Learning rate (default: 3e-4) |
| --max_length | int | No | Model max sequence length (default: 8192) |
| --mixed_precision | str | No | Mixed precision mode: fp16 or bf16 (default: bf16) |
| --grad_clip | float | No | Gradient clipping value (default: 1.0) |
| --lora_rank | int | No | LoRA rank; 0 disables LoRA (default: 0) |
| --lora_alpha | int | No | LoRA alpha scaling (default: 8) |
| --tp | int | No | Tensor parallelism size (default: 1) |
| --pp | int | No | Pipeline parallelism size (default: 1) |
| --sp | int | No | Sequence parallelism size (default: 1) |
| --ep | int | No | Expert parallelism size for MoE (default: 1) |
| -g, --use_grad_checkpoint | flag | No | Enable gradient checkpointing |
| -f, --use_flash_attn | flag | No | Enable flash attention |
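The inputs table above maps directly onto an `argparse` parser. The following is a sketch reconstructed from the table only, not a copy of the script's parser: help strings are paraphrased, the `None` default for `--tensorboard_dir` is an assumption (the table gives no default), and the `choices` restrictions are inferred from the listed values.

```python
import argparse

# Plugin names as listed in the inputs table.
PLUGINS = ["gemini", "gemini_auto", "zero2", "zero2_cpu", "3d", "ddp", "moe"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented inputs (illustrative sketch)."""
    p = argparse.ArgumentParser(description="SFT with optional LoRA")
    p.add_argument("-m", "--pretrained", type=str, required=True,
                   help="Path to the pretrained model")
    p.add_argument("-d", "--dataset", type=str, required=True,
                   help="Path to raw JSONL dataset for training")
    p.add_argument("-p", "--plugin", type=str, default="zero2", choices=PLUGINS)
    p.add_argument("--save_dir", type=str, default="checkpoint_dir")
    # Assumption: no default is documented, so None is used here.
    p.add_argument("--tensorboard_dir", type=str, default=None)
    p.add_argument("-n", "--num_epochs", type=int, default=1)
    p.add_argument("--accumulation_steps", type=int, default=1)
    p.add_argument("--batch_size", type=int, default=2)
    p.add_argument("--lr", type=float, default=3e-4)
    p.add_argument("--max_length", type=int, default=8192)
    p.add_argument("--mixed_precision", type=str, default="bf16",
                   choices=["fp16", "bf16"])
    p.add_argument("--grad_clip", type=float, default=1.0)
    p.add_argument("--lora_rank", type=int, default=0,
                   help="LoRA rank; 0 disables LoRA")
    p.add_argument("--lora_alpha", type=int, default=8)
    p.add_argument("--tp", type=int, default=1)
    p.add_argument("--pp", type=int, default=1)
    p.add_argument("--sp", type=int, default=1)
    p.add_argument("--ep", type=int, default=1)
    p.add_argument("-g", "--use_grad_checkpoint", action="store_true")
    p.add_argument("-f", "--use_flash_attn", action="store_true")
    return p


# Parse an explicit argument list so the example runs without a command line.
example = build_parser().parse_args(
    ["-m", "path/to/model", "-d", "train.jsonl", "--lora_rank", "16", "-g"]
)
print(example.plugin, example.lora_rank, example.use_grad_checkpoint)
```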
Outputs
| Name | Type | Description |
|---|---|---|
| checkpoint | directory | Checkpoint saved to --save_dir/modeling (full model) or --save_dir/lora (LoRA adapters) |
| tensorboard_logs | directory | Training loss, learning rate, and gradient norm logs |
| config_file | JSON | Training configuration saved to --config_file |
Usage Examples
# Fine-tune with LoRA on 4 GPUs using ZeRO-2:
# torchrun --nproc_per_node=4 lora_finetune.py \
# -m deepseek-ai/DeepSeek-V3 \
# -d ./train_data.jsonl \
# -p zero2 \
# --lora_rank 16 \
# --lora_alpha 32 \
# --lr 3e-4 \
# --max_length 4096 \
# --mixed_precision bf16 \
# -g -f
# Fine-tune MoE model with expert parallelism:
# torchrun --nproc_per_node=8 lora_finetune.py \
# -m deepseek-ai/DeepSeek-V3 \
# -d ./train_data.jsonl \
# -p moe \
# --ep 4 \
# --tp 2
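When choosing --batch_size and --accumulation_steps for launches like the two above, it can help to estimate how many samples contribute to each optimizer step. The helper below is a hypothetical sketch, not part of the script. It assumes the data-parallel size equals the world size divided by tp * pp * sp, and that expert parallelism (--ep) does not reduce the number of data shards; the actual plugin may derive its groups differently, so treat the result as an estimate.

```python
def samples_per_optimizer_step(world_size: int, batch_size: int = 2,
                               accumulation_steps: int = 1,
                               tp: int = 1, pp: int = 1, sp: int = 1) -> int:
    """Estimate samples per optimizer step (assumed group layout, see lead-in)."""
    model_parallel = tp * pp * sp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * sp")
    # Assumption: ranks not used for model parallelism form the data-parallel group.
    dp_size = world_size // model_parallel
    return dp_size * batch_size * accumulation_steps


# First example above: 4 GPUs, ZeRO-2, default batch size and accumulation.
print(samples_per_optimizer_step(world_size=4))
# Second example above: 8 GPUs with --tp 2 (ep assumed not to shrink the data shards).
print(samples_per_optimizer_step(world_size=8, tp=2))
```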
Key Features
- LoRA Support - Applies low-rank adapters via PEFT; for DeepSeek V3, the adapters target the gate_proj, up_proj, and down_proj modules
- MoE Parallelism - Supports expert parallelism via MoeHybridParallelPlugin for efficient MoE model training
- Pipeline Parallelism - Handles PP training loop with booster.execute_pipeline
- Model Loading - Uses from_config initialization followed by booster.load_model for compatibility with LoRA and lazy initialization
- DeepSeek V3 Compatibility - Special handling for DeepSeek V3 MoE inference method unwrapping
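To make the LoRA parameter savings concrete, the sketch below counts the trainable parameters that a rank-r adapter adds to one linear layer, using the standard LoRA formulation (an r x d_in matrix and a d_out x r matrix, scaled by alpha / r). The layer dimensions are hypothetical and are not taken from DeepSeek V3; the helper names are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


# Hypothetical projection layer: 4096 -> 11008, with --lora_rank 16 --lora_alpha 32.
d_in, d_out, rank, alpha = 4096, 11008, 16, 32
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank)
print(f"full weight: {full}, adapter: {adapter}, ratio: {adapter / full:.4%}")
print(f"scaling: {lora_scaling(alpha, rank)}")
```

For this hypothetical layer the adapter holds well under 1% of the parameters of the frozen weight, which is the reduction the LoRA option is meant to provide.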