Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Hpcaitech ColossalAI LoRA Finetune Script

From Leeroopedia


Knowledge Sources
Domains Fine-tuning, LoRA, Distributed Training, MoE
Last Updated 2026-02-09 00:00 GMT

Overview

lora_finetune.py is a training script for supervised fine-tuning (SFT) of large language models, including Mixture-of-Experts (MoE) models like DeepSeek V3/R1, with optional LoRA (Low-Rank Adaptation) support.

Description

This script implements a complete supervised fine-tuning pipeline using ColossalAI's Booster framework. It supports multiple parallelism strategies including DDP, Gemini, ZeRO-2, 3D parallelism (TP+PP+SP), and MoE hybrid parallelism. The script loads a pretrained causal language model, optionally applies LoRA adapters via PEFT, configures distributed training with gradient checkpointing, mixed precision (fp16/bf16), gradient accumulation, and cosine annealing learning rate scheduling. It handles both pipeline-parallel and non-pipeline-parallel training loops with TensorBoard logging.

Usage

Use this script for supervised fine-tuning of causal language models on conversational datasets. It is particularly suited for fine-tuning large MoE models like DeepSeek V3 with LoRA to reduce trainable parameters. Launch with torchrun or ColossalAI launcher for distributed execution.

Code Reference

Source Location

Signature

def all_reduce_mean(loss: torch.Tensor, plugin: Plugin) -> torch.Tensor

def train(args) -> None

Import

# This is a standalone training script, typically run directly:
# torchrun --nproc_per_node=<N> lora_finetune.py -m <model_path> -d <dataset_path>

I/O Contract

Inputs

Name Type Required Description
-m, --pretrained str Yes Path to the pretrained model
-d, --dataset str Yes Path to raw JSONL dataset for training
-p, --plugin str No Plugin choice: gemini, gemini_auto, zero2, zero2_cpu, 3d, ddp, moe (default: zero2)
--save_dir str No Checkpoint directory (default: checkpoint_dir)
--tensorboard_dir str No TensorBoard log directory
-n, --num_epochs int No Number of training epochs (default: 1)
--accumulation_steps int No Gradient accumulation steps (default: 1)
--batch_size int No Batch size per process (default: 2)
--lr float No Learning rate (default: 3e-4)
--max_length int No Model max sequence length (default: 8192)
--mixed_precision str No Mixed precision mode: fp16 or bf16 (default: bf16)
--grad_clip float No Gradient clipping value (default: 1.0)
--lora_rank int No LoRA rank; 0 disables LoRA (default: 0)
--lora_alpha int No LoRA alpha scaling (default: 8)
--tp int No Tensor parallelism size (default: 1)
--pp int No Pipeline parallelism size (default: 1)
--sp int No Sequence parallelism size (default: 1)
--ep int No Expert parallelism size for MoE (default: 1)
-g, --use_grad_checkpoint flag No Enable gradient checkpointing
-f, --use_flash_attn flag No Enable flash attention

Outputs

Name Type Description
checkpoint directory Model checkpoint saved to --save_dir/modeling or --save_dir/lora
tensorboard_logs directory Training loss, learning rate, and gradient norm logs
config_file JSON Training configuration saved to --config_file

Usage Examples

# Fine-tune with LoRA on 4 GPUs using ZeRO-2:
# torchrun --nproc_per_node=4 lora_finetune.py \
#     -m deepseek-ai/DeepSeek-V3 \
#     -d ./train_data.jsonl \
#     -p zero2 \
#     --lora_rank 16 \
#     --lora_alpha 32 \
#     --lr 3e-4 \
#     --max_length 4096 \
#     --mixed_precision bf16 \
#     -g -f

# Fine-tune MoE model with expert parallelism:
# torchrun --nproc_per_node=8 lora_finetune.py \
#     -m deepseek-ai/DeepSeek-V3 \
#     -d ./train_data.jsonl \
#     -p moe \
#     --ep 4 \
#     --tp 2

Key Features

  • LoRA Support - Applies low-rank adapters via PEFT; for DeepSeek V3 targets gate_proj, up_proj, down_proj modules
  • MoE Parallelism - Supports expert parallelism via MoeHybridParallelPlugin for efficient MoE model training
  • Pipeline Parallelism - Handles PP training loop with booster.execute_pipeline
  • Model Loading - Uses from_config initialization followed by booster.load_model for compatibility with LoRA and lazy initialization
  • DeepSeek V3 Compatibility - Special handling for DeepSeek V3 MoE inference method unwrapping

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment