Principle:Hpcaitech ColossalAI SFT Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A model initialization pattern that loads pretrained causal language models with optional LoRA adapter injection for parameter-efficient supervised fine-tuning.
Description
SFT Model Loading combines HuggingFace's AutoModelForCausalLM with ColossalAI's lazy initialization and optional LoRA (Low-Rank Adaptation) injection. The process loads pretrained weights in a memory-efficient manner using LazyInitContext to defer actual tensor allocation until the model is placed on the correct device by the Booster.
When LoRA is enabled, the model's linear layers are augmented with low-rank decomposition matrices (A and B), enabling fine-tuning with a fraction of the total parameters while keeping the base model frozen.
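The low-rank update described above can be sketched in NumPy. This is an illustrative model of the math only, not ColossalAI's injection code: the names `W`, `A`, `B`, `scaling`, and `lora_forward` are assumptions made for this sketch.

```python
import numpy as np

# Minimal sketch of a LoRA-augmented linear layer.
d_in, d_out, r, alpha = 64, 32, 4, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
scaling = alpha / r

def lora_forward(x):
    # y = x W^T + scaling * (x A^T) B^T: base path frozen, LoRA path trainable
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
# Because B starts at zero, the adapted layer initially matches the base layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Zero-initializing `B` is the conventional choice: it guarantees the adapted model starts out functionally identical to the pretrained one, so fine-tuning departs from the base behavior gradually.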
Usage
Use this principle when loading a pretrained LLM for supervised fine-tuning. Choose LoRA when GPU memory is limited or when you want to preserve the base model's capabilities while learning task-specific behavior.
Theoretical Basis
The loading process follows these steps:
- Lazy Initialization: Model architecture is instantiated with meta tensors (no memory allocated)
- Weight Loading: Pretrained weights are loaded from disk or HuggingFace Hub
- LoRA Injection (Optional): For each target linear layer with weight W, inject trainable low-rank matrices A and B so the effective weight becomes W' = W + (alpha / r) * B A, where r is the rank and alpha is the scaling factor
- Tokenizer Setup: Load tokenizer and configure special tokens (pad_token, etc.)
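The first three steps above can be illustrated with plain PyTorch, using the meta device (available in PyTorch 2.x) as a stand-in for ColossalAI's LazyInitContext. This is a sketch under that substitution, not ColossalAI's actual loading code; the `LoRALinear` class is a hypothetical name for this example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep base model frozen
        self.lora_A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# 1. Lazy initialization: build the architecture on the meta device,
#    so no tensor storage is actually allocated yet.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
assert model[0].weight.is_meta

# 2. Weight loading: materialize real tensors on the target device
#    (in practice this is where pretrained checkpoint weights land).
model = model.to_empty(device="cpu")
with torch.no_grad():
    for layer in model:
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)

# 3. LoRA injection: replace each target linear layer with a wrapped version.
for i, layer in enumerate(model):
    model[i] = LoRALinear(layer)

# 4. Tokenizer setup would follow, e.g. AutoTokenizer.from_pretrained(...)
#    plus pad_token configuration (omitted here to keep the sketch offline).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.3f}")
```

The trainable fraction printed at the end makes the parameter-efficiency claim concrete: only the A and B matrices carry gradients, a few percent of the total parameter count at these sizes.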