Principle:Hpcaitech ColossalAI SFT Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A model initialization pattern that loads pretrained causal language models with optional LoRA adapter injection for parameter-efficient supervised fine-tuning.
Description
SFT Model Loading combines HuggingFace's AutoModelForCausalLM with ColossalAI's lazy initialization and optional LoRA (Low-Rank Adaptation) injection. The process loads pretrained weights in a memory-efficient manner using LazyInitContext to defer actual tensor allocation until the model is placed on the correct device by the Booster.
When LoRA is enabled, the model's linear layers are augmented with low-rank decomposition matrices (A and B), enabling fine-tuning with a fraction of the total parameters while keeping the base model frozen.
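The low-rank update described above can be sketched in NumPy. This is an illustrative model of the math only, not ColossalAI's injection code: the names `W`, `A`, `B`, `scaling`, and `lora_forward` are assumptions made for this sketch.

```python
import numpy as np

# Minimal sketch of a LoRA-augmented linear layer.
d_in, d_out, r, alpha = 64, 32, 4, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init
scaling = alpha / r

def lora_forward(x):
    # y = x W^T + scaling * (x A^T) B^T: base path frozen, LoRA path trainable
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
# Because B starts at zero, the adapted layer initially matches the base layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Zero-initializing `B` is the conventional choice: it guarantees the adapted model starts out functionally identical to the pretrained one, so fine-tuning departs from the base behavior gradually.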
Usage
Use this principle when loading a pretrained LLM for supervised fine-tuning. Choose LoRA when GPU memory is limited or when you want to preserve the base model's capabilities while learning task-specific behavior.
Theoretical Basis
The loading process follows these steps:
- Lazy Initialization: Model architecture is instantiated with meta tensors (no memory allocated)
- Weight Loading: Pretrained weights are loaded from disk or HuggingFace Hub
- LoRA Injection (Optional): For each target linear layer with weight W, inject trainable low-rank matrices A and B so the effective weight becomes W' = W + (alpha / r) * B A, where r is the rank and alpha is the scaling factor
- Tokenizer Setup: Load tokenizer and configure special tokens (pad_token, etc.)
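The first three steps above can be illustrated with plain PyTorch, using the meta device (available in PyTorch 2.x) as a stand-in for ColossalAI's LazyInitContext. This is a sketch under that substitution, not ColossalAI's actual loading code; the `LoRALinear` class is a hypothetical name for this example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with trainable low-rank matrices A and B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep base model frozen
        self.lora_A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# 1. Lazy initialization: build the architecture on the meta device,
#    so no tensor storage is actually allocated yet.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
assert model[0].weight.is_meta

# 2. Weight loading: materialize real tensors on the target device
#    (in practice this is where pretrained checkpoint weights land).
model = model.to_empty(device="cpu")
with torch.no_grad():
    for layer in model:
        nn.init.xavier_uniform_(layer.weight)
        nn.init.zeros_(layer.bias)

# 3. LoRA injection: replace each target linear layer with a wrapped version.
for i, layer in enumerate(model):
    model[i] = LoRALinear(layer)

# 4. Tokenizer setup would follow, e.g. AutoTokenizer.from_pretrained(...)
#    plus pad_token configuration (omitted here to keep the sketch offline).
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.3f}")
```

The trainable fraction printed at the end makes the parameter-efficiency claim concrete: only the A and B matrices carry gradients, a few percent of the total parameter count at these sizes.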