
Principle:Allenai Open instruct Causal LM Loading

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Deep Learning, Natural Language Processing
Last Updated 2026-02-07 00:00 GMT

Overview

Causal LM loading is the process of initializing a pre-trained autoregressive language model with optional quantization and parameter-efficient fine-tuning adapters for training.

Description

Before fine-tuning can begin, the pre-trained model weights must be loaded into memory. This principle covers the considerations and techniques involved in loading causal language models for supervised fine-tuning (SFT):

Standard loading uses AutoModelForCausalLM.from_pretrained() from HuggingFace Transformers. The model is loaded in bfloat16 precision to reduce memory usage while maintaining training stability. The specific model revision (commit hash, branch, or tag) can be pinned for reproducibility.
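A minimal sketch of standard loading; the model name and revision below are placeholders, not values prescribed by this page:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical base model; any causal LM checkpoint works here.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    revision="main",              # pin a commit hash, branch, or tag for reproducibility
    torch_dtype=torch.bfloat16,   # halves memory vs. fp32 while staying training-stable
)
```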

Flash Attention is a memory-efficient attention implementation that reduces the quadratic memory cost of self-attention to linear. When use_flash_attn=True, the model is loaded with attn_implementation="flash_attention_2", which requires the flash-attn package. An alternative is SDPA (Scaled Dot Product Attention), PyTorch's native efficient attention.
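The choice between the two implementations can be made at load time via the `attn_implementation` argument. A small illustrative helper (the function name is ours, not from the source) that falls back to SDPA when flash-attn is not installed:

```python
import importlib.util

def pick_attn_implementation(use_flash_attn: bool) -> str:
    # Prefer FlashAttention-2 when requested and the flash-attn package is
    # importable; otherwise fall back to PyTorch's native SDPA kernel.
    if use_flash_attn and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# Usage: AutoModelForCausalLM.from_pretrained(name, attn_implementation=pick_attn_implementation(True))
```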

LoRA (Low-Rank Adaptation) reduces the number of trainable parameters by injecting low-rank decomposition matrices into the model's attention layers. Instead of fine-tuning all parameters, only the LoRA matrices (rank r, scaling factor alpha) are trained, dramatically reducing memory requirements.
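In practice this is typically configured through the PEFT library; a sketch with common (assumed, not source-mandated) settings and Llama-style attention module names:

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration; target_modules vary by architecture.
lora_config = LoraConfig(
    r=16,                 # rank of the low-rank decomposition
    lora_alpha=16,        # scaling factor alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wraps an already-loaded base model
```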

QLoRA (Quantized LoRA) combines LoRA with 4-bit quantization of the base model. The base model is loaded in NF4 (NormalFloat4) format using bitsandbytes, reducing the memory footprint by approximately 4x. Only the LoRA adapter weights are trained in full precision.
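The 4-bit loading is driven by a quantization config passed to `from_pretrained`; a sketch with common QLoRA defaults (the specific values are an assumption):

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 quantization of the frozen base model via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16ated if False else torch.bfloat16,
)
# model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb_config)
```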

Embedding resizing ensures the model's embedding layer can handle any new tokens added to the tokenizer (e.g., special chat tokens). The embedding size is padded to a multiple of 8 for tensor core efficiency on GPUs.
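A sketch of the resizing step, assuming `model` and `tokenizer` are already loaded (the added token here is only an example):

```python
# Add any new special tokens, then grow the embedding matrix to match;
# padding the vocabulary to a multiple of 8 helps tensor-core utilization.
num_added = tokenizer.add_special_tokens({"pad_token": "<pad>"})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
```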

Usage

Use this when initializing a model for fine-tuning. Choose the loading strategy based on available hardware:

  • Full fine-tuning: When GPU memory is sufficient for the full model (standard loading)
  • LoRA: When memory is limited but full-precision base model fits
  • QLoRA: When even the full-precision model does not fit in GPU memory
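The decision above can be sketched as a simple weights-only heuristic. The thresholds below are illustrative assumptions (real headroom depends on optimizer state, activations, and sequence length), not guidance from the source:

```python
def choose_loading_strategy(model_params_b: float, gpu_mem_gb: float) -> str:
    """Pick a loading strategy from model size (billions of params) and GPU memory.

    Weights-only estimate: ~2 bytes/param in bf16, ~0.5 bytes/param in NF4.
    Multipliers are rough, assumed headroom factors."""
    bf16_gb = model_params_b * 2.0
    nf4_gb = model_params_b * 0.5
    if gpu_mem_gb > bf16_gb * 4:      # room for gradients + optimizer states too
        return "full"
    if gpu_mem_gb > bf16_gb * 1.2:    # frozen bf16 base fits, train adapters
        return "lora"
    if gpu_mem_gb > nf4_gb * 1.2:     # only the quantized base fits
        return "qlora"
    return "does-not-fit"
```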

Theoretical Basis

LoRA decomposition: For a weight matrix W of dimension d x d, LoRA represents the update as:

W' = W + (alpha / r) * B * A

where:
  W: original frozen weight matrix (d x d)
  B: low-rank matrix (d x r), initialized to zero
  A: low-rank matrix (r x d), initialized randomly (Gaussian)
  r: rank (typically 16-64, much smaller than d)
  alpha: scaling factor (typically 16-32)

Only A and B are trained, reducing the trainable parameter count from d^2 to 2*d*r. Because B starts at zero, W' = W at initialization, so fine-tuning begins exactly from the pre-trained weights.
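The decomposition can be checked numerically. A minimal NumPy sketch with B (d x r) zero-initialized and A (r x d) randomly initialized, using illustrative sizes:

```python
import numpy as np

d, r, alpha = 512, 16, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))           # frozen base weight
B = np.zeros((d, r))                      # zero init -> update starts at zero
A = 0.01 * rng.standard_normal((r, d))    # random (Gaussian) init

W_eff = W + (alpha / r) * (B @ A)         # effective weight W'
assert np.allclose(W_eff, W)              # no change at initialization

full_params = d * d                       # 262144 parameters to fine-tune fully
lora_params = 2 * d * r                   # 16384 trainable LoRA parameters (16x fewer)
```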

QLoRA quantization: The base model weights are quantized to NF4:

W_quantized = quantize_nf4(W)            # 4 bits per parameter, stored
W_dequantized = dequantize(W_quantized)  # on the fly during the forward pass
# gradients are computed w.r.t. the LoRA parameters only; W stays frozen

Double quantization further reduces memory by quantizing the quantization constants themselves.

Memory comparison:

Full fine-tuning (bf16):       ~2 bytes/param * N_params
LoRA (bf16 base):              ~2 bytes/param * N_params (frozen) + ~2 bytes/param * 2*d*r*n_layers (trainable)
QLoRA (NF4 base + bf16 LoRA):  ~0.5 bytes/param * N_params (frozen) + ~2 bytes/param * 2*d*r*n_layers (trainable)
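These estimates can be turned into a small calculator. The function below covers weight memory only (optimizer state and activations excluded) and uses an assumed 7B-scale configuration in the comments:

```python
def finetune_memory_gb(n_params_billion: float, d: int, r: int,
                       n_layers: int, strategy: str) -> float:
    """Weights-only memory footprint in GB (1 GB = 1e9 bytes).

    Illustrative estimate following the comparison above; optimizer state,
    gradients, and activations are not included."""
    lora_gb = 2 * (2 * d * r * n_layers) / 1e9  # 2 bytes (bf16) per LoRA param
    if strategy == "full":
        return 2.0 * n_params_billion           # ~2 bytes/param in bf16
    if strategy == "lora":
        return 2.0 * n_params_billion + lora_gb
    if strategy == "qlora":
        return 0.5 * n_params_billion + lora_gb  # ~0.5 bytes/param in NF4
    raise ValueError(f"unknown strategy: {strategy}")

# For a hypothetical 7B model with d=4096, r=16, 32 adapted layers:
# full ~= 14 GB, LoRA ~= 14.01 GB, QLoRA ~= 3.51 GB of weight memory.
```

Note that LoRA's savings come from gradients and optimizer state (not shown here), since the frozen bf16 base still occupies its full footprint; QLoRA is what shrinks the base weights themselves.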

Related Pages

Implemented By
