Principle: Causal LM Model Initialization (LLMBook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The process of loading a pre-trained causal language model architecture with its weights for further training or inference.
Description
Causal LM Model Initialization involves loading a Transformer-based decoder-only model (such as LLaMA) from a pre-trained checkpoint. The model is composed of key architectural components including RMSNorm, Rotary Position Embeddings (RoPE), multi-head self-attention, and SwiGLU feed-forward networks arranged in a Pre-Norm decoder layer pattern. Loading from a pre-trained checkpoint allows continued pre-training or fine-tuning from an established starting point.
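As a concrete sketch of this kind of initialization, the Hugging Face `transformers` API can be used as follows. The checkpoint path in the comment is hypothetical, and a tiny randomly initialized `LlamaConfig` model stands in for real pre-trained weights so the example stays self-contained; all dimensions are illustrative, not real LLaMA sizes:

```python
import torch
from transformers import AutoModelForCausalLM, LlamaConfig

# A real run would load pre-trained weights from a checkpoint directory
# (hypothetical path shown), e.g.:
#   model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint",
#                                                torch_dtype=torch.bfloat16)
# Here a tiny randomly initialized LLaMA-style model keeps the sketch runnable.
config = LlamaConfig(vocab_size=1000, hidden_size=64, intermediate_size=172,
                     num_hidden_layers=2, num_attention_heads=4,
                     num_key_value_heads=4, max_position_embeddings=128)
model = AutoModelForCausalLM.from_config(config)

input_ids = torch.randint(0, 1000, (1, 8))
logits = model(input_ids=input_ids).logits  # one logit vector per position
print(logits.shape)  # torch.Size([1, 8, 1000])
```

The resulting object exposes the full decoder stack (embedding, LLaMA decoder layers with RMSNorm/RoPE/SwiGLU, and the LM head), ready for continued pre-training or fine-tuning.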
Usage
Use this principle when starting pre-training from an existing checkpoint or when initializing a model for fine-tuning. On supported hardware (Ampere-or-newer NVIDIA GPUs with the flash-attn package installed), enable FlashAttention-2 for training efficiency.
Theoretical Basis
A causal language model consists of:
- Embedding layer: Maps token IDs to dense vectors.
- Decoder layers: Each layer applies Pre-Norm (RMSNorm), self-attention with causal masking, and a feed-forward network with residual connections.
- Output head: Projects hidden states to vocabulary logits.
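The three components above can be sketched in plain PyTorch. This simplified skeleton uses `nn.TransformerEncoderLayer` with `norm_first=True` for the Pre-Norm pattern, so it has LayerNorm, learned-free absolute positions, and a ReLU FFN rather than LLaMA's RMSNorm/RoPE/SwiGLU; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Minimal causal LM: embedding -> Pre-Norm decoder layers -> output head."""
    def __init__(self, vocab=256, d=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)  # token IDs -> dense vectors
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                       batch_first=True, norm_first=True)  # Pre-Norm
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d, vocab)      # hidden states -> vocabulary logits

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask: position i may only attend to positions <= i.
        seq_len = ids.size(1)
        mask = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        for layer in self.layers:
            x = layer(x, src_mask=mask)      # attention + FFN, each with residuals
        return self.head(x)

model = TinyCausalLM()
logits = model(torch.randint(0, 256, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 256])
```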
The LLaMA architecture specifically uses:
- RMSNorm instead of LayerNorm for normalization
- RoPE for position encoding (applied within attention)
- SwiGLU activation in the feed-forward network
- Pre-Norm pattern (normalize before attention/FFN, not after)
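Two of these LLaMA-specific components, RMSNorm and the SwiGLU feed-forward network, are compact enough to sketch directly (RoPE is applied inside attention and is omitted here; dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by reciprocal root-mean-square; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)

class SwiGLU(nn.Module):
    """SwiGLU FFN: SiLU-gated linear unit, as in LLaMA's feed-forward block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Pre-Norm order: normalize first, then transform, then add the residual.
x = torch.randn(2, 5, 64)
y = x + SwiGLU(64, 172)(RMSNorm(64)(x))
print(y.shape)  # torch.Size([2, 5, 64])
```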