Principle: Causal LM Model Initialization (LLMBook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
The process of loading a pre-trained causal language model architecture with its weights for further training or inference.
Description
Causal LM Model Initialization involves loading a Transformer-based decoder-only model (such as LLaMA) from a pre-trained checkpoint. The model is composed of key architectural components including RMSNorm, Rotary Position Embeddings (RoPE), multi-head self-attention, and SwiGLU feed-forward networks arranged in a Pre-Norm decoder layer pattern. Loading from a pre-trained checkpoint allows continued pre-training or fine-tuning from an established starting point.
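As a concrete sketch of this kind of initialization, the Hugging Face `transformers` API can be used as follows. The checkpoint path in the comment is hypothetical, and a tiny randomly initialized `LlamaConfig` model stands in for real pre-trained weights so the example stays self-contained; all dimensions are illustrative, not real LLaMA sizes:

```python
import torch
from transformers import AutoModelForCausalLM, LlamaConfig

# A real run would load pre-trained weights from a checkpoint directory
# (hypothetical path shown), e.g.:
#   model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint",
#                                                torch_dtype=torch.bfloat16)
# Here a tiny randomly initialized LLaMA-style model keeps the sketch runnable.
config = LlamaConfig(vocab_size=1000, hidden_size=64, intermediate_size=172,
                     num_hidden_layers=2, num_attention_heads=4,
                     num_key_value_heads=4, max_position_embeddings=128)
model = AutoModelForCausalLM.from_config(config)

input_ids = torch.randint(0, 1000, (1, 8))
logits = model(input_ids=input_ids).logits  # one logit vector per position
print(logits.shape)  # torch.Size([1, 8, 1000])
```

The resulting object exposes the full decoder stack (embedding, LLaMA decoder layers with RMSNorm/RoPE/SwiGLU, and the LM head), ready for continued pre-training or fine-tuning.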
Usage
Use this principle when starting pre-training from an existing checkpoint or when initializing a model for fine-tuning. On supported hardware (Ampere-or-newer NVIDIA GPUs with the flash-attn package installed), enable FlashAttention-2 for training efficiency.
Theoretical Basis
A causal language model consists of:
- Embedding layer: Maps token IDs to dense vectors.
- Decoder layers: Each layer applies Pre-Norm (RMSNorm), self-attention with causal masking, and a feed-forward network with residual connections.
- Output head: Projects hidden states to vocabulary logits.
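The three components above can be sketched in plain PyTorch. This simplified skeleton uses `nn.TransformerEncoderLayer` with `norm_first=True` for the Pre-Norm pattern, so it has LayerNorm, learned-free absolute positions, and a ReLU FFN rather than LLaMA's RMSNorm/RoPE/SwiGLU; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Minimal causal LM: embedding -> Pre-Norm decoder layers -> output head."""
    def __init__(self, vocab=256, d=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)  # token IDs -> dense vectors
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                       batch_first=True, norm_first=True)  # Pre-Norm
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d, vocab)      # hidden states -> vocabulary logits

    def forward(self, ids):
        x = self.embed(ids)
        # Causal mask: position i may only attend to positions <= i.
        seq_len = ids.size(1)
        mask = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        for layer in self.layers:
            x = layer(x, src_mask=mask)      # attention + FFN, each with residuals
        return self.head(x)

model = TinyCausalLM()
logits = model(torch.randint(0, 256, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 256])
```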
The LLaMA architecture specifically uses:
- RMSNorm instead of LayerNorm for normalization
- RoPE for position encoding (applied within attention)
- SwiGLU activation in the feed-forward network
- Pre-Norm pattern (normalize before attention/FFN, not after)
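Two of these LLaMA-specific components, RMSNorm and the SwiGLU feed-forward network, are compact enough to sketch directly (RoPE is applied inside attention and is omitted here; dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by reciprocal root-mean-square; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)

class SwiGLU(nn.Module):
    """SwiGLU FFN: SiLU-gated linear unit, as in LLaMA's feed-forward block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Pre-Norm order: normalize first, then transform, then add the residual.
x = torch.randn(2, 5, 64)
y = x + SwiGLU(64, 172)(RMSNorm(64)(x))
print(y.shape)  # torch.Size([2, 5, 64])
```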