Principle:LLMBook zh LLMBook zh github io LLaMA Decoder Layer
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture |
| Last Updated | 2026-02-08 04:29 GMT |
Overview
Single Transformer decoder block implementing Pre-Norm self-attention and feed-forward computation with RMSNorm and residual connections.
Description
The LLaMA Decoder Layer is the fundamental repeating unit in the LLaMA architecture. Each layer applies two sub-computations in sequence: (1) self-attention preceded by RMSNorm (Pre-Norm) with a residual connection, and (2) a feed-forward network (MLP) preceded by RMSNorm with a residual connection. The Pre-Norm design (normalizing before each sub-layer rather than after) improves training stability for deep networks. The LLaMA decoder layer uses RMSNorm instead of LayerNorm, and the MLP uses SwiGLU activation. Multiple decoder layers are stacked to form the complete LLaMA model.
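The two components named above, RMSNorm and the SwiGLU MLP, can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names, the `eps` value, the toy dimensions, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide each vector by its root-mean-square, then apply a learned gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: the SiLU-activated gate multiplies the up projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(4, d_model))

normed = rms_norm(x, np.ones(d_model))
out = swiglu_mlp(
    normed,
    rng.normal(size=(d_model, d_ff)),   # gate projection
    rng.normal(size=(d_model, d_ff)),   # up projection
    rng.normal(size=(d_ff, d_model)),   # down projection
)
```

With a gain of all ones, every row of `normed` has a mean square close to 1, and the MLP maps back to the model dimension so its output can be added to the residual.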
Usage
Use this principle to understand the internal structure of each Transformer block in LLaMA-family models. The decoder layer is the core building block, repeated N times to form the full model (e.g., 32 layers for LLaMA-7B, 80 layers for LLaMA-70B). Understanding this layer is essential for grasping how attention and feed-forward computation interact with normalization and residual connections.
Theoretical Basis
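Each decoder layer computes:
$$h = x + \mathrm{SelfAttention}(\mathrm{RMSNorm}(x))$$
$$y = h + \mathrm{MLP}(\mathrm{RMSNorm}(h))$$
where $x$ is the layer input, $h$ is the hidden state after the attention sub-layer, and $y$ is the layer output.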
This is the Pre-Norm Transformer pattern where normalization is applied before each sub-layer rather than after.
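The difference between the two orderings can be made concrete with a small sketch. The helper names `pre_norm_step` and `post_norm_step` are hypothetical, and RMSNorm (without a learned gain) is used in both variants purely to isolate the effect of ordering; the original Post-Norm Transformer used LayerNorm.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Gain-free RMSNorm, enough to show where normalization sits.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_step(x, sublayer):
    # Pre-Norm: only the sub-layer input is normalized;
    # the residual path carries x through unchanged.
    return x + sublayer(rms_norm(x))

def post_norm_step(x, sublayer):
    # Post-Norm: the sum of residual and sub-layer output is normalized,
    # so the residual path itself is rescaled at every layer.
    return rms_norm(x + sublayer(x))

def zero_sublayer(h):
    # A sub-layer that contributes nothing, to expose the residual path.
    return np.zeros_like(h)

x = np.array([[3.0, 4.0]])
pre = pre_norm_step(x, zero_sublayer)    # identical to x
post = post_norm_step(x, zero_sublayer)  # x rescaled to unit RMS
```

With a sub-layer that outputs zeros, the Pre-Norm step returns its input exactly, while the Post-Norm step still rescales it. That untouched identity path is one common explanation for why Pre-Norm tends to train more stably in deep stacks.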
Pseudo-code Logic:
# Abstract algorithm description (NOT real implementation)
# Sub-layer 1: Pre-Norm Self-Attention + Residual
residual = hidden_states
hidden_states = rms_norm_1(hidden_states)
hidden_states = self_attention(hidden_states, mask, position_ids)
hidden_states = residual + hidden_states
# Sub-layer 2: Pre-Norm FFN + Residual
residual = hidden_states
hidden_states = rms_norm_2(hidden_states)
hidden_states = mlp(hidden_states)
hidden_states = residual + hidden_states
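
The pseudo-code above can be turned into a small runnable sketch. The version below is a simplified illustration under several assumptions, not the actual LLaMA implementation: it uses a single attention head, omits rotary position embeddings (so `position_ids` is dropped), has no batch dimension or KV cache, and builds the causal mask internally. The parameter dictionary, `init_params`, and all sizes are hypothetical.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, p):
    """Single-head causal self-attention (assumption: no RoPE, no multi-head)."""
    seq_len, d = x.shape
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions so token i attends only to tokens <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ p["wo"]

def mlp(x, p):
    """SwiGLU feed-forward network."""
    gate = x @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ p["w_up"])) @ p["w_down"]

def decoder_layer(hidden_states, p):
    # Sub-layer 1: Pre-Norm self-attention + residual
    residual = hidden_states
    hidden_states = rms_norm(hidden_states, p["norm1"])
    hidden_states = self_attention(hidden_states, p)
    hidden_states = residual + hidden_states
    # Sub-layer 2: Pre-Norm FFN + residual
    residual = hidden_states
    hidden_states = rms_norm(hidden_states, p["norm2"])
    hidden_states = mlp(hidden_states, p)
    return residual + hidden_states

def init_params(d_model, d_ff, rng):
    """Random toy weights; real models load trained parameters instead."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff),
        "w_down": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, seq_len, n_layers = 8, 16, 5, 3  # toy sizes
x = rng.normal(size=(seq_len, d_model))

# One layer: input and output have the same shape.
params = init_params(d_model, d_ff, rng)
out = decoder_layer(x, params)

# Because the shape is preserved, layers stack by simple iteration,
# each with its own parameters.
stack = [init_params(d_model, d_ff, rng) for _ in range(n_layers)]
h = x
for layer_params in stack:
    h = decoder_layer(h, layer_params)
```

Since RMSNorm and the MLP act on each token independently and the attention is causally masked, changing the last token of `x` leaves the outputs at earlier positions unchanged, which is a quick sanity check for any decoder-layer implementation.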