Principle:LLMBook zh LLMBook zh github io LLaMA Decoder Layer

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Architecture
Last Updated 2026-02-08 04:29 GMT

Overview

A single Transformer decoder block that applies Pre-Norm self-attention followed by a Pre-Norm feed-forward network, using RMSNorm for normalization and a residual connection around each sub-layer.

Description

The LLaMA Decoder Layer is the fundamental repeating unit in the LLaMA architecture. Each layer applies two sub-computations in sequence: (1) self-attention preceded by RMSNorm (Pre-Norm) with a residual connection, and (2) a feed-forward network (MLP) preceded by RMSNorm with a residual connection. The Pre-Norm design (normalizing before each sub-layer rather than after) improves training stability for deep networks. The LLaMA decoder layer uses RMSNorm instead of LayerNorm, and the MLP uses SwiGLU activation. Multiple decoder layers are stacked to form the complete LLaMA model.
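As a minimal illustration of the two components named above, the following NumPy sketch implements RMSNorm and a SwiGLU feed-forward network. The function names, the `eps` value, and the toy dimensions are assumptions made for this example and are not taken from any particular LLaMA codebase.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide by the root-mean-square over the last axis, then apply a learned gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # toy sizes, chosen only for illustration
x = rng.normal(size=(4, d_model))

normed = rms_norm(x, np.ones(d_model))
out = swiglu_mlp(
    normed,
    rng.normal(size=(d_model, d_ff)),
    rng.normal(size=(d_model, d_ff)),
    rng.normal(size=(d_ff, d_model)),
)
```

With a gain of all ones, each row of `normed` has a root-mean-square very close to 1, and the MLP maps back to the model dimension so its output can be added to the residual stream.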

Usage

Use this principle to understand the internal structure of each Transformer block in LLaMA-family models. The decoder layer is the core building block, repeated N times (e.g., 32 layers for LLaMA-7B, 80 layers for LLaMA 2 70B). Understanding this layer is essential for seeing how attention and feed-forward computation interact with normalization and residual connections.
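The stacking itself is a plain loop: each layer takes hidden states of a fixed shape and returns hidden states of the same shape. The sketch below uses a hypothetical toy layer (`make_toy_layer`) as a stand-in for a real decoder layer, only to show the repetition pattern.

```python
import numpy as np

def make_toy_layer(scale):
    """Hypothetical stand-in for a decoder layer: residual plus a small update."""
    def layer(h):
        return h + scale * np.tanh(h)
    return layer

def run_stack(h, layers):
    """Pass the hidden states through every layer in order."""
    for layer in layers:
        h = layer(h)
    return h

num_layers = 32  # the depth cited above for LLaMA-7B
layers = [make_toy_layer(0.01) for _ in range(num_layers)]

h_in = np.ones((4, 8))  # (sequence length, hidden size), toy values
h_out = run_stack(h_in, layers)
```

Because every layer preserves the hidden-state shape, the depth N can be changed without touching anything else in the stack.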

Theoretical Basis

Each decoder layer computes:

\[
\begin{aligned}
h' &= h + \mathrm{SelfAttn}(\mathrm{RMSNorm}(h)) \\
h_{\text{out}} &= h' + \mathrm{MLP}(\mathrm{RMSNorm}(h'))
\end{aligned}
\]

This is the Pre-Norm Transformer pattern where normalization is applied before each sub-layer rather than after.
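To make the ordering difference concrete, the sketch below contrasts a Pre-Norm block with a Post-Norm block. Both use the same gain-free RMSNorm here so that only the placement of the normalization differs; this is an assumption for illustration, since the original Post-Norm Transformer uses LayerNorm. The helper names are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain, to keep the comparison minimal."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_block(h, sublayer):
    # Normalize only the sub-layer input; the residual path carries h unchanged.
    return h + sublayer(rms_norm(h))

def post_norm_block(h, sublayer):
    # Normalize after the residual sum, so the residual path is rescaled too.
    return rms_norm(h + sublayer(h))

def zero_sublayer(x):
    """A sub-layer that contributes nothing, isolating the residual path."""
    return np.zeros_like(x)

h = np.array([[3.0, 4.0]])
pre = pre_norm_block(h, zero_sublayer)
post = post_norm_block(h, zero_sublayer)
```

With a sub-layer that outputs zeros, the Pre-Norm block returns `h` unchanged while the Post-Norm block rescales it. That unmodified identity path through the residual connection is one commonly cited reason Pre-Norm is easier to train at depth.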

Pseudo-code Logic:

# Abstract algorithm description (NOT real implementation)
# Sub-layer 1: Pre-Norm Self-Attention + Residual
residual = hidden_states
hidden_states = rms_norm_1(hidden_states)
hidden_states = self_attention(hidden_states, mask, position_ids)
hidden_states = residual + hidden_states

# Sub-layer 2: Pre-Norm FFN + Residual
residual = hidden_states
hidden_states = rms_norm_2(hidden_states)
hidden_states = mlp(hidden_states)
hidden_states = residual + hidden_states
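
The pseudo-code above can be turned into a small runnable NumPy sketch. It makes several simplifying assumptions that are not part of LLaMA itself: attention is single-head, rotary position embeddings and key/value caching are omitted, and the parameter names in the `params` dictionary are hypothetical. It is meant to show the data flow of one layer, not to reproduce a real implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def causal_self_attention(x, wq, wk, wv, wo):
    """Single-head causal attention (simplified: no RoPE, no multi-head split)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    seq_len = x.shape[0]
    # True above the diagonal marks future positions that must not be attended to.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return (probs @ v) @ wo

def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = x @ w_gate
    return ((gate / (1.0 + np.exp(-gate))) * (x @ w_up)) @ w_down

def decoder_layer(h, p):
    # Sub-layer 1: Pre-Norm self-attention + residual
    h = h + causal_self_attention(rms_norm(h, p["norm1"]),
                                  p["wq"], p["wk"], p["wv"], p["wo"])
    # Sub-layer 2: Pre-Norm feed-forward + residual
    return h + swiglu_mlp(rms_norm(h, p["norm2"]),
                          p["w_gate"], p["w_up"], p["w_down"])

def init_params(d_model, d_ff, rng):
    """Random toy weights; each RMSNorm has its own gain vector."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff),
        "w_down": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_ff=16, rng=rng)
hidden = rng.normal(size=(5, 8))  # (sequence length, hidden size)
output = decoder_layer(hidden, params)
```

The output has the same shape as the input, which is what allows layers to be stacked. Because the mask is causal and both RMSNorm and the MLP act on each position independently, changing the last token leaves the outputs at earlier positions unchanged.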
