Principle:LLMBook zh LLMBook zh github io LLaMA Model Architecture

From Leeroopedia
Revision as of 17:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/LLMBook_zh_LLMBook_zh_github_io_LLaMA_Model_Architecture.md)


Knowledge Sources
Domains Deep_Learning, Model_Architecture, NLP
Last Updated 2026-02-08 04:29 GMT

Overview

Decoder-only Transformer architecture combining token embeddings, stacked decoder layers with Pre-Norm RMSNorm, and a final normalization layer to produce contextual hidden representations.

Description

The LLaMA Model Architecture defines the full forward pass of a decoder-only Transformer. It consists of three main components: (1) a token embedding layer that converts input IDs to dense vectors, (2) a stack of N structurally identical decoder layers, each with its own weights and each containing self-attention and feed-forward sub-layers with RMSNorm and residual connections, and (3) a final RMSNorm applied to the output hidden states. A causal attention mask restricts each position to attending only to itself and earlier positions, which is what allows autoregressive generation. LLaMA adopted several existing design choices that have since become standard: Pre-Norm (applying normalization before each sub-layer rather than after), RMSNorm instead of LayerNorm, RoPE for position encoding, and SwiGLU activations in the FFN.
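To make the "RMSNorm instead of LayerNorm" choice concrete, the following is a minimal pure-Python sketch of both normalizations applied to a single vector. It is an illustration under stated assumptions, not the LLaMA implementation: the learned gain (and LayerNorm's bias) is omitted, and the `eps` value of 1e-6 is an assumed placeholder.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned gain/bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: divide by the root mean square only."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)  # re-centered: the entries sum to (approximately) zero
rn = rms_norm(x)    # only rescaled: the signs and ratios of the entries are preserved
```

RMSNorm skips the mean subtraction, so it needs one reduction per vector rather than two; the output is simply the input rescaled to unit root mean square.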

Usage

Use this principle to understand the overall structure of LLaMA-family models and how the individual components (RMSNorm, RoPE, decoder layers) compose into a complete model. It describes the top-level architecture, which orchestrates embedding, sequential layer processing, and final normalization.

Theoretical Basis

The LLaMA forward pass is:

\[
\begin{aligned}
h_0 &= \text{Embed}(\text{input\_ids}) \\
h_l &= \text{DecoderLayer}_l(h_{l-1}, \text{causal\_mask}, \text{position\_ids}) \quad \text{for } l = 1, \ldots, N \\
\text{output} &= \text{RMSNorm}(h_N)
\end{aligned}
\]

Where each DecoderLayer applies Pre-Norm attention and Pre-Norm FFN with residual connections.
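The Pre-Norm structure of a single layer can be written out as two residual updates. The intermediate symbol \tilde{h}_l is notation introduced here for illustration and does not appear in the source; Attention and FFN stand for the self-attention and feed-forward sub-layers, and each of the two RMSNorm applications is assumed to carry its own learned gain.

```latex
\[
\begin{aligned}
\tilde{h}_l &= h_{l-1} + \text{Attention}\big(\text{RMSNorm}(h_{l-1}),\ \text{causal\_mask},\ \text{position\_ids}\big) \\
h_l &= \tilde{h}_l + \text{FFN}\big(\text{RMSNorm}(\tilde{h}_l)\big)
\end{aligned}
\]
```

Because normalization is applied only on the branch that feeds each sub-layer, the residual path from h_{l-1} to h_l is left unnormalized, which is why a final RMSNorm is applied to h_N after the last layer.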

Pseudo-code Logic:

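# Abstract algorithm description (NOT real implementation)
seq_len = length(input_ids)
position_ids = [0, 1, ..., seq_len - 1]     # token positions, consumed by RoPE inside each layer
hidden = embed_tokens(input_ids)            # h_0
causal_mask = build_causal_mask(seq_len)    # position i may attend only to positions j <= i
for layer in decoder_layers:                # N decoder layers, applied in order
    hidden = layer(hidden, causal_mask, position_ids)
output = rms_norm(hidden)                   # final RMSNorm applied to h_N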
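The control flow above can be exercised end to end with a toy, weight-free sketch in pure Python. Everything here is a simplifying assumption made for illustration: `toy_attention` replaces real self-attention with a uniform average over the visible positions, `toy_ffn` replaces the SwiGLU FFN with an elementwise `tanh`, RoPE is omitted (so `position_ids` is passed through but unused), and the embedding table values are made up.

```python
import math

def rms_norm(rows, eps=1e-6):
    """Apply RMSNorm (gain fixed at 1) to each row of a seq_len x d matrix."""
    out = []
    for row in rows:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms for v in row])
    return out

def build_causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def toy_attention(rows, causal_mask):
    """Stand-in for self-attention: uniform average over the visible positions."""
    out = []
    for allowed in causal_mask:
        visible = [rows[j] for j, ok in enumerate(allowed) if ok]
        out.append([sum(col) / len(visible) for col in zip(*visible)])
    return out

def toy_ffn(rows):
    """Stand-in for the SwiGLU FFN: an elementwise nonlinearity."""
    return [[math.tanh(v) for v in row] for row in rows]

def add(a, b):
    """Elementwise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def decoder_layer(hidden, causal_mask, position_ids):
    """Pre-Norm layer: normalize, apply the sub-layer, add the residual.

    position_ids mirrors the signature in the pseudo-code; this toy ignores it,
    whereas a real layer would use it for RoPE.
    """
    hidden = add(hidden, toy_attention(rms_norm(hidden), causal_mask))
    hidden = add(hidden, toy_ffn(rms_norm(hidden)))
    return hidden

def forward(input_ids, embedding_table, num_layers):
    seq_len = len(input_ids)
    position_ids = list(range(seq_len))
    hidden = [list(embedding_table[t]) for t in input_ids]  # h_0
    causal_mask = build_causal_mask(seq_len)
    for _ in range(num_layers):  # the toy layers share one weight-free function
        hidden = decoder_layer(hidden, causal_mask, position_ids)
    return rms_norm(hidden)  # final RMSNorm over h_N

# Made-up 3-token vocabulary with 4-dimensional embeddings.
embedding_table = [
    [0.1, -0.2, 0.3, 0.4],
    [0.5, 0.1, -0.3, 0.2],
    [-0.4, 0.6, 0.2, -0.1],
]
out_a = forward([0, 1, 2], embedding_table, num_layers=2)
out_b = forward([0, 1, 0], embedding_table, num_layers=2)  # only the last token differs
```

Comparing `out_a` and `out_b` shows the effect of the causal mask: the two inputs differ only in the last token, so the outputs for the first two positions are identical and only the last position changes.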
