Principle:LLMBook zh LLMBook zh github io LLaMA Model Architecture
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture, NLP |
| Last Updated | 2026-02-08 04:29 GMT |
Overview
A decoder-only Transformer architecture that combines a token embedding layer, a stack of Pre-Norm decoder layers using RMSNorm, and a final normalization layer to produce contextual hidden representations.
Description
The LLaMA Model Architecture defines the full forward pass of a decoder-only Transformer. It consists of three main components: (1) a token embedding layer that converts input IDs to dense vectors, (2) a stack of identical decoder layers (each containing self-attention and feed-forward sub-layers with RMSNorm and residual connections), and (3) a final RMSNorm applied to the output hidden states. The architecture uses a causal attention mask so that each position attends only to itself and earlier positions, which is what allows autoregressive generation. LLaMA did not invent its main design choices but combined several existing ones that have since become standard: Pre-Norm (applying normalization before each sub-layer rather than after), RMSNorm instead of LayerNorm, RoPE for position encoding, and SwiGLU activations in the FFN.
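To make the RMSNorm versus LayerNorm distinction concrete, here is a minimal NumPy sketch. The function names, the `eps` value, and the toy input are illustrative assumptions, not taken from any particular LLaMA implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square over the last axis, then rescale.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def layer_norm(x, weight, bias, eps=1e-6):
    """Standard LayerNorm, shown only for comparison."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight + bias

# Toy input (assumed values): one vector with a clearly non-zero mean.
x = np.array([[1.0, 2.0, 3.0, 4.0]])
ones, zeros = np.ones(4), np.zeros(4)

y_rms = rms_norm(x, ones)          # mean square is ~1, mean stays non-zero
y_ln = layer_norm(x, ones, zeros)  # mean is ~0, variance is ~1
```

RMSNorm only rescales the vector, so it needs one reduction per vector instead of two and has no bias parameter.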
Usage
Use this principle to understand the overall structure of LLaMA-family models and how the individual components (RMSNorm, RoPE, decoder layers) compose into a complete model. This is the top-level architecture that orchestrates embedding, sequential layer processing, and final normalization.
Theoretical Basis
The LLaMA forward pass is:
$$h_0 = \text{Embed}(\text{input\_ids})$$

$$h_l = \text{DecoderLayer}_l(h_{l-1}, \text{causal\_mask}, \text{position\_ids}) \quad \text{for } l = 1, \ldots, N$$

$$\text{output} = \text{RMSNorm}(h_N)$$
Where each DecoderLayer applies Pre-Norm attention and Pre-Norm FFN with residual connections.
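Written out, one Pre-Norm decoder layer can be expressed as two residual updates. The intermediate symbol $\tilde{h}_l$ is introduced here only for illustration, and the argument lists of Attention and FFN are a simplification:

$$\tilde{h}_l = h_{l-1} + \text{Attention}(\text{RMSNorm}(h_{l-1}), \text{causal\_mask}, \text{position\_ids})$$

$$h_l = \tilde{h}_l + \text{FFN}(\text{RMSNorm}(\tilde{h}_l))$$

In both updates the normalization sits inside the residual branch, so the skip path carries $h_{l-1}$ through unnormalized. A Post-Norm layer would instead normalize the sum of the input and the sub-layer output.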
Pseudo-code Logic:
# Abstract algorithm description (NOT real implementation)
hidden = embed_tokens(input_ids)
seq_len = len(input_ids)
causal_mask = build_causal_mask(seq_len)
position_ids = range(seq_len)  # 0 .. seq_len-1 when no KV cache is used
for layer in decoder_layers:
    hidden = layer(hidden, causal_mask, position_ids)
output = rms_norm(hidden)
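The pseudo-code above can be turned into a small runnable sketch. The version below uses NumPy with toy dimensions and random weights. It is a sketch under stated assumptions, not the reference LLaMA code: the parameter names (`wq`, `w_gate`, and so on), the initialization, and the half-split RoPE pairing are assumptions, and there is no batching, KV cache, or grouped-query attention.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def build_causal_mask(seq_len):
    # 0 where attention is allowed (key <= query), -inf where it is blocked.
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

def apply_rope(x, position_ids, base=10000.0):
    # x: (seq_len, n_heads, head_dim). Rotates pairs (x[i], x[i + head_dim/2]);
    # this half-split pairing is one common convention, assumed here.
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = position_ids[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p, mask, position_ids, n_heads):
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads
    q = (x @ p["wq"]).reshape(seq_len, n_heads, head_dim)
    k = (x @ p["wk"]).reshape(seq_len, n_heads, head_dim)
    v = (x @ p["wv"]).reshape(seq_len, n_heads, head_dim)
    q, k = apply_rope(q, position_ids), apply_rope(k, position_ids)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim) + mask
    out = np.einsum("hqk,khd->qhd", softmax(scores), v)
    return out.reshape(seq_len, d_model) @ p["wo"]

def swiglu_ffn(x, p):
    gate = x @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ p["w_up"])) @ p["w_down"]

def decoder_layer(h, p, mask, position_ids, n_heads):
    # Pre-Norm: normalize the input of each sub-layer, then add the residual.
    h = h + attention(rms_norm(h, p["attn_norm"]), p, mask, position_ids, n_heads)
    return h + swiglu_ffn(rms_norm(h, p["ffn_norm"]), p)

def llama_forward(input_ids, params, n_heads):
    seq_len = len(input_ids)
    hidden = params["embed_tokens"][input_ids]       # (seq_len, d_model)
    causal_mask = build_causal_mask(seq_len)
    position_ids = np.arange(seq_len)
    for p in params["layers"]:
        hidden = decoder_layer(hidden, p, causal_mask, position_ids, n_heads)
    return rms_norm(hidden, params["final_norm"])    # final normalization

def init_params(vocab, d_model, d_ff, n_layers, seed=0):
    # Hypothetical random initialization, for illustration only.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, shape)
    def layer():
        return {
            "attn_norm": np.ones(d_model), "ffn_norm": np.ones(d_model),
            "wq": w(d_model, d_model), "wk": w(d_model, d_model),
            "wv": w(d_model, d_model), "wo": w(d_model, d_model),
            "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff),
            "w_down": w(d_ff, d_model),
        }
    return {"embed_tokens": w(vocab, d_model),
            "layers": [layer() for _ in range(n_layers)],
            "final_norm": np.ones(d_model)}

params = init_params(vocab=50, d_model=16, d_ff=32, n_layers=2)
input_ids = np.array([3, 7, 1, 9])
output = llama_forward(input_ids, params, n_heads=4)
print(output.shape)  # (4, 16)
```

One way to check that the causal mask works is to change only the last token and confirm that the outputs at earlier positions stay the same. Note that the function returns hidden states; producing next-token logits would require an additional output projection, which is outside the scope of this sketch.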