Principle: LaurentMazare tch-rs Transformer Architecture
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
A decoder-only transformer for autoregressive language modeling, built from stacked pre-norm attention blocks that use RMSNorm and SwiGLU.
Description
The LLaMA transformer follows a decoder-only architecture with N identical blocks, each containing a causal self-attention layer with rotary position embeddings (RoPE) and a feed-forward network using SwiGLU activation. Pre-normalization with RMSNorm (instead of LayerNorm) is applied before each sub-layer. The model begins with a token embedding layer and ends with RMSNorm followed by a linear projection to vocabulary logits.
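The two non-standard sub-layer ingredients above, RMSNorm and SwiGLU, are both simple elementwise math. The sketch below shows them in plain Rust over `f32` slices; this is a hypothetical illustration, not the tch-rs API, where the real implementation operates on `tch::Tensor`.

```rust
// RMSNorm over the hidden dimension: scale by the root-mean-square,
// with no mean subtraction (unlike LayerNorm), then apply a learned gain.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

// SiLU (a.k.a. swish): x * sigmoid(x).
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

// SwiGLU combines a SiLU-gated branch with a linear "up" branch
// elementwise; in the FFN these come from two separate projections.
fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}
```

Note that RMSNorm drops the mean-centering and bias of LayerNorm, which saves work while normalizing the vector's magnitude.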
Usage
Use this architecture pattern for building autoregressive language models. The implementation provides 7B, 13B, and other model configurations.
Theoretical Basis
LLaMA Architecture:
Input tokens [batch, seq_len]
→ Token Embedding [batch, seq_len, n_embd]
→ N × Transformer Block:
x = x + Attention(RMSNorm(x), freqs_cis) [Pre-norm + Causal Self-Attention with RoPE]
x = x + MLP(RMSNorm(x)) [Pre-norm + SwiGLU FFN]
→ RMSNorm(x)
→ Linear projection → [batch, 1, vocab_size] [Last position only for generation]
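The `freqs_cis` passed into each attention layer in the diagram above is a precomputed table of rotation angles for RoPE. A minimal sketch of that precomputation, assuming an even `head_dim` and the usual LLaMA base of 10000.0 (the helper name is hypothetical, not the tch-rs API):

```rust
// Precompute per-position, per-frequency (cos, sin) pairs.
// theta_i = base^(-2i / head_dim); angle at position p is p * theta_i.
fn precompute_freqs(head_dim: usize, seq_len: usize, base: f32) -> Vec<Vec<(f32, f32)>> {
    let half = head_dim / 2;
    let thetas: Vec<f32> = (0..half)
        .map(|i| base.powf(-2.0 * i as f32 / head_dim as f32))
        .collect();
    (0..seq_len)
        .map(|pos| {
            thetas
                .iter()
                .map(|t| {
                    let angle = pos as f32 * t;
                    (angle.cos(), angle.sin())
                })
                .collect()
        })
        .collect()
}
```

At attention time, each query/key pair of channels `(x_{2i}, x_{2i+1})` is rotated by the `(cos, sin)` entry for its position, which is how RoPE injects relative position into the dot products.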
Config (7B): n_layer=32, n_head=32, n_embd=4096, vocab_size=32000
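The 7B hyperparameters above can be carried as a plain config struct; the field names follow the line above, and the `head_dim` helper is an illustrative assumption (the per-head width used by attention and RoPE):

```rust
// 7B configuration from the text; a sketch, not the tch-rs struct.
struct Config {
    n_layer: usize,
    n_head: usize,
    n_embd: usize,
    vocab_size: usize,
}

impl Config {
    fn llama_7b() -> Self {
        Config { n_layer: 32, n_head: 32, n_embd: 4096, vocab_size: 32000 }
    }

    // Hidden width divided evenly across attention heads.
    fn head_dim(&self) -> usize {
        self.n_embd / self.n_head
    }
}
```

Other model sizes (e.g. 13B) differ only in these four numbers, which is why the same block code serves every configuration.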