Principle: LaurentMazare tch-rs Transformer Architecture
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
A decoder-only transformer for autoregressive language modeling, built from stacked pre-norm attention blocks that use RMSNorm and SwiGLU.
Description
The LLaMA transformer follows a decoder-only architecture with N identical blocks, each containing a causal self-attention layer with rotary position embeddings (RoPE) and a feed-forward network using SwiGLU activation. Pre-normalization with RMSNorm (instead of LayerNorm) is applied before each sub-layer. The model begins with a token embedding layer and ends with RMSNorm followed by a linear projection to vocabulary logits.
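The two non-standard sub-layer ingredients above, RMSNorm and SwiGLU, are both simple elementwise math. The sketch below shows them in plain Rust over `f32` slices; this is a hypothetical illustration, not the tch-rs API, where the real implementation operates on `tch::Tensor`.

```rust
// RMSNorm over the hidden dimension: scale by the root-mean-square,
// with no mean subtraction (unlike LayerNorm), then apply a learned gain.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * scale * w).collect()
}

// SiLU (a.k.a. swish): x * sigmoid(x).
fn silu(v: f32) -> f32 {
    v / (1.0 + (-v).exp())
}

// SwiGLU combines a SiLU-gated branch with a linear "up" branch
// elementwise; in the FFN these come from two separate projections.
fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}
```

Note that RMSNorm drops the mean-centering and bias of LayerNorm, which saves work while normalizing the vector's magnitude.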
Usage
Use this architecture pattern for building autoregressive language models. The implementation provides 7B, 13B, and other model configurations.
Theoretical Basis
LLaMA Architecture:
Input tokens [batch, seq_len]
→ Token Embedding [batch, seq_len, n_embd]
→ N × Transformer Block:
x = x + Attention(RMSNorm(x), freqs_cis) [Pre-norm + Causal Self-Attention with RoPE]
x = x + MLP(RMSNorm(x)) [Pre-norm + SwiGLU FFN]
→ RMSNorm(x)
→ Linear projection → [batch, 1, vocab_size] [Last position only for generation]
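The `freqs_cis` passed into each attention layer in the diagram above is a precomputed table of rotation angles for RoPE. A minimal sketch of that precomputation, assuming an even `head_dim` and the usual LLaMA base of 10000.0 (the helper name is hypothetical, not the tch-rs API):

```rust
// Precompute per-position, per-frequency (cos, sin) pairs.
// theta_i = base^(-2i / head_dim); angle at position p is p * theta_i.
fn precompute_freqs(head_dim: usize, seq_len: usize, base: f32) -> Vec<Vec<(f32, f32)>> {
    let half = head_dim / 2;
    let thetas: Vec<f32> = (0..half)
        .map(|i| base.powf(-2.0 * i as f32 / head_dim as f32))
        .collect();
    (0..seq_len)
        .map(|pos| {
            thetas
                .iter()
                .map(|t| {
                    let angle = pos as f32 * t;
                    (angle.cos(), angle.sin())
                })
                .collect()
        })
        .collect()
}
```

At attention time, each query/key pair of channels `(x_{2i}, x_{2i+1})` is rotated by the `(cos, sin)` entry for its position, which is how RoPE injects relative position into the dot products.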
Config (7B): n_layer=32, n_head=32, n_embd=4096, vocab_size=32000
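The 7B hyperparameters above can be carried as a plain config struct; the field names follow the line above, and the `head_dim` helper is an illustrative assumption (the per-head width used by attention and RoPE):

```rust
// 7B configuration from the text; a sketch, not the tch-rs struct.
struct Config {
    n_layer: usize,
    n_head: usize,
    n_embd: usize,
    vocab_size: usize,
}

impl Config {
    fn llama_7b() -> Self {
        Config { n_layer: 32, n_head: 32, n_embd: 4096, vocab_size: 32000 }
    }

    // Hidden width divided evenly across attention heads.
    fn head_dim(&self) -> usize {
        self.n_embd / self.n_head
    }
}
```

Other model sizes (e.g. 13B) differ only in these four numbers, which is why the same block code serves every configuration.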