Principle:Tensorflow Tfjs Transformer Backbone Construction
Summary
Transformer backbone construction involves building the core architecture of a GPT-2 language model: token embeddings, positional embeddings, N transformer decoder blocks, and final layer normalization. Although the name references TensorFlow.js, the underlying concept is library-agnostic: constructing a decoder-only transformer backbone that maps token sequences to contextualized hidden representations.
Theory
The GPT-2 backbone is a decoder-only transformer architecture. It transforms a sequence of token IDs into contextualized hidden-state vectors that encode both the token identity and its relationship to surrounding context.
The architecture consists of four major components:
- Token Embedding: Maps vocabulary IDs to dense vectors of dimension dmodel.
- Position Embedding: Adds learnable positional information to each token, enabling the model to understand sequence order.
- N Transformer Decoder Blocks: A stack of identical layers, each containing self-attention with causal masking and a feed-forward network.
- Final Layer Normalization: Normalizes the output of the last decoder block.
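The first two components amount to a table lookup plus an element-wise add. A minimal sketch in plain JavaScript (hypothetical toy dimensions and made-up values, not trained weights or the real tf.layers API):

```javascript
// Toy setup: vocabulary of 4 tokens, dmodel = 2, max 3 positions.
// Both tables hold illustrative values, not trained weights.
const tokenTable = [
  [0.1, 0.2], // token 0
  [0.3, 0.4], // token 1
  [0.5, 0.6], // token 2
  [0.7, 0.8], // token 3
];
const posTable = [
  [0.01, 0.02], // position 0
  [0.03, 0.04], // position 1
  [0.05, 0.06], // position 2
];

// Embed a sequence of token IDs and add positional information:
// [seq_len] -> [seq_len, dmodel]
function embed(tokenIds) {
  return tokenIds.map((id, pos) =>
    tokenTable[id].map((v, j) => v + posTable[pos][j])
  );
}

const hidden = embed([2, 0, 3]);
// hidden[0] ≈ [0.51, 0.62]: token 2's vector plus position 0's vector
```

Because the position vector depends on where the token sits, the same token ID produces different hidden states at different positions, which is what lets attention reason about order.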
Transformer Decoder Block
Each decoder block follows a pre-norm architecture (GPT-2 style):
- LayerNorm → Multi-Head Self-Attention (with causal mask) → Residual Connection
- LayerNorm → Feed-Forward Network (FFN) → Residual Connection
The causal mask ensures that each position can only attend to previous positions (and itself), enforcing the autoregressive property required for language modeling.
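The causal mask itself is just a lower-triangular matrix. A minimal plain-JavaScript sketch (hypothetical helper name):

```javascript
// Build a [seqLen, seqLen] causal mask: entry [i][j] is 1 when position i
// may attend to position j (i.e. j <= i), and 0 otherwise. In practice the
// masked-out entries are applied as -Infinity added to the attention
// logits before the softmax, which zeroes their attention weights.
function causalMask(seqLen) {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) => (j <= i ? 1 : 0))
  );
}

const mask = causalMask(3);
// [ [1, 0, 0],
//   [1, 1, 0],
//   [1, 1, 1] ]
```

Row i of the mask is position i's view of the sequence: it can see positions 0..i and nothing later, which is exactly the autoregressive constraint.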
Architecture Dimensions
| Hyperparameter | Symbol | GPT-2 Small | Description |
|---|---|---|---|
| Vocabulary Size | V | 50,257 | Number of tokens in the vocabulary |
| Number of Layers | N | 12 | Number of transformer decoder blocks |
| Number of Heads | h | 12 | Number of attention heads per layer |
| Hidden Dimension | dmodel | 768 | Dimensionality of token representations |
| Intermediate Dimension | dff | 3,072 | Inner dimension of the feed-forward network (typically 4 × dmodel) |
| Max Sequence Length | L | 1,024 | Maximum number of positions the model can process |
| Dropout | p | 0.1 | Dropout probability for regularization |
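The hyperparameters in the table pin down the parameter count. The sketch below (plain JavaScript, assuming GPT-2's weight shapes with biases, learned position embeddings, and an output head tied to the token embedding) totals about 124M; the 117M figure quoted for GPT-2 Small is a slight undercount relative to the released checkpoint:

```javascript
// Count parameters for a GPT-2-style backbone from its hyperparameters.
// Assumes biases on all projections and a tied (shared) token-embedding/
// LM-head matrix, matching the released GPT-2 checkpoints.
function paramCount({ V, N, dModel, dFF, L }) {
  const tokEmb = V * dModel;           // token embedding (also the LM head)
  const posEmb = L * dModel;           // learned position embedding
  const lnParams = 2 * dModel;         // scale + shift per layer norm
  const attn =
    dModel * 3 * dModel + 3 * dModel   // fused Q, K, V projection
    + dModel * dModel + dModel;        // attention output projection
  const ffn =
    dModel * dFF + dFF                 // expand to the inner dimension
    + dFF * dModel + dModel;           // contract back to dmodel
  const perBlock = 2 * lnParams + attn + ffn;
  return tokEmb + posEmb + N * perBlock + lnParams; // + final layer norm
}

const total = paramCount({ V: 50257, N: 12, dModel: 768, dFF: 3072, L: 1024 });
// total === 124439808, i.e. roughly 124M parameters
```

Note how the embeddings alone account for roughly 39M of the total, which is why tying the output head to the token embedding is a meaningful saving at this scale.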
Information Flow
The data flows through the backbone as follows:
- Input token IDs: shape [batch, seq_len]
- Token embedding: [batch, seq_len] → [batch, seq_len, dmodel]
- Position embedding added: [batch, seq_len, dmodel]
- Through N decoder blocks: [batch, seq_len, dmodel] → [batch, seq_len, dmodel]
- Final layer norm: [batch, seq_len, dmodel]
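The flow above can be traced with a toy shape-propagation check (plain JavaScript; shapes only, no real tensors):

```javascript
// Propagate tensor shapes through the backbone, mirroring the steps above.
function backboneShapes(batch, seqLen, dModel, numBlocks) {
  const trace = [];
  let shape = [batch, seqLen];                   // input token IDs
  trace.push(["input", shape]);
  shape = [batch, seqLen, dModel];               // token embedding lookup
  trace.push(["token_embedding", shape]);
  trace.push(["add_position_embedding", shape]); // element-wise add: same shape
  for (let i = 0; i < numBlocks; i++) {
    trace.push([`block_${i}`, shape]);           // each block preserves shape
  }
  trace.push(["final_layer_norm", shape]);
  return trace;
}

const trace = backboneShapes(2, 16, 768, 12);
// Last entry: ["final_layer_norm", [2, 16, 768]]
```

The key invariant is that everything after the embedding step is shape-preserving, which is what makes stacking an arbitrary number N of identical decoder blocks possible.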
Academic Reference
Language Models are Unsupervised Multitask Learners
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. This paper introduced GPT-2 and demonstrated that large-scale language models trained on diverse text can perform a wide range of NLP tasks without explicit supervision.
Key Properties
- Decoder-only: No encoder or cross-attention; uses only causal self-attention.
- Pre-norm architecture: Layer normalization is applied before (not after) each sub-layer, improving training stability.
- Learnable position embeddings: Unlike sinusoidal embeddings in the original Transformer, GPT-2 uses learnable position vectors.
- Scalable: The same architecture scales from GPT-2 Small (117M parameters as reported in the paper; ~124M in the released checkpoint) to GPT-2 XL (1.5B parameters) by varying hyperparameters.
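The pre-norm property can be written as x + sublayer(layerNorm(x)). A minimal plain-JavaScript illustration, with an identity sub-layer standing in for attention or the FFN:

```javascript
// Layer normalization over a single vector: zero mean, unit variance
// (gamma = 1 and beta = 0 assumed here; GPT-2 learns them per layer).
function layerNorm(x, eps = 1e-5) {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
  const std = Math.sqrt(variance + eps);
  return x.map((v) => (v - mean) / std);
}

// Pre-norm residual wiring: normalize first, run the sub-layer, then add
// the result back onto the unnormalized input.
function preNorm(x, sublayer) {
  const out = sublayer(layerNorm(x));
  return x.map((v, i) => v + out[i]);
}

const y = preNorm([1, 2, 3], (v) => v); // identity sub-layer for illustration
// y ≈ [-0.22, 2, 4.22]
```

Because the residual branch carries the raw input x untouched, gradients flow through the stack without passing through every normalization, which is the training-stability advantage pre-norm has over the original post-norm arrangement.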
Implementation
Implementation:Tensorflow_Tfjs_GPT2Backbone_Constructor
Domains
Sources
- TensorFlow.js
- Language Models are Unsupervised Multitask Learners