Principle: Microsoft LoRA GPT-2 LoRA Model Configuration
Overview
GPT2 LoRA Model Configuration is the principle of configuring a GPT-2 transformer model with Low-Rank Adaptation (LoRA) parameters injected into its attention layers. The configuration object controls both the standard transformer architecture hyperparameters (embedding dimension, number of layers, number of heads) and the LoRA-specific hyperparameters (rank, scaling alpha, dropout). This unified configuration approach allows the same model class to instantiate both baseline GPT-2 models (when lora_attn_dim=0) and LoRA-augmented models (when lora_attn_dim > 0).
Description
Transformer Architecture Parameters
The GPT-2 architecture is controlled by three primary parameters:
- n_embd -- The hidden embedding dimension (also called d_model).
- n_layer -- The number of transformer blocks stacked sequentially.
- n_head -- The number of attention heads in multi-head attention.
These three parameters fully determine the model capacity. The repository defines three standard presets:
| Model Card | n_embd | n_layer | n_head | Approx. Parameters |
|---|---|---|---|---|
| gpt2.sm | 768 | 12 | 12 | 117M |
| gpt2.md | 1024 | 24 | 16 | 345M |
| gpt2.lg | 1280 | 36 | 20 | 774M |
Additional architecture parameters include:
- vocab_size (default: 50257) -- The GPT-2 BPE vocabulary size.
- n_positions (default: 1024) -- Maximum sequence length supported by positional embeddings.
- n_ctx (default: 1024) -- Context window size for the causal attention mask.
- layer_norm_epsilon (default: 1e-5) -- Epsilon for layer normalization numerical stability.
- initializer_range (default: 0.02) -- Standard deviation for weight initialization.
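The presets and defaults above can be captured in a small configuration object. The sketch below is illustrative only, assuming a plain dataclass with the field names and defaults listed in this section; the actual class in microsoft/LoRA may be structured differently.

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    # Illustrative config mirroring the parameters described above;
    # not the repository's exact class.
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    vocab_size: int = 50257
    n_positions: int = 1024
    n_ctx: int = 1024
    layer_norm_epsilon: float = 1e-5
    initializer_range: float = 0.02

# The three standard presets from the table above.
PRESETS = {
    "gpt2.sm": GPT2Config(n_embd=768, n_layer=12, n_head=12),
    "gpt2.md": GPT2Config(n_embd=1024, n_layer=24, n_head=16),
    "gpt2.lg": GPT2Config(n_embd=1280, n_layer=36, n_head=20),
}

md = PRESETS["gpt2.md"]
print(md.n_embd, md.n_layer, md.n_head)  # 1024 24 16
```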
LoRA Hyperparameters
LoRA injects trainable low-rank matrices into the attention layers while freezing the pretrained weights. The key LoRA parameters are:
- lora_attn_dim (default: 0) -- The rank r of the low-rank decomposition. When set to 0, no LoRA adaptation is applied. Typical values are 1, 2, 4, or 8.
- lora_attn_alpha (default: 128) -- The scaling factor alpha. The LoRA output is scaled by alpha / r, allowing the learning rate to remain stable across different rank values.
- lora_dropout (default: 0.0) -- Dropout probability applied to the LoRA input before the low-rank projection.
- lora_r_dropout (default: 0.0) -- Additional dropout parameter for LoRA layers.
- fix_dropout (default: 0.0) -- Fixed dropout applied to the model.
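The alpha / r scaling rule can be made concrete with a one-line helper. This is a minimal sketch of the arithmetic only (the helper name is not from the repository): with the default alpha of 128, the LoRA delta is scaled by a constant that shrinks as the rank grows, which is what keeps tuning stable across rank choices.

```python
def lora_scaling(lora_attn_alpha: int, lora_attn_dim: int) -> float:
    """Scale applied to the LoRA delta: alpha / r, per the LoRA paper."""
    if lora_attn_dim == 0:
        # lora_attn_dim=0 means LoRA is disabled; no scaling applies.
        raise ValueError("lora_attn_dim=0 disables LoRA")
    return lora_attn_alpha / lora_attn_dim

# With the default alpha=128 and the typical ranks listed above:
for r in (1, 2, 4, 8):
    print(r, lora_scaling(128, r))  # 128.0, 64.0, 32.0, 16.0
```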
LoRA Injection into Attention
LoRA is applied to the query (Q) and value (V) projections of the multi-head attention mechanism via lora.MergedLinear. The standard GPT-2 attention layer computes Q, K, V through a single fused linear projection (c_attn) that maps from n_embd to 3 * n_embd. LoRA adapts this fused projection with the following configuration:
```python
self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],
    fan_in_fan_out=True,
    merge_weights=False
)
```
The critical configuration choices are:
- enable_lora=[True, False, True] -- This list controls which of the three output groups (Q, K, V) receive LoRA adaptation. Only Q and V are adapted; K is left unchanged, following the findings in the LoRA paper (Hu et al., 2021).
- fan_in_fan_out=True -- This indicates that the weight matrix is stored in transposed form (fan_in x fan_out), matching GPT-2's Conv1D convention where the weight shape is (n_embd, 3 * n_embd).
- merge_weights=False -- During training, the LoRA weights are kept separate from the pretrained weights, enabling efficient checkpoint saving of only the LoRA parameters.
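To see how few parameters this configuration trains, the trainable LoRA count for a fused projection can be derived from the factor shapes. The sketch below assumes loralib's MergedLinear conventions as I understand them (lora_A of shape (r * n_enabled, in_features) and lora_B of shape ((out_features / len(enable_lora)) * n_enabled, r)); the helper name is illustrative, not from the repository.

```python
def merged_linear_lora_params(in_features: int, out_features: int,
                              r: int, enable_lora: list) -> int:
    """Trainable LoRA parameter count for a fused linear projection,
    assuming loralib-style factor shapes (see lead-in)."""
    n_enabled = sum(enable_lora)
    a_params = r * n_enabled * in_features                      # lora_A
    b_params = (out_features // len(enable_lora)) * n_enabled * r  # lora_B
    return a_params + b_params

# GPT-2 Medium c_attn: nx = 1024, fused output = 3 * 1024, rank 4,
# with Q and V adapted and K skipped.
print(merged_linear_lora_params(1024, 3 * 1024, 4, [True, False, True]))  # 16384
```

This matches the 16,384-parameter figure derived in the Theoretical Basis section below.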
Model Architecture
The full model is composed of:
- GPT2Model -- The transformer backbone containing token embeddings (wte), position embeddings (wpe), a stack of n_layer transformer blocks, and a final layer norm (ln_f).
- GPT2LMHead -- A tied language model head that reuses the token embedding weights for output projection.
- GPT2LMModel -- The complete language model combining GPT2Model and GPT2LMHead, with support for label smoothing and accuracy reporting.
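The "tied" head means the output projection introduces no new parameters: logits are dot products of the hidden state against the token embedding rows. A minimal pure-Python sketch of this weight sharing, using a toy 4-token vocabulary with 3-dimensional embeddings (all values here are made up for illustration):

```python
# Toy token embedding matrix (wte): 4 tokens x 3 dimensions.
wte = [[0.1, 0.0, 0.2],
       [0.0, 0.3, 0.1],
       [0.2, 0.1, 0.0],
       [0.3, 0.0, 0.3]]

# Weight tying: the LM head reuses the same matrix object, so no
# additional output-projection parameters are allocated.
lm_head_weight = wte

def lm_logits(hidden, weight):
    # One logit per vocabulary token: dot(hidden, embedding_row).
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

print(lm_logits([1.0, 2.0, 3.0], lm_head_weight))  # four logits, one per token
```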
Theoretical Basis
The LoRA paper demonstrates that the weight updates during fine-tuning have a low intrinsic rank. By decomposing the weight update as delta_W = B * A where B in R^(d x r) and A in R^(r x k), LoRA reduces the number of trainable parameters from d * k to (d + k) * r. For GPT-2 Medium with rank 4, this means each attention layer adds only 2 * (1024 + 1024) * 4 = 16,384 trainable parameters (for Q and V combined), compared to 2 * 1024 * 1024 = 2,097,152 parameters in the original Q and V projections.
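The arithmetic above can be checked directly from the decomposition: a full update delta_W in R^(d x k) costs d * k parameters, while the factors B in R^(d x r) and A in R^(r x k) cost (d + k) * r. The helper names below are illustrative only.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a full-rank weight update delta_W in R^(d x k)."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in the factors B in R^(d x r) and A in R^(r x k)."""
    return (d + k) * r

# GPT-2 Medium: d = k = 1024, rank 4, with both Q and V adapted (factor of 2).
d = k = 1024
print(2 * lora_update_params(d, k, 4))  # 16384 trainable LoRA parameters
print(2 * full_update_params(d, k))     # 2097152 parameters in the original Q and V
```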
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Model Configuration, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |