Principle: Microsoft LoRA GPT-2 LoRA Model Configuration
Overview
GPT2 LoRA Model Configuration is the principle of configuring a GPT-2 transformer model with Low-Rank Adaptation (LoRA) parameters injected into its attention layers. The configuration object controls both the standard transformer architecture hyperparameters (embedding dimension, number of layers, number of heads) and the LoRA-specific hyperparameters (rank, scaling alpha, dropout). This unified configuration approach allows the same model class to instantiate both baseline GPT-2 models (when lora_attn_dim=0) and LoRA-augmented models (when lora_attn_dim > 0).
Description
Transformer Architecture Parameters
The GPT-2 architecture is controlled by three primary parameters:
- n_embd -- The hidden embedding dimension (also called d_model).
- n_layer -- The number of transformer blocks stacked sequentially.
- n_head -- The number of attention heads in multi-head attention.
These three parameters fully determine the model capacity. The repository defines three standard presets:
| Model Card | n_embd | n_layer | n_head | Approx. Parameters |
|---|---|---|---|---|
| gpt2.sm | 768 | 12 | 12 | 117M |
| gpt2.md | 1024 | 24 | 16 | 345M |
| gpt2.lg | 1280 | 36 | 20 | 774M |
Additional architecture parameters include:
- vocab_size (default: 50257) -- The GPT-2 BPE vocabulary size.
- n_positions (default: 1024) -- Maximum sequence length supported by positional embeddings.
- n_ctx (default: 1024) -- Context window size for the causal attention mask.
- layer_norm_epsilon (default: 1e-5) -- Epsilon for layer normalization numerical stability.
- initializer_range (default: 0.02) -- Standard deviation for weight initialization.
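The presets and defaults above can be captured in a small configuration object. The sketch below is illustrative only, assuming a plain dataclass with the field names and defaults listed in this section; the actual class in microsoft/LoRA may be structured differently.

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    # Illustrative config mirroring the parameters described above;
    # not the repository's exact class.
    n_embd: int = 768
    n_layer: int = 12
    n_head: int = 12
    vocab_size: int = 50257
    n_positions: int = 1024
    n_ctx: int = 1024
    layer_norm_epsilon: float = 1e-5
    initializer_range: float = 0.02

# The three standard presets from the table above.
PRESETS = {
    "gpt2.sm": GPT2Config(n_embd=768, n_layer=12, n_head=12),
    "gpt2.md": GPT2Config(n_embd=1024, n_layer=24, n_head=16),
    "gpt2.lg": GPT2Config(n_embd=1280, n_layer=36, n_head=20),
}

md = PRESETS["gpt2.md"]
print(md.n_embd, md.n_layer, md.n_head)  # 1024 24 16
```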
LoRA Hyperparameters
LoRA injects trainable low-rank matrices into the attention layers while freezing the pretrained weights. The key LoRA parameters are:
- lora_attn_dim (default: 0) -- The rank r of the low-rank decomposition. When set to 0, no LoRA adaptation is applied. Typical values are 1, 2, 4, or 8.
- lora_attn_alpha (default: 128) -- The scaling factor alpha. The LoRA output is scaled by alpha / r, allowing the learning rate to remain stable across different rank values.
- lora_dropout (default: 0.0) -- Dropout probability applied to the LoRA input before the low-rank projection.
- lora_r_dropout (default: 0.0) -- Additional dropout parameter for LoRA layers.
- fix_dropout (default: 0.0) -- Fixed dropout applied to the model.
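The alpha / r scaling rule can be made concrete with a one-line helper. This is a minimal sketch of the arithmetic only (the helper name is not from the repository): with the default alpha of 128, the LoRA delta is scaled by a constant that shrinks as the rank grows, which is what keeps tuning stable across rank choices.

```python
def lora_scaling(lora_attn_alpha: int, lora_attn_dim: int) -> float:
    """Scale applied to the LoRA delta: alpha / r, per the LoRA paper."""
    if lora_attn_dim == 0:
        # lora_attn_dim=0 means LoRA is disabled; no scaling applies.
        raise ValueError("lora_attn_dim=0 disables LoRA")
    return lora_attn_alpha / lora_attn_dim

# With the default alpha=128 and the typical ranks listed above:
for r in (1, 2, 4, 8):
    print(r, lora_scaling(128, r))  # 128.0, 64.0, 32.0, 16.0
```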
LoRA Injection into Attention
LoRA is applied to the query (Q) and value (V) projections of the multi-head attention mechanism via lora.MergedLinear. The standard GPT-2 attention layer computes Q, K, V through a single fused linear projection (c_attn) that maps from n_embd to 3 * n_embd. LoRA adapts this fused projection with the following configuration:
```python
self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],
    fan_in_fan_out=True,
    merge_weights=False
)
```
The critical configuration choices are:
- enable_lora=[True, False, True] -- This list controls which of the three output groups (Q, K, V) receive LoRA adaptation. Only Q and V are adapted; K is left unchanged, following the findings in the LoRA paper (Hu et al., 2021).
- fan_in_fan_out=True -- This indicates that the weight matrix is stored in transposed form (fan_in x fan_out), matching GPT-2's Conv1D convention where the weight shape is (n_embd, 3 * n_embd).
- merge_weights=False -- During training, the LoRA weights are kept separate from the pretrained weights, enabling efficient checkpoint saving of only the LoRA parameters.
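To see how few parameters this configuration trains, the trainable LoRA count for a fused projection can be derived from the factor shapes. The sketch below assumes loralib's MergedLinear conventions as I understand them (lora_A of shape (r * n_enabled, in_features) and lora_B of shape ((out_features / len(enable_lora)) * n_enabled, r)); the helper name is illustrative, not from the repository.

```python
def merged_linear_lora_params(in_features: int, out_features: int,
                              r: int, enable_lora: list) -> int:
    """Trainable LoRA parameter count for a fused linear projection,
    assuming loralib-style factor shapes (see lead-in)."""
    n_enabled = sum(enable_lora)
    a_params = r * n_enabled * in_features                      # lora_A
    b_params = (out_features // len(enable_lora)) * n_enabled * r  # lora_B
    return a_params + b_params

# GPT-2 Medium c_attn: nx = 1024, fused output = 3 * 1024, rank 4,
# with Q and V adapted and K skipped.
print(merged_linear_lora_params(1024, 3 * 1024, 4, [True, False, True]))  # 16384
```

This matches the 16,384-parameter figure derived in the Theoretical Basis section below.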
Model Architecture
The full model is composed of:
- GPT2Model -- The transformer backbone containing token embeddings (wte), position embeddings (wpe), a stack of n_layer transformer blocks, and a final layer norm (ln_f).
- GPT2LMHead -- A tied language model head that reuses the token embedding weights for output projection.
- GPT2LMModel -- The complete language model combining GPT2Model and GPT2LMHead, with support for label smoothing and accuracy reporting.
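The "tied" head means the output projection introduces no new parameters: logits are dot products of the hidden state against the token embedding rows. A minimal pure-Python sketch of this weight sharing, using a toy 4-token vocabulary with 3-dimensional embeddings (all values here are made up for illustration):

```python
# Toy token embedding matrix (wte): 4 tokens x 3 dimensions.
wte = [[0.1, 0.0, 0.2],
       [0.0, 0.3, 0.1],
       [0.2, 0.1, 0.0],
       [0.3, 0.0, 0.3]]

# Weight tying: the LM head reuses the same matrix object, so no
# additional output-projection parameters are allocated.
lm_head_weight = wte

def lm_logits(hidden, weight):
    # One logit per vocabulary token: dot(hidden, embedding_row).
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

print(lm_logits([1.0, 2.0, 3.0], lm_head_weight))  # four logits, one per token
```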
Theoretical Basis
The LoRA paper demonstrates that the weight updates during fine-tuning have a low intrinsic rank. By decomposing the weight update as delta_W = B * A where B in R^(d x r) and A in R^(r x k), LoRA reduces the number of trainable parameters from d * k to (d + k) * r. For GPT-2 Medium with rank 4, this means each attention layer adds only 2 * (1024 + 1024) * 4 = 16,384 trainable parameters (for Q and V combined), compared to 2 * 1024 * 1024 = 2,097,152 parameters in the original Q and V projections.
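The arithmetic above can be checked directly from the decomposition: a full update delta_W in R^(d x k) costs d * k parameters, while the factors B in R^(d x r) and A in R^(r x k) cost (d + k) * r. The helper names below are illustrative only.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a full-rank weight update delta_W in R^(d x k)."""
    return d * k

def lora_update_params(d: int, k: int, r: int) -> int:
    """Parameters in the factors B in R^(d x r) and A in R^(r x k)."""
    return (d + k) * r

# GPT-2 Medium: d = k = 1024, rank 4, with both Q and V adapted (factor of 2).
d = k = 1024
print(2 * lora_update_params(d, k, 4))  # 16384 trainable LoRA parameters
print(2 * full_update_params(d, k))     # 2097152 parameters in the original Q and V
```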
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Model Configuration, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |