
Principle:Microsoft LoRA GPT2 LoRA Model Configuration

From Leeroopedia


Overview

GPT2 LoRA Model Configuration is the principle of configuring a GPT-2 transformer model with Low-Rank Adaptation (LoRA) parameters injected into its attention layers. The configuration object controls both the standard transformer architecture hyperparameters (embedding dimension, number of layers, number of heads) and the LoRA-specific hyperparameters (rank, scaling alpha, dropout). This unified configuration approach allows the same model class to instantiate both baseline GPT-2 models (when lora_attn_dim=0) and LoRA-augmented models (when lora_attn_dim > 0).

Description

Transformer Architecture Parameters

The GPT-2 architecture is controlled by three primary parameters:

  • n_embd -- The hidden embedding dimension (also called d_model).
  • n_layer -- The number of transformer blocks stacked sequentially.
  • n_head -- The number of attention heads in multi-head attention.

These three parameters largely determine the model's capacity. The repository defines three standard presets:

Model Card    n_embd    n_layer    n_head    Approx. Parameters
gpt2.sm       768       12         12        117M
gpt2.md       1024      24         16        345M
gpt2.lg       1280      36         20        774M

Additional architecture parameters include:

  • vocab_size (default: 50257) -- The GPT-2 BPE vocabulary size.
  • n_positions (default: 1024) -- Maximum sequence length supported by positional embeddings.
  • n_ctx (default: 1024) -- Context window size for the causal attention mask.
  • layer_norm_epsilon (default: 1e-5) -- Epsilon for layer normalization numerical stability.
  • initializer_range (default: 0.02) -- Standard deviation for weight initialization.
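The parameters above can be collected into a small configuration object. The sketch below is a minimal stand-in for the repository's GPT2Config class, using the field names and defaults listed in this section; the PRESETS helper dictionary is illustrative and not part of the repo.

```python
from dataclasses import dataclass

@dataclass
class GPT2Config:
    """Minimal sketch of the architecture fields described above."""
    vocab_size: int = 50257        # GPT-2 BPE vocabulary size
    n_positions: int = 1024        # max sequence length
    n_ctx: int = 1024              # causal-attention context window
    n_embd: int = 768              # hidden embedding dimension (d_model)
    n_layer: int = 12              # number of transformer blocks
    n_head: int = 12               # attention heads
    layer_norm_epsilon: float = 1e-5
    initializer_range: float = 0.02

# The three standard presets from the table above (dictionary is illustrative):
PRESETS = {
    "gpt2.sm": GPT2Config(n_embd=768,  n_layer=12, n_head=12),
    "gpt2.md": GPT2Config(n_embd=1024, n_layer=24, n_head=16),
    "gpt2.lg": GPT2Config(n_embd=1280, n_layer=36, n_head=20),
}
```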

LoRA Hyperparameters

LoRA injects trainable low-rank matrices into the attention layers while freezing the pretrained weights. The key LoRA parameters are:

  • lora_attn_dim (default: 0) -- The rank r of the low-rank decomposition. When set to 0, no LoRA adaptation is applied. Typical values are 1, 2, 4, or 8.
  • lora_attn_alpha (default: 128) -- The scaling factor alpha. The LoRA output is scaled by alpha / r, allowing the learning rate to remain stable across different rank values.
  • lora_dropout (default: 0.0) -- Dropout probability applied to the LoRA input before the low-rank projection.
  • lora_r_dropout (default: 0.0) -- Additional dropout parameter for LoRA layers.
  • fix_dropout (default: 0.0) -- Fixed dropout applied to the model.
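The alpha / r scaling mentioned above is simple to state in code. The helper below is a sketch of the arithmetic only (the function name is illustrative, not a repo API): with the default lora_attn_alpha=128, changing the rank changes the scale proportionally, so the effective magnitude of the LoRA update stays comparable without retuning the learning rate.

```python
def lora_scaling(lora_attn_alpha: int, lora_attn_dim: int) -> float:
    """LoRA output is scaled by alpha / r before being added to the frozen path."""
    if lora_attn_dim == 0:
        raise ValueError("lora_attn_dim=0 means LoRA is disabled")
    return lora_attn_alpha / lora_attn_dim

# With the default alpha=128, rank 4 gives a scale of 32.0:
scale = lora_scaling(128, 4)  # -> 32.0
```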

LoRA Injection into Attention

LoRA is applied to the query (Q) and value (V) projections of the multi-head attention mechanism via lora.MergedLinear. The standard GPT-2 attention layer computes Q, K, V through a single fused linear projection (c_attn) that maps from n_embd to 3 * n_embd. LoRA adapts this fused projection with the following configuration:

self.c_attn = lora.MergedLinear(
    nx, n_state * 3,                    # fused projection producing Q, K, V
    r=config.lora_attn_dim,             # LoRA rank r (0 disables LoRA)
    lora_alpha=config.lora_attn_alpha,  # numerator of the alpha / r scaling
    lora_dropout=config.lora_dropout,   # dropout on the LoRA input
    enable_lora=[True, False, True],    # adapt Q and V; leave K frozen
    fan_in_fan_out=True,                # GPT-2 Conv1D stores W transposed
    merge_weights=False                 # keep LoRA weights separate in training
)

The critical configuration choices are:

  • enable_lora=[True, False, True] -- This list controls which of the three output groups (Q, K, V) receives LoRA adaptation. Only Q and V are adapted; K is left unchanged, following the findings in the LoRA paper (Hu et al., 2021).
  • fan_in_fan_out=True -- This indicates that the weight matrix is stored in transposed form (fan_in x fan_out), matching GPT-2's Conv1D convention where the weight shape is (n_embd, 3 * n_embd).
  • merge_weights=False -- During training, the LoRA weights are kept separate from the pretrained weights, enabling efficient checkpoint saving of only the LoRA parameters.
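To make the group-wise adaptation concrete, the sketch below is an illustrative stand-in for lora.MergedLinear, not the actual implementation: one fused linear layer produces Q, K, V, and a separate (A, B) low-rank pair is added only to the groups flagged in enable_lora. B is initialized to zero, so the layer starts out exactly equal to the frozen baseline.

```python
import torch
import torch.nn as nn

class TinyMergedLoRA(nn.Module):
    """Illustrative stand-in for lora.MergedLinear: a fused c_attn-style
    projection where LoRA deltas are added only to the enabled groups."""
    def __init__(self, nx, n_state, r, alpha, enable_lora=(True, False, True)):
        super().__init__()
        self.base = nn.Linear(nx, 3 * n_state, bias=True)  # frozen pretrained path
        self.base.weight.requires_grad_(False)
        self.enable = enable_lora
        self.scale = alpha / r
        # One (A, B) pair per adapted group; B starts at zero so delta_W = 0.
        self.A = nn.ParameterDict()
        self.B = nn.ParameterDict()
        for i, on in enumerate(enable_lora):
            if on:
                self.A[str(i)] = nn.Parameter(torch.randn(r, nx) * 0.01)
                self.B[str(i)] = nn.Parameter(torch.zeros(n_state, r))

    def forward(self, x):
        out = self.base(x)
        n_state = out.shape[-1] // 3
        chunks = list(out.split(n_state, dim=-1))  # -> [Q, K, V]
        for i, on in enumerate(self.enable):
            if on:  # add the scaled low-rank update to Q and V only
                delta = (x @ self.A[str(i)].T) @ self.B[str(i)].T
                chunks[i] = chunks[i] + delta * self.scale
        return torch.cat(chunks, dim=-1)
```

Because B is zero at initialization, the adapted layer initially reproduces the pretrained outputs, which is the behavior the real MergedLinear relies on when fine-tuning starts.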

Model Architecture

The full model is composed of:

  • GPT2Model -- The transformer backbone containing token embeddings (wte), position embeddings (wpe), a stack of n_layer transformer blocks, and a final layer norm (ln_f).
  • GPT2LMHead -- A tied language model head that reuses the token embedding weights for output projection.
  • GPT2LMModel -- The complete language model combining GPT2Model and GPT2LMHead, with support for label smoothing and accuracy reporting.
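The weight tying in GPT2LMHead can be sketched in a few lines. This is an assumption-level illustration (class name and structure are simplified, not the repo's code): the output projection shares storage with the token embedding matrix, so decoding adds no new parameters.

```python
import torch
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Sketch of a tied LM head: the output projection reuses the
    token embedding weights rather than allocating its own matrix."""
    def __init__(self, wte: nn.Embedding):
        super().__init__()
        self.decoder = nn.Linear(wte.embedding_dim, wte.num_embeddings, bias=False)
        self.decoder.weight = wte.weight  # share storage with the embedding

    def forward(self, hidden):
        # hidden: (..., n_embd) -> logits over the vocabulary
        return self.decoder(hidden)

wte = nn.Embedding(50257, 768)   # vocab_size x n_embd, as in the defaults above
head = TiedLMHead(wte)
```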

Theoretical Basis

The LoRA paper demonstrates that the weight updates during fine-tuning have a low intrinsic rank. By decomposing the weight update as delta_W = B * A where B in R^(d x r) and A in R^(r x k), LoRA reduces the number of trainable parameters from d * k to (d + k) * r. For GPT-2 Medium with rank 4, this means each attention layer adds only 2 * (1024 + 1024) * 4 = 16,384 trainable parameters (for Q and V combined), compared to 2 * 1024 * 1024 = 2,097,152 parameters in the original Q and V projections.
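The parameter counts above can be verified directly. The helper names below are illustrative; the arithmetic follows the formulas in the paragraph.

```python
def lora_trainable_params(d: int, k: int, r: int, n_adapted: int = 2) -> int:
    """Trainable LoRA parameters for n_adapted projections of shape (d, k):
    each decomposition B (d x r) plus A (r x k) contributes (d + k) * r."""
    return n_adapted * (d + k) * r

def full_params(d: int, k: int, n_adapted: int = 2) -> int:
    """Parameters in the original dense projections."""
    return n_adapted * d * k

# GPT-2 Medium attention (n_embd = 1024), rank 4, Q and V adapted:
assert lora_trainable_params(1024, 1024, 4) == 16_384
assert full_params(1024, 1024) == 2_097_152
```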

Metadata

Field Value
Source microsoft/LoRA
Domains Model Configuration, NLG
Type External Tool Doc
Last Updated 2026-02-10
