Heuristic:Microsoft LoRA Fan In Fan Out Transpose

From Leeroopedia

Knowledge Sources
Domains: Debugging, LLMs
Last Updated: 2026-02-10 05:30 GMT

Overview

When adapting GPT-2 attention layers, set `fan_in_fan_out=True` because GPT-2 uses a custom Conv1D that stores weights transposed compared to standard `nn.Linear`.

Description

GPT-2's original implementation uses a custom `Conv1D` class (not `nn.Conv1d`) that stores weight matrices as `(fan_in, fan_out)` — the transpose of PyTorch's standard `nn.Linear` which stores as `(fan_out, fan_in)`. When replacing GPT-2's attention projection with LoRA's `Linear` or `MergedLinear`, the `fan_in_fan_out` flag must be set to `True`. This causes the LoRA layer to transpose the weight matrix during initialization and apply a transpose function `T(w)` during forward passes and weight merging.

Usage

Set `fan_in_fan_out=True` whenever applying LoRA to GPT-2 or any model that uses a Conv1D-style weight layout `(input_dim, output_dim)` instead of the standard Linear layout `(output_dim, input_dim)`. Forgetting this flag will produce incorrect outputs because matrix multiplications will use the wrong dimension ordering.
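The failure mode can be illustrated with a square layer, where the bug is silent rather than a shape error (for non-square layers like `c_attn`, the transposed multiply fails outright with a dimension mismatch):

```python
import torch

torch.manual_seed(0)
nx = 8                           # square case: the bug produces no error
w_conv1d = torch.randn(nx, nx)   # stored Conv1D-style: (fan_in, fan_out)
x = torch.randn(2, nx)

correct = x @ w_conv1d        # Conv1D semantics (what the flag preserves)
wrong = x @ w_conv1d.t()      # treating the weight as nn.Linear layout
print(torch.allclose(correct, wrong))  # False: silently different outputs
```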

The Insight (Rule of Thumb)

  • Action: Set `fan_in_fan_out=True` in `lora.Linear` or `lora.MergedLinear` when the target layer stores weights as `(fan_in, fan_out)`.
  • Value: Boolean flag, `True` for GPT-2 Conv1D layers, `False` for standard nn.Linear.
  • Trade-off: None — this is a correctness requirement, not an optimization choice.
  • Detection: Check if the original layer's weight shape is `(input_features, output_features)` instead of `(output_features, input_features)`.
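The detection step above can be sketched as a small helper (the function name is hypothetical; it assumes the layer's intended input/output feature counts are known and differ, since a square weight is ambiguous by shape alone):

```python
import torch
import torch.nn as nn

def needs_fan_in_fan_out(weight, in_features, out_features):
    """True if the weight is stored Conv1D-style as (fan_in, fan_out)."""
    return tuple(weight.shape) == (in_features, out_features)

linear = nn.Linear(768, 2304)      # standard layout: (fan_out, fan_in)
conv1d_w = torch.empty(768, 2304)  # GPT-2 Conv1D layout: (fan_in, fan_out)

print(needs_fan_in_fan_out(linear.weight, 768, 2304))  # False
print(needs_fan_in_fan_out(conv1d_w, 768, 2304))       # True
```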

Reasoning

GPT-2's `Conv1D` class computes `x @ weight + bias` (with weight shape `(nx, nf)`), while PyTorch's `nn.Linear` computes `x @ weight.T + bias` (with weight shape `(nf, nx)`). The `fan_in_fan_out` flag ensures the LoRA layer correctly handles this transposition. Without it, the pretrained weights would be applied incorrectly, and the LoRA update BA would be added in the wrong orientation.
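The effect of the flag on the base projection can be reproduced in a few lines (a sketch using `F.linear`, which computes `x @ weight.T`, the same primitive loralib's forward pass builds on):

```python
import torch
import torch.nn.functional as F

def T(w, fan_in_fan_out=True):
    # mirrors loralib's helper: undo the Conv1D layout at forward time
    return w.transpose(0, 1) if fan_in_fan_out else w

nx, nf = 6, 10
w = torch.randn(nx, nf)   # pretrained Conv1D weight: (fan_in, fan_out)
x = torch.randn(3, nx)

# F.linear computes x @ weight.T, so transposing first recovers x @ w,
# which is exactly what GPT-2's Conv1D would have computed.
out = F.linear(x, T(w))
assert torch.allclose(out, x @ w)
```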

Code Evidence

GPT-2 Conv1D weight layout from `examples/NLG/src/model.py:67-80`:

class Conv1D(nn.Module):
    def __init__(self, nf, nx):
        super(Conv1D, self).__init__()
        self.nf = nf
        w = torch.empty(nx, nf)  # NOTE: (fan_in, fan_out) layout
        nn.init.normal_(w, std=0.02)
        self.weight = Parameter(w)
        self.bias = Parameter(torch.zeros(nf))

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)  # x @ weight, NOT x @ weight.T
        x = x.view(*size_out)
        return x

LoRA MergedLinear with fan_in_fan_out=True in attention from `examples/NLG/src/model.py:94-102`:

self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],
    fan_in_fan_out=True,
    merge_weights=False
)

Transpose helper function in LoRA Linear from `loralib/layers.py:99,116-117,128-129,145-146`:

fan_in_fan_out: bool = False, # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
# ...
if fan_in_fan_out:
    self.weight.data = self.weight.data.transpose(0, 1)
# ...
def T(w):
    return w.transpose(0, 1) if self.fan_in_fan_out else w
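The same helper keeps weight merging consistent: the LoRA delta `B @ A` comes out in `nn.Linear` orientation, so it is passed through `T` before being added to the stored weight. A minimal sketch of that bookkeeping (shapes are illustrative, not GPT-2's actual dimensions):

```python
import torch

def T(w, fan_in_fan_out=True):
    # transpose the delta into the stored (fan_in, fan_out) layout
    return w.transpose(0, 1) if fan_in_fan_out else w

nx, nf, r = 6, 10, 2
weight = torch.randn(nx, nf)   # stored Conv1D-style: (fan_in, fan_out)
lora_A = torch.randn(r, nx)    # maps fan_in -> r
lora_B = torch.randn(nf, r)    # maps r -> fan_out
scaling = 0.5

# B @ A is (nf, nx), i.e. nn.Linear orientation; T() flips it to (nx, nf)
# so the merged weight keeps the Conv1D layout.
merged = weight + T(lora_B @ lora_A) * scaling
assert merged.shape == weight.shape

x = torch.randn(3, nx)
# merged forward equals base forward plus the scaled LoRA update
ref = x @ weight + (x @ lora_A.t() @ lora_B.t()) * scaling
assert torch.allclose(x @ merged, ref, atol=1e-5)
```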
