# Heuristic: Microsoft LoRA Fan-In/Fan-Out Transpose
| Knowledge Sources | |
|---|---|
| Domains | Debugging, LLMs |
| Last Updated | 2026-02-10 05:30 GMT |
## Overview
When adapting GPT-2 attention layers, set `fan_in_fan_out=True` because GPT-2 uses a custom Conv1D that stores weights transposed compared to standard `nn.Linear`.
## Description
GPT-2's original implementation uses a custom `Conv1D` class (not `nn.Conv1d`) that stores weight matrices as `(fan_in, fan_out)`, the transpose of PyTorch's standard `nn.Linear`, which stores weights as `(fan_out, fan_in)`. When replacing GPT-2's attention projection with LoRA's `Linear` or `MergedLinear`, the `fan_in_fan_out` flag must be set to `True`. This causes the LoRA layer to transpose the weight matrix once at initialization and to apply a transpose helper `T(w)` during forward passes and weight merging.
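The layout difference is easy to see by inspecting shapes. A minimal sketch (dimensions chosen to match GPT-2 small, where `nx = 768` and `c_attn` projects to `3 * 768 = 2304`; not loralib code):

```python
import torch
import torch.nn as nn

# Standard nn.Linear stores its weight as (out_features, in_features).
linear = nn.Linear(768, 2304)
print(tuple(linear.weight.shape))  # (2304, 768)

# GPT-2's Conv1D allocates the transpose, (fan_in, fan_out),
# i.e. torch.empty(nx, nf) in its __init__.
conv1d_weight = torch.empty(768, 2304)
print(tuple(conv1d_weight.shape))  # (768, 2304)

# The two layouts are transposes of each other, which is exactly the
# mismatch the fan_in_fan_out flag exists to handle.
assert linear.weight.shape == conv1d_weight.t().shape
```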
## Usage
Set `fan_in_fan_out=True` whenever applying LoRA to GPT-2 or any model that uses a Conv1D-style weight layout `(input_dim, output_dim)` instead of the standard Linear layout `(output_dim, input_dim)`. Forgetting this flag either raises a shape mismatch when loading the pretrained weights (whenever fan-in and fan-out differ) or, for square layers, silently produces incorrect outputs because the matrix multiplications use the wrong dimension ordering.
## The Insight (Rule of Thumb)
- Action: Set `fan_in_fan_out=True` in `lora.Linear` or `lora.MergedLinear` when the target layer stores weights as `(fan_in, fan_out)`.
- Value: Boolean flag, `True` for GPT-2 Conv1D layers, `False` for standard `nn.Linear`.
- Trade-off: None — this is a correctness requirement, not an optimization choice.
- Detection: Check if the original layer's weight shape is `(input_features, output_features)` instead of `(output_features, input_features)`.
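The detection check above can be sketched as a small helper (the function name `stores_fan_in_fan_out` is mine, not loralib's):

```python
import torch.nn as nn

def stores_fan_in_fan_out(module: nn.Module, in_features: int, out_features: int) -> bool:
    """Return True if module.weight uses the Conv1D layout (fan_in, fan_out).

    Caveat: when in_features == out_features the two layouts have the
    same shape, so they cannot be told apart by shape alone.
    """
    return tuple(module.weight.shape) == (in_features, out_features)

# nn.Linear(768, 2304) stores (2304, 768), so no transpose is needed:
print(stores_fan_in_fan_out(nn.Linear(768, 2304), 768, 2304))  # False
```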
## Reasoning
GPT-2's `Conv1D` class computes `x @ weight + bias` (with weight shape `(nx, nf)`), while PyTorch's `nn.Linear` computes `x @ weight.T + bias` (with weight shape `(nf, nx)`). The `fan_in_fan_out` flag ensures the LoRA layer correctly handles this transposition. Without it, the pretrained weights would be applied incorrectly, and the LoRA update BA would be added in the wrong orientation.
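A quick numeric check of this reasoning (a sketch with toy dimensions, not loralib code): transposing a Conv1D-layout weight once at initialization, which is what `fan_in_fan_out=True` does, makes a Linear-style forward reproduce the Conv1D forward exactly.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)        # batch of 4, fan_in = 3
w_conv = torch.randn(3, 5)   # Conv1D layout: (fan_in, fan_out)

# Conv1D forward: x @ weight, with no transpose.
y_conv = x @ w_conv

# fan_in_fan_out=True transposes the weight once at init...
w_linear = w_conv.transpose(0, 1)  # now (fan_out, fan_in)

# ...so the standard Linear forward, x @ weight.T, matches exactly:
y_linear = x @ w_linear.t()
assert torch.allclose(y_conv, y_linear)

# Without the transpose, x @ w_conv.t() would be a (4,3) @ (5,3)
# shape mismatch here, since fan_in != fan_out.
```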
## Code Evidence
GPT-2 Conv1D weight layout from `examples/NLG/src/model.py:67-80`:
```python
class Conv1D(nn.Module):
    def __init__(self, nf, nx):
        super(Conv1D, self).__init__()
        self.nf = nf
        w = torch.empty(nx, nf)  # NOTE: (fan_in, fan_out) layout
        nn.init.normal_(w, std=0.02)
        self.weight = Parameter(w)
        self.bias = Parameter(torch.zeros(nf))

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)  # x @ weight, NOT x @ weight.T
        x = x.view(*size_out)
        return x
```
LoRA MergedLinear with fan_in_fan_out=True in attention from `examples/NLG/src/model.py:94-102`:
```python
self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],
    fan_in_fan_out=True,
    merge_weights=False
)
```
Transpose helper function in LoRA Linear from `loralib/layers.py:99,116-117,128-129,145-146`:
```python
fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
# ...
if fan_in_fan_out:
    self.weight.data = self.weight.data.transpose(0, 1)
# ...
def T(w):
    return w.transpose(0, 1) if self.fan_in_fan_out else w
```
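Putting the pieces together, a sketch of how `T(w)` keeps weight merging consistent with the stored `(fan_in, fan_out)` layout (toy dimensions, and a free function in place of the method in `loralib/layers.py`):

```python
import torch

fan_in, fan_out, r, scaling = 3, 5, 2, 0.5
torch.manual_seed(0)

# Pretrained weight kept in the Conv1D layout (fan_in, fan_out),
# matching loralib's state after the transpose at __init__ time.
W = torch.randn(fan_in, fan_out)
A = torch.randn(r, fan_in)    # lora_A: (r, fan_in)
B = torch.randn(fan_out, r)   # lora_B: (fan_out, r)

def T(w, fan_in_fan_out=True):
    return w.transpose(0, 1) if fan_in_fan_out else w

# B @ A has shape (fan_out, fan_in); T(.) flips it into the stored
# layout before it is added to W, mirroring the merge step in loralib.
W_merged = W + T(B @ A) * scaling

# The merged single-matmul forward equals the two-path LoRA forward.
x = torch.randn(4, fan_in)
y_merged = x @ W_merged
y_two_path = x @ W + (x @ A.t() @ B.t()) * scaling
assert torch.allclose(y_merged, y_two_path, atol=1e-5)
```

Without the `T(.)` call the update `B @ A` would be added in the wrong orientation (or fail to broadcast at all when fan-in and fan-out differ), which is the bug this heuristic prevents.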