
Principle:Microsoft LoRA LoRA Training Loop

From Leeroopedia


Knowledge Sources
Domains: Training, Optimization
Last Updated: 2026-02-10 05:00 GMT

Overview

The principle that LoRA fine-tuning uses a standard PyTorch training loop, with no special training procedures, frameworks, or custom backward passes required.

Description

One of LoRA's key design advantages is that it requires no modifications to the training loop. Once LoRA layers are in place and non-LoRA parameters are frozen, standard PyTorch training proceeds as usual: forward pass, loss computation, backward pass, and optimizer step. The optimizer automatically only updates parameters with requires_grad=True, which are exclusively the LoRA matrices (and optionally biases). This makes LoRA compatible with any existing PyTorch training infrastructure.

Usage

Use standard PyTorch training patterns after model preparation (LoRA layer replacement + parameter freezing). The only LoRA-specific consideration is filtering parameters when constructing the optimizer to avoid unnecessary memory allocation for frozen parameter groups.
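The preparation step can be sketched as follows. This is a minimal illustrative wrapper, not an API from any particular library: the `LoRALinear` class, rank `r`, and initialization scales here are assumptions chosen to show the freezing pattern.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper: y = W0 x + B (A x)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        # Freeze the pretrained weight and bias.
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only A and B remain trainable.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(16, 16))
trainable = sorted(n for n, p in layer.named_parameters() if p.requires_grad)
print(trainable)
```

After this preparation, everything downstream (optimizer, loop, checkpointing) is ordinary PyTorch.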

Theoretical Basis

Gradient Flow

During the backward pass, gradients flow through the entire computation graph, including the activations produced by the frozen pretrained weights. However, because requires_grad=False is set on the pretrained parameters, PyTorch's autograd engine does not compute or store parameter gradients for them. Only the LoRA matrices A and B accumulate gradients:

$\frac{\partial L}{\partial A} = B^{\top} \frac{\partial L}{\partial h} x^{\top}$

$\frac{\partial L}{\partial B} = \frac{\partial L}{\partial h} (Ax)^{\top}$

where L is the loss, h is the layer output, and x is the layer input.
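These formulas can be checked numerically against autograd. The sketch below (toy dimensions, a squared-error loss chosen for illustration) builds h = W0 x + B A x with W0 frozen and confirms that only A and B receive gradients, matching the expressions above with ∂L/∂h = 2h:

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
W0 = torch.randn(d, d, requires_grad=False)  # frozen pretrained weight
A = torch.randn(r, d, requires_grad=True)    # trainable LoRA factor
B = torch.randn(d, r, requires_grad=True)    # trainable LoRA factor
x = torch.randn(d, 1)

h = W0 @ x + B @ (A @ x)   # layer output with the LoRA update path
loss = h.pow(2).sum()
loss.backward()

assert W0.grad is None     # no gradient stored for the frozen weight

# Analytic gradients agree with autograd:
dL_dh = 2 * h
assert torch.allclose(A.grad, B.T @ dL_dh @ x.T)
assert torch.allclose(B.grad, dL_dh @ (A @ x).T)
```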

Memory Efficiency During Training

The training loop is memory-efficient for two reasons:

  • Optimizer states: Adam/AdamW allocates momentum and variance buffers only for the LoRA parameters (the tiny fraction with requires_grad=True)
  • Gradient storage: Gradients are computed and stored only for the LoRA parameters

This means the per-step memory footprint is dominated by the forward activations (which are the same as full fine-tuning) rather than optimizer states (which are dramatically reduced).
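The optimizer-state saving is easy to quantify. In this sketch (toy model sizes chosen for illustration), AdamW is handed only the two LoRA factors, so its state buffers, allocated lazily at the first step, cover roughly 2,000 parameters out of a model with over 130,000:

```python
import torch
import torch.nn as nn

# A stand-in "pretrained" model, fully frozen.
model = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 256))
for p in model.parameters():
    p.requires_grad_(False)

# Trainable LoRA factors (rank 4, illustrative).
lora_A = nn.Parameter(torch.randn(4, 256) * 0.01)
lora_B = nn.Parameter(torch.zeros(256, 4))

# The optimizer only ever sees the LoRA parameters, so its momentum
# and variance buffers are allocated for these alone.
opt = torch.optim.AdamW([lora_A, lora_B], lr=1e-3)
tracked = sum(p.numel() for g in opt.param_groups for p in g["params"])
total = tracked + sum(p.numel() for p in model.parameters())
print(f"optimizer tracks {tracked} of {total} parameters")
```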

Optimizer Configuration

The optimizer should be constructed with a parameter filter to receive only trainable parameters:

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=learning_rate
)

This filter ensures the optimizer does not allocate states for frozen parameters. Without this filter, PyTorch would still not update frozen parameters (since their gradients are None), but the optimizer would wastefully allocate state buffers for them.

Standard Loop Pattern

The training loop follows the canonical PyTorch pattern:

  1. Forward pass: Compute model output and loss
  2. Backward pass: Compute gradients via loss.backward()
  3. Optimizer step: Update LoRA parameters via optimizer.step()
  4. Zero gradients: Clear accumulated gradients via optimizer.zero_grad()

No LoRA-specific hooks, callbacks, or custom backward passes are needed.
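The four steps above can be sketched end to end. The toy model, LoRA factors, and hyperparameters below are illustrative assumptions; the loop itself is the canonical PyTorch pattern unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base = nn.Linear(8, 8)
base.weight.requires_grad_(False)   # freeze pretrained parameters
base.bias.requires_grad_(False)
A = nn.Parameter(torch.randn(2, 8) * 0.01)   # trainable LoRA factors
B = nn.Parameter(torch.zeros(8, 2))

optimizer = torch.optim.AdamW([A, B], lr=1e-2)
x = torch.randn(32, 8)
target = torch.randn(32, 8)

for step in range(5):
    out = base(x) + x @ A.T @ B.T          # 1. forward pass
    loss = nn.functional.mse_loss(out, target)
    loss.backward()                        # 2. backward pass
    optimizer.step()                       # 3. update LoRA parameters only
    optimizer.zero_grad()                  # 4. clear accumulated gradients

# The frozen weights were never touched by autograd or the optimizer.
assert base.weight.grad is None
```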

Related Pages

Implemented By
