Principle:Turboderp org Exllamav2 LoRA Adapter Loading

From Leeroopedia
Knowledge Sources
Domains: Fine_Tuning, Parameter_Efficient, Deep_Learning
Last Updated: 2026-02-15 00:00 GMT

Overview

Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large pretrained models by injecting trainable low-rank decomposition matrices into frozen model layers.

Description

Instead of fine-tuning all parameters of a pretrained model, LoRA freezes the original weights and introduces pairs of small trainable matrices A and B into targeted linear layers. The weight update is expressed as:

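W' = W + (alpha / r) * BA

where B is in R^{d x r} and A is in R^{r x k}, with rank r much smaller than min(d, k), and alpha / r is a scalar scaling factor fixed at training time. Only A and B are trained, so each adapted d x k layer has r(d + k) trainable parameters instead of d * k, while the frozen base weights W stay untouched and can be shared across adapters.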
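To make the size of the reduction concrete, the following minimal sketch counts trainable parameters for a single linear layer. The function name lora_param_counts and the layer sizes are illustrative assumptions, not values taken from any particular model or library.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k linear layer."""
    full = d * k            # full fine-tuning trains every entry of W
    lora = d * r + r * k    # LoRA trains only B (d x r) and A (r x k)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, full // lora)  # 16777216 131072 128
```

For this assumed 4096 x 4096 layer at rank 16, the adapter holds 128 times fewer trainable parameters than the layer itself.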

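At inference time, the LoRA adapter weights are loaded alongside the base model and applied during the forward pass. For each targeted linear layer, the output becomes:

output = Wx + scale * B(Ax)

Evaluating A first and then B keeps the intermediate result at size r, so the full d x k product BA never has to be materialized.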

The scaling factor is computed as alpha / r, where alpha is a hyperparameter set during LoRA training that controls the magnitude of the adaptation. An additional lora_scaling multiplier can be applied at load time to further adjust the adapter's influence on the model output.
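The forward pass described above can be sketched in dependency-free Python. This is an illustrative toy, not the ExLlamaV2 implementation: the helper names matvec and lora_forward and all matrix values are assumptions made for the example.

```python
def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute Wx + (alpha / r) * B(Ax) without forming the d x k product BA."""
    scale = alpha / r
    base = matvec(W, x)                 # frozen base path, length d
    low_rank = matvec(B, matvec(A, x))  # project down to r, then back up to d
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy layer with d = 2, k = 3, r = 1 (hypothetical values for illustration).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]   # r x k
B = [[0.5], [0.0]]      # d x r
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x, alpha=2.0, r=1))  # [7.0, 2.0]
```

Here Wx is [1.0, 2.0], Ax is [6.0], and B(Ax) is [3.0, 0.0]; with scale = 2.0 the adapter adds [6.0, 0.0] to the base output.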

Key properties of LoRA:

  • Parameter efficiency: Only the low-rank matrices A and B are stored per adapter, typically orders of magnitude smaller than the full model weights.
  • Composability: Multiple LoRA adapters can be loaded and potentially combined, enabling task-specific specialization without duplicating the base model.
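  • Low inference overhead: If the scaled product BA is merged into W ahead of time, the adapted layer costs the same as the original. When an adapter is instead applied at runtime, each targeted layer performs two extra small matrix multiplications whose cost grows with r.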
  • Format compatibility: LoRA adapters are commonly distributed in the HuggingFace PEFT format, with an adapter_config.json describing the architecture and adapter_model.safetensors containing the weight tensors.
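As a rough sketch of how the scaling factor can be derived from a PEFT-style adapter_config.json, the snippet below reads the r and lora_alpha fields used by that format. The function name read_lora_scale and the sample config contents are assumptions for illustration; this is not the loader of any specific library.

```python
import json
import tempfile
from pathlib import Path


def read_lora_scale(adapter_dir):
    """Read rank and alpha from adapter_config.json; return (r, alpha, alpha / r)."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    r = config["r"]
    alpha = config["lora_alpha"]
    return r, alpha, alpha / r


# Write a stand-in config so the sketch runs without a real adapter on disk.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"r": 16, "lora_alpha": 32, "target_modules": ["q_proj", "v_proj"]}
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(sample))
    print(read_lora_scale(tmp))  # (16, 32, 2.0)
```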

Usage

Use LoRA adapter loading when you want to apply a fine-tuned specialization (e.g., instruction following, domain-specific knowledge, style adaptation) to a base language model without modifying or duplicating the base model weights. This is especially useful when switching between multiple fine-tuned variants of the same base model.

Theoretical Basis

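The working hypothesis behind LoRA is that the weight updates learned during fine-tuning have a low intrinsic rank. Given a pretrained weight matrix W in R^{d x k}, LoRA parameterizes the update as a scaled low-rank product (the scalar alpha / r is defined with the forward pass below):

W' = W + (alpha / r) * BA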

where:
  B ∈ R^{d × r}   (initialized to zero)
  A ∈ R^{r × k}   (initialized from Gaussian)
  r << min(d, k)   (the rank, typically 8-64)
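One consequence of this initialization is that the update BA is exactly zero before training, so the adapted layer starts out identical to the base layer. A minimal sketch, with toy dimensions and a hypothetical matmul helper chosen only for illustration:

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


d, k, r = 4, 6, 2   # toy sizes, not taken from any real model
random.seed(0)
B = [[0.0] * r for _ in range(d)]                                    # zero init
A = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(r)]   # Gaussian init

delta = matmul(B, A)   # the d x k update BA
print(all(v == 0.0 for row in delta for v in row))  # True
```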

During the forward pass for a linear layer:

h = Wx + (alpha / r) * BAx

where:
  x     = input activations
  alpha = scaling hyperparameter (set during training)
  r     = rank of the decomposition

Dividing alpha by r keeps the magnitude of the adapter's contribution roughly comparable across different ranks, which reduces the need to retune other hyperparameters when r changes. A higher alpha relative to r amplifies the adapter's effect, while a lower ratio attenuates it.

At load time, an additional lora_scaling multiplier can be applied:

effective_scale = (alpha / r) * lora_scaling

This allows runtime control over how strongly the adapter influences generation without retraining.
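The effect of the load-time multiplier can be illustrated with a short sketch. The function name effective_scale and the alpha and r values are assumptions; only the formula itself comes from the text above.

```python
def effective_scale(alpha, r, lora_scaling=1.0):
    """Combine the trained scale alpha / r with a load-time multiplier."""
    return (alpha / r) * lora_scaling


# Hypothetical adapter trained with alpha = 32 and r = 16.
for s in (0.0, 0.5, 1.0):
    print(s, effective_scale(alpha=32, r=16, lora_scaling=s))
# 0.0 0.0
# 0.5 1.0
# 1.0 2.0
```

A lora_scaling of 0.0 removes the adapter's contribution entirely, 1.0 reproduces the scale used during training, and values in between blend toward the base model.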
