
Principle:Predibase Lorax Dynamic LoRA Loading

From Leeroopedia


Knowledge Sources
Domains Parameter_Efficient_Finetuning, Model_Serving
Last Updated 2026-02-08 02:00 GMT

Overview

A runtime adapter loading mechanism that dynamically fetches, validates, and loads LoRA weight matrices into GPU memory on a per-request basis, with LRU caching for frequently used adapters.

Description

Dynamic LoRA Loading is the core innovation of LoRAX. Instead of deploying separate model instances for each fine-tuned adapter, a single base model serves multiple adapters by loading their low-rank weight matrices on demand.

The process involves:

  1. Download: Fetch adapter weights from HuggingFace Hub, S3, or local storage
  2. Validate: Check compatibility (rank, target modules) with the base model
  3. Load: Stack LoRA A and B matrices into GPU tensors, applying scaling factors
  4. Cache: Store in an LRU cache for reuse across requests
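
The caching step above can be sketched with a minimal LRU cache keyed by adapter ID. This is an illustrative sketch, not the actual LoRAX implementation; the `AdapterCache` class, its capacity, and its methods are assumptions for the example.

```python
from collections import OrderedDict

class AdapterCache:
    """Minimal LRU cache sketch for loaded adapter weights.
    Illustrative only; not the LoRAX implementation."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id not in self._cache:
            return None
        self._cache.move_to_end(adapter_id)  # mark as most recently used
        return self._cache[adapter_id]

    def put(self, adapter_id, weights):
        self._cache[adapter_id] = weights
        self._cache.move_to_end(adapter_id)
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
```

The first request for an adapter pays the download/validate/load cost and calls `put`; later requests hit `get` and skip straight to inference.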

The scaling factor is computed as lora_alpha / r (standard) or lora_alpha / sqrt(r) (rsLoRA).
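The two scaling variants can be expressed in a few lines; the helper function here is a hypothetical name for illustration, not a LoRAX API:

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the BA update: lora_alpha / r (standard LoRA)
    or lora_alpha / sqrt(r) (rsLoRA)."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r
```

For example, with lora_alpha = 16 and r = 8, standard LoRA scales the update by 2.0, while rsLoRA scales it by 16 / sqrt(8) ≈ 5.66.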

Usage

This principle is applied automatically when a request specifies an adapter_id. The first request for a new adapter triggers loading; subsequent requests hit the cache.
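A request that triggers dynamic loading might construct a payload like the one below. Field names follow the LoRAX generate-style API, but verify them against your deployed version; the adapter ID and helper function are hypothetical:

```python
import json

def build_generate_request(prompt: str, adapter_id: str) -> str:
    """Build a LoRAX-style /generate payload. The adapter_id parameter
    names the adapter to fetch and load on first use (sketch; field
    names assumed, check your LoRAX version's API docs)."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,   # e.g. a HuggingFace Hub repo ID
            "max_new_tokens": 64,
        },
    })
```

Sending this with a previously unseen `adapter_id` incurs the download/validate/load latency once; repeated requests reuse the cached adapter.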

Theoretical Basis

LoRA decomposes weight updates as low-rank matrices:

W′ = W + (α / r) · BA

Where:

  • W is the frozen base weight [d × d]
  • A is the down-projection [r × d]
  • B is the up-projection [d × r]
  • r is the rank (typically 8–64)
  • α is the lora_alpha scaling hyperparameter

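As a quick numeric check of the update rule, the sketch below applies W′ = W + (α/r)·BA with tiny matrices (d = 2, r = 1) in plain Python, using a naive matrix multiply so the shapes are explicit:

```python
def matmul(X, Y):
    """Naive matrix multiply, for illustration only."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight [d x d], d = 2
A = [[1.0, 2.0]]               # down-projection [r x d], r = 1
B = [[3.0], [4.0]]             # up-projection [d x r]
alpha, r = 8, 1
scale = alpha / r              # standard LoRA scaling

BA = matmul(B, A)              # low-rank update, [d x d]
W_prime = [[W[i][j] + scale * BA[i][j] for j in range(2)]
           for i in range(2)]
```

Note that BA is full-size [d × d] even though A and B together hold only 2·d·r parameters, which is why storing and loading adapters as separate A/B factors is cheap.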
The loading process stacks these matrices across all target layers:

Pseudo-code:

# LoRA weight loading
config = load_peft_config(adapter_id)
weights = load_safetensors(adapter_id)
all_lora_a, all_lora_b = [], []
for layer_id in range(num_layers):
    lora_a = weights[f"layer.{layer_id}.lora_A"]  # [r, d]
    lora_b = weights[f"layer.{layer_id}.lora_B"]  # [d, r]
    scale = lora_alpha / r
    all_lora_a.append(lora_a)
    all_lora_b.append(lora_b * scale)  # fold the scaling into B
stacked_a = torch.stack(all_lora_a)  # [num_layers, r, d]
stacked_b = torch.stack(all_lora_b)  # [num_layers, d, r]

Related Pages

Implemented By

Uses Heuristic
