Principle: Predibase LoRAX Dynamic LoRA Loading
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Finetuning, Model_Serving |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A runtime adapter loading mechanism that dynamically fetches, validates, and loads LoRA weight matrices into GPU memory on a per-request basis, with LRU caching for frequently used adapters.
Description
Dynamic LoRA Loading is the core innovation of LoRAX. Instead of deploying separate model instances for each fine-tuned adapter, a single base model serves multiple adapters by loading their low-rank weight matrices on demand.
The process involves:
- Download: Fetch adapter weights from HuggingFace Hub, S3, or local storage
- Validate: Check compatibility (rank, target modules) with the base model
- Load: Stack LoRA A and B matrices into GPU tensors, applying scaling factors
- Cache: Store in an LRU cache for reuse across requests
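The cache step above can be sketched as a simple LRU map keyed by adapter_id. This is a minimal illustration of the caching behavior, not the LoRAX implementation; the `AdapterCache` name is hypothetical:

```python
from collections import OrderedDict

class AdapterCache:
    """Minimal LRU cache for loaded adapters (illustrative sketch)."""

    def __init__(self, max_adapters: int):
        self.max_adapters = max_adapters
        self._cache = OrderedDict()  # adapter_id -> loaded weights

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as recently used
            return self._cache[adapter_id]
        return None  # cache miss: caller triggers download/validate/load

    def put(self, adapter_id, weights):
        self._cache[adapter_id] = weights
        self._cache.move_to_end(adapter_id)
        if len(self._cache) > self.max_adapters:
            self._cache.popitem(last=False)  # evict least recently used
```

In a real server, eviction would also free the adapter's GPU tensors; here it only drops the reference.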
The scaling factor is computed as lora_alpha / r (standard) or lora_alpha / sqrt(r) (rsLoRA).
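The two scaling rules can be expressed as a one-line helper (the function name and `use_rslora` flag are illustrative, mirroring the PEFT config field):

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the B.A product before it is added to the base weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
```

Note that rsLoRA's 1/sqrt(r) keeps the update magnitude stable as rank grows, whereas the standard 1/r shrinks it.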
Usage
This principle is applied automatically when a request specifies an adapter_id. The first request for a new adapter triggers loading; subsequent requests hit the cache.
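A request that triggers this path typically carries the adapter id alongside the prompt. The sketch below builds such a JSON body; the exact field names follow the LoRAX REST API's `/generate` shape as I understand it, but verify them against your LoRAX version:

```python
import json

def build_generate_request(prompt: str, adapter_id: str) -> str:
    """Build a JSON body for a LoRAX /generate call targeting one adapter.

    Field names (inputs, parameters.adapter_id) are assumed from the
    LoRAX REST API; confirm against the deployed server's docs.
    """
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,  # first use triggers dynamic loading
            "max_new_tokens": 64,
        },
    })
```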
Theoretical Basis
LoRA decomposes weight updates as low-rank matrices:
W' = W + (α / r) · B · A
Where:
- W is the frozen base weight [d × d]
- A is the down-projection [r × d]
- B is the up-projection [d × r]
- r is the rank (typically 8-64)
- α is the scaling factor (lora_alpha)
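A quick worked check of the decomposition, using the PEFT convention in which A (down-projection) is [r, d] and B (up-projection) is [d, r], matching the pseudo-code's shape comments:

```python
import numpy as np

# W' = W + (alpha / r) * B @ A: the update has the same shape as W
# but rank at most r, which is what makes the adapter cheap to ship.
d, r, alpha = 16, 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))   # frozen base weight [d, d]
A = rng.normal(size=(r, d))   # down-projection    [r, d]
B = rng.normal(size=(d, r))   # up-projection      [d, r]

delta = (alpha / r) * (B @ A)  # [d, d], rank <= r
W_prime = W + delta
```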
The loading process stacks these matrices across all target layers:
Pseudo-code:
# LoRA weight loading (per adapter)
config = load_peft_config(adapter_id)
weights = load_safetensors(adapter_id)
scale = config.lora_alpha / config.r

all_lora_a, all_lora_b = [], []
for layer_id in range(num_layers):
    lora_a = weights[f"layer.{layer_id}.lora_A"]  # [r, d]
    lora_b = weights[f"layer.{layer_id}.lora_B"]  # [d, r]
    all_lora_a.append(lora_a)
    all_lora_b.append(lora_b * scale)  # fold scaling into B once, at load time

stacked_a = torch.stack(all_lora_a)  # [num_layers, r, d]
stacked_b = torch.stack(all_lora_b)  # [num_layers, d, r]
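Once stacked, a layer's adapter is a single indexed lookup at inference time. The sketch below shows the idea with plain matmuls; it is illustrative only, since LoRAX fuses this into custom batched GPU kernels rather than looping in Python:

```python
import numpy as np

def lora_forward(x, W, stacked_a, stacked_b, layer_id, scale=1.0):
    """Apply one layer's base projection plus its LoRA correction.

    x: [batch, d]; W: [d, d]; stacked_a: [num_layers, r, d];
    stacked_b: [num_layers, d, r]. If scaling was already folded into
    B at load time, pass scale=1.0.
    """
    lora_a = stacked_a[layer_id]  # [r, d]
    lora_b = stacked_b[layer_id]  # [d, r]
    # base output x W^T plus low-rank path: (x A^T) B^T, shape [batch, d]
    return x @ W.T + scale * (x @ lora_a.T) @ lora_b.T
```

Keeping the low-rank path as two thin matmuls (through the [batch, r] bottleneck) avoids ever materializing the dense [d, d] delta.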