Principle: Predibase LoRAX Dynamic LoRA Loading
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Finetuning, Model_Serving |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A runtime adapter loading mechanism that dynamically fetches, validates, and loads LoRA weight matrices into GPU memory on a per-request basis, with LRU caching for frequently used adapters.
Description
Dynamic LoRA Loading is the core innovation of LoRAX. Instead of deploying separate model instances for each fine-tuned adapter, a single base model serves multiple adapters by loading their low-rank weight matrices on demand.
The process involves:
- Download: Fetch adapter weights from HuggingFace Hub, S3, or local storage
- Validate: Check compatibility (rank, target modules) with the base model
- Load: Stack LoRA A and B matrices into GPU tensors, applying scaling factors
- Cache: Store in an LRU cache for reuse across requests
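The cache step above can be sketched as a simple LRU map keyed by adapter_id. This is a minimal illustration of the caching behavior, not the LoRAX implementation; the `AdapterCache` name is hypothetical:

```python
from collections import OrderedDict

class AdapterCache:
    """Minimal LRU cache for loaded adapters (illustrative sketch)."""

    def __init__(self, max_adapters: int):
        self.max_adapters = max_adapters
        self._cache = OrderedDict()  # adapter_id -> loaded weights

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as recently used
            return self._cache[adapter_id]
        return None  # cache miss: caller triggers download/validate/load

    def put(self, adapter_id, weights):
        self._cache[adapter_id] = weights
        self._cache.move_to_end(adapter_id)
        if len(self._cache) > self.max_adapters:
            self._cache.popitem(last=False)  # evict least recently used
```

In a real server, eviction would also free the adapter's GPU tensors; here it only drops the reference.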
The scaling factor is computed as lora_alpha / r (standard) or lora_alpha / sqrt(r) (rsLoRA).
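The two scaling rules can be expressed as a one-line helper (the function name and `use_rslora` flag are illustrative, mirroring the PEFT config field):

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scaling applied to the B.A product before it is added to the base weight."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r
```

Note that rsLoRA's 1/sqrt(r) keeps the update magnitude stable as rank grows, whereas the standard 1/r shrinks it.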
Usage
This principle is applied automatically when a request specifies an adapter_id. The first request for a new adapter triggers loading; subsequent requests hit the cache.
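A request that triggers this path typically carries the adapter id alongside the prompt. The sketch below builds such a JSON body; the exact field names follow the LoRAX REST API's `/generate` shape as I understand it, but verify them against your LoRAX version:

```python
import json

def build_generate_request(prompt: str, adapter_id: str) -> str:
    """Build a JSON body for a LoRAX /generate call targeting one adapter.

    Field names (inputs, parameters.adapter_id) are assumed from the
    LoRAX REST API; confirm against the deployed server's docs.
    """
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,  # first use triggers dynamic loading
            "max_new_tokens": 64,
        },
    })
```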
Theoretical Basis
LoRA decomposes weight updates as low-rank matrices:
W' = W + (α / r) · B · A
Where:
- W is the frozen base weight [d × d]
- A is the down-projection [r × d]
- B is the up-projection [d × r]
- r is the rank (typically 8-64)
- α is the scaling factor (lora_alpha)
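A quick worked check of the decomposition, using the PEFT convention in which A (down-projection) is [r, d] and B (up-projection) is [d, r], matching the pseudo-code's shape comments:

```python
import numpy as np

# W' = W + (alpha / r) * B @ A: the update has the same shape as W
# but rank at most r, which is what makes the adapter cheap to ship.
d, r, alpha = 16, 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))   # frozen base weight [d, d]
A = rng.normal(size=(r, d))   # down-projection    [r, d]
B = rng.normal(size=(d, r))   # up-projection      [d, r]

delta = (alpha / r) * (B @ A)  # [d, d], rank <= r
W_prime = W + delta
```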
The loading process stacks these matrices across all target layers:
Pseudo-code:
# LoRA weight loading (per adapter)
config = load_peft_config(adapter_id)
weights = load_safetensors(adapter_id)
scale = config.lora_alpha / config.r

all_lora_a, all_lora_b = [], []
for layer_id in range(num_layers):
    lora_a = weights[f"layer.{layer_id}.lora_A"]  # [r, d]
    lora_b = weights[f"layer.{layer_id}.lora_B"]  # [d, r]
    all_lora_a.append(lora_a)
    all_lora_b.append(lora_b * scale)  # fold scaling into B once, at load time

stacked_a = torch.stack(all_lora_a)  # [num_layers, r, d]
stacked_b = torch.stack(all_lora_b)  # [num_layers, d, r]
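Once stacked, a layer's adapter is a single indexed lookup at inference time. The sketch below shows the idea with plain matmuls; it is illustrative only, since LoRAX fuses this into custom batched GPU kernels rather than looping in Python:

```python
import numpy as np

def lora_forward(x, W, stacked_a, stacked_b, layer_id, scale=1.0):
    """Apply one layer's base projection plus its LoRA correction.

    x: [batch, d]; W: [d, d]; stacked_a: [num_layers, r, d];
    stacked_b: [num_layers, d, r]. If scaling was already folded into
    B at load time, pass scale=1.0.
    """
    lora_a = stacked_a[layer_id]  # [r, d]
    lora_b = stacked_b[layer_id]  # [d, r]
    # base output x W^T plus low-rank path: (x A^T) B^T, shape [batch, d]
    return x @ W.T + scale * (x @ lora_a.T) @ lora_b.T
```

Keeping the low-rank path as two thin matmuls (through the [batch, r] bottleneck) avoids ever materializing the dense [d, d] delta.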