
Principle:Ggml org Llama cpp Runtime LoRA Application

From Leeroopedia
Principle Name: Runtime LoRA Application
Workflow: LoRA_Adapter_Workflow
Step: 3 of 5 (CORE)
Domain: Inference-Time Adapter Application
Scope: Applying LoRA adapters dynamically at inference time without full model retraining

Overview

Description

Runtime LoRA application is the core mechanism by which llama.cpp applies LoRA adapter weights during inference. Rather than permanently modifying the base model, the adapter's low-rank matrices are loaded into memory alongside the model and their contributions are computed on-the-fly during each forward pass. This allows multiple adapters to be swapped or combined without reloading the base model.

The runtime approach treats the base model weights as immutable and computes the effective weight for each adapted layer as the sum of the base weight and the scaled low-rank product. This enables dynamic adapter switching, multi-adapter composition with independent scaling factors, and memory-efficient deployment.
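The effective-weight computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch of the math, not llama.cpp's actual implementation; the dimensions and values are arbitrary.

```python
import numpy as np

# Sketch of runtime LoRA: the base weight W0 stays frozen, and the
# scaled low-rank contribution is added on the fly each forward pass.
rank, d_in, d_out = 4, 16, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))    # down-projection
B = rng.standard_normal((d_out, rank))   # up-projection
alpha, scale = 8.0, 1.0
x = rng.standard_normal(d_in)

# Base path plus scaled low-rank path; B @ (A @ x) avoids ever
# materializing the full d_out x d_in update matrix.
y = W0 @ x + scale * (alpha / rank) * (B @ (A @ x))
```

Because the base weight is never modified, swapping adapters amounts to replacing `A`, `B`, `alpha`, and `scale` while `W0` stays in memory.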

Usage

Runtime LoRA application is used when:

  • Users want to switch between different fine-tuned behaviors without reloading the model
  • Multiple LoRA adapters need to be active simultaneously with different scales
  • Memory efficiency is critical and permanent merging is not desired
  • Experimentation with different adapter combinations is needed

Theoretical Basis

The mathematical foundation of runtime LoRA application follows the original LoRA formulation. For each adapted layer, the effective output is computed as:

output = base_weight @ x + scale * (alpha / rank) * (B @ A) @ x

Where:

  • base_weight is the frozen pre-trained weight matrix W_0
  • x is the input activation
  • A is the down-projection matrix of dimension (rank x input_dim)
  • B is the up-projection matrix of dimension (output_dim x rank)
  • rank (r) is the low-rank dimension
  • alpha is the scaling hyperparameter from training
  • scale is an additional user-controlled scaling factor applied at inference time

The ratio alpha / rank normalizes the adapter contribution, and the external scale parameter allows users to control the strength of adaptation at runtime. When scale is 0, the adapter has no effect; when scale is 1, the adapter is applied at its trained strength.
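The behavior of the scale parameter can be checked directly against this formulation. A minimal sketch (arbitrary dimensions, not llama.cpp code):

```python
import numpy as np

def adapted_output(x, W0, A, B, alpha, scale):
    """Base output plus the user-scaled, alpha/rank-normalized LoRA term."""
    rank = A.shape[0]
    return W0 @ x + scale * (alpha / rank) * (B @ (A @ x))

rng = np.random.default_rng(1)
W0 = rng.standard_normal((8, 8))
A = rng.standard_normal((2, 8))   # rank 2
B = rng.standard_normal((8, 2))
x = rng.standard_normal(8)

# scale = 0: output is identical to the base model.
base_only = adapted_output(x, W0, A, B, alpha=4.0, scale=0.0)
assert np.allclose(base_only, W0 @ x)
```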

For multiple simultaneous adapters, the computation extends to:

output = base_weight @ x + sum_i(scale_i * (alpha_i / rank_i) * (B_i @ A_i) @ x)

This linear superposition property means adapters can be combined additively, though the theoretical guarantees of each individual adapter's quality only hold when applied independently.
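The superposition formula extends the single-adapter sketch with a loop over adapters, each carrying its own rank, alpha, and scale. Again, this is an illustrative sketch rather than llama.cpp's implementation:

```python
import numpy as np

def multi_adapter_output(x, W0, adapters):
    """Sum the base output with each adapter's independently scaled term.

    adapters: list of (A, B, alpha, scale) tuples; ranks may differ.
    """
    y = W0 @ x
    for A, B, alpha, scale in adapters:
        y += scale * (alpha / A.shape[0]) * (B @ (A @ x))
    return y

rng = np.random.default_rng(2)
W0 = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
adapters = [
    (rng.standard_normal((2, 8)), rng.standard_normal((8, 2)), 4.0, 1.0),
    (rng.standard_normal((4, 8)), rng.standard_normal((8, 4)), 8.0, 0.5),
]
y = multi_adapter_output(x, W0, adapters)
```

Setting every adapter's scale to 0 recovers the base model exactly, which makes the composition easy to sanity-check.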

The key implementation insight is that the low-rank product B @ A can be computed efficiently because the intermediate dimension is small (rank r is typically 4-64), making the computation significantly cheaper than a full weight matrix modification.
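A rough multiply-add count makes the cost argument concrete. The estimate below is a back-of-the-envelope sketch for a single matrix-vector product, ignoring memory traffic and kernel overhead:

```python
def lora_flops(d_in, d_out, rank):
    # Low-rank path computed as B @ (A @ x):
    # A @ x costs rank * d_in multiply-adds, B @ (.) costs d_out * rank.
    return rank * d_in + d_out * rank

def full_flops(d_in, d_out):
    # Applying a full d_out x d_in weight update would cost this much.
    return d_in * d_out

# For square 4096-dim layers at rank 16, the low-rank path is
# 4096 * 4096 / (16 * 4096 + 4096 * 16) = 128x cheaper.
ratio = full_flops(4096, 4096) / lora_flops(4096, 4096, 16)
```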

In llama.cpp, the runtime mechanism works by:

  1. Loading the GGUF-format LoRA file and validating architecture compatibility with the base model
  2. Mapping adapter tensor names to base model tensor names
  3. During inference, computing the adapted output by adding the scaled low-rank contribution to each layer's computation
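Step 2, the tensor-name mapping, can be sketched as a suffix-stripping pass. The `.lora_a` / `.lora_b` suffix convention shown here is an assumption for illustration, as is the example tensor name; consult the GGUF adapter files themselves for the exact naming:

```python
def map_adapter_tensors(adapter_names):
    """Group adapter tensors by the base-model tensor they modify.

    Assumes (hypothetically) that each adapter tensor name is the base
    tensor name plus a ".lora_a" or ".lora_b" suffix.
    """
    pairs = {}
    for name in adapter_names:
        for suffix in (".lora_a", ".lora_b"):
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                pairs.setdefault(base, {})[suffix[1:]] = name
    return pairs

names = ["blk.0.attn_q.weight.lora_a", "blk.0.attn_q.weight.lora_b"]
mapping = map_adapter_tensors(names)
```

A mapping pass like this is also the natural place to fail fast when an adapter references a tensor the base model does not contain.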
