
Principle:Ggml org Llama cpp Runtime LoRA Application

From Leeroopedia
Principle Name: Runtime LoRA Application
Workflow: LoRA_Adapter_Workflow
Step: 3 of 5 (CORE)
Domain: Inference-Time Adapter Application
Scope: Applying LoRA adapters dynamically at inference time without full model retraining

Overview

Description

Runtime LoRA application is the core mechanism by which llama.cpp applies LoRA adapter weights during inference. Rather than permanently modifying the base model, the adapter's low-rank matrices are loaded into memory alongside the model and their contributions are computed on-the-fly during each forward pass. This allows multiple adapters to be swapped or combined without reloading the base model.

The runtime approach treats the base model weights as immutable and computes the effective weight for each adapted layer as the sum of the base weight and the scaled low-rank product. This enables dynamic adapter switching, multi-adapter composition with independent scaling factors, and memory-efficient deployment.
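The effective-weight computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch of the math, not llama.cpp's actual implementation; the dimensions and values are arbitrary.

```python
import numpy as np

# Sketch of runtime LoRA: the base weight W0 stays frozen, and the
# scaled low-rank contribution is added on the fly each forward pass.
rank, d_in, d_out = 4, 16, 16
rng = np.random.default_rng(0)

W0 = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))    # down-projection
B = rng.standard_normal((d_out, rank))   # up-projection
alpha, scale = 8.0, 1.0
x = rng.standard_normal(d_in)

# Base path plus scaled low-rank path; B @ (A @ x) avoids ever
# materializing the full d_out x d_in update matrix.
y = W0 @ x + scale * (alpha / rank) * (B @ (A @ x))
```

Because the base weight is never modified, swapping adapters amounts to replacing `A`, `B`, `alpha`, and `scale` while `W0` stays in memory.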

Usage

Runtime LoRA application is used when:

  • Users want to switch between different fine-tuned behaviors without reloading the model
  • Multiple LoRA adapters need to be active simultaneously with different scales
  • Memory efficiency is critical and permanent merging is not desired
  • Experimentation with different adapter combinations is needed

Theoretical Basis

The mathematical foundation of runtime LoRA application follows the original LoRA formulation. For each adapted layer, the effective output is computed as:

output = base_weight @ x + scale * (alpha / rank) * (B @ A) @ x

Where:

  • base_weight is the frozen pre-trained weight matrix W_0
  • x is the input activation
  • A is the down-projection matrix of dimension (rank x input_dim)
  • B is the up-projection matrix of dimension (output_dim x rank)
  • rank (r) is the low-rank dimension
  • alpha is the scaling hyperparameter from training
  • scale is an additional user-controlled scaling factor applied at inference time

The ratio alpha / rank normalizes the adapter contribution, and the external scale parameter allows users to control the strength of adaptation at runtime. When scale is 0, the adapter has no effect; when scale is 1, the adapter is applied at its trained strength.
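The behavior of the scale parameter can be checked directly against this formulation. A minimal sketch (arbitrary dimensions, not llama.cpp code):

```python
import numpy as np

def adapted_output(x, W0, A, B, alpha, scale):
    """Base output plus the user-scaled, alpha/rank-normalized LoRA term."""
    rank = A.shape[0]
    return W0 @ x + scale * (alpha / rank) * (B @ (A @ x))

rng = np.random.default_rng(1)
W0 = rng.standard_normal((8, 8))
A = rng.standard_normal((2, 8))   # rank 2
B = rng.standard_normal((8, 2))
x = rng.standard_normal(8)

# scale = 0: output is identical to the base model.
base_only = adapted_output(x, W0, A, B, alpha=4.0, scale=0.0)
assert np.allclose(base_only, W0 @ x)
```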

For multiple simultaneous adapters, the computation extends to:

output = base_weight @ x + sum_i(scale_i * (alpha_i / rank_i) * (B_i @ A_i) @ x)

This linear superposition property means adapters can be combined additively, though the theoretical guarantees of each individual adapter's quality only hold when applied independently.
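The superposition formula extends the single-adapter sketch with a loop over adapters, each carrying its own rank, alpha, and scale. Again, this is an illustrative sketch rather than llama.cpp's implementation:

```python
import numpy as np

def multi_adapter_output(x, W0, adapters):
    """Sum the base output with each adapter's independently scaled term.

    adapters: list of (A, B, alpha, scale) tuples; ranks may differ.
    """
    y = W0 @ x
    for A, B, alpha, scale in adapters:
        y += scale * (alpha / A.shape[0]) * (B @ (A @ x))
    return y

rng = np.random.default_rng(2)
W0 = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
adapters = [
    (rng.standard_normal((2, 8)), rng.standard_normal((8, 2)), 4.0, 1.0),
    (rng.standard_normal((4, 8)), rng.standard_normal((8, 4)), 8.0, 0.5),
]
y = multi_adapter_output(x, W0, adapters)
```

Setting every adapter's scale to 0 recovers the base model exactly, which makes the composition easy to sanity-check.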

The key implementation insight is that the low-rank product B @ A can be computed efficiently because the intermediate dimension is small (rank r is typically 4-64), making the computation significantly cheaper than a full weight matrix modification.
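A rough multiply-add count makes the cost argument concrete. The estimate below is a back-of-the-envelope sketch for a single matrix-vector product, ignoring memory traffic and kernel overhead:

```python
def lora_flops(d_in, d_out, rank):
    # Low-rank path computed as B @ (A @ x):
    # A @ x costs rank * d_in multiply-adds, B @ (.) costs d_out * rank.
    return rank * d_in + d_out * rank

def full_flops(d_in, d_out):
    # Applying a full d_out x d_in weight update would cost this much.
    return d_in * d_out

# For square 4096-dim layers at rank 16, the low-rank path is
# 4096 * 4096 / (16 * 4096 + 4096 * 16) = 128x cheaper.
ratio = full_flops(4096, 4096) / lora_flops(4096, 4096, 16)
```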

In llama.cpp, the runtime mechanism works by:

  1. Loading the GGUF-format LoRA file and validating architecture compatibility with the base model
  2. Mapping adapter tensor names to base model tensor names
  3. During inference, computing the adapted output by adding the scaled low-rank contribution to each layer's computation
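Step 2, the tensor-name mapping, can be sketched as a suffix-stripping pass. The `.lora_a` / `.lora_b` suffix convention shown here is an assumption for illustration, as is the example tensor name; consult the GGUF adapter files themselves for the exact naming:

```python
def map_adapter_tensors(adapter_names):
    """Group adapter tensors by the base-model tensor they modify.

    Assumes (hypothetically) that each adapter tensor name is the base
    tensor name plus a ".lora_a" or ".lora_b" suffix.
    """
    pairs = {}
    for name in adapter_names:
        for suffix in (".lora_a", ".lora_b"):
            if name.endswith(suffix):
                base = name[: -len(suffix)]
                pairs.setdefault(base, {})[suffix[1:]] = name
    return pairs

names = ["blk.0.attn_q.weight.lora_a", "blk.0.attn_q.weight.lora_b"]
mapping = map_adapter_tensors(names)
```

A mapping pass like this is also the natural place to fail fast when an adapter references a tensor the base model does not contain.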
