Principle: ggml-org llama.cpp LoRA Adapter Acquisition
| Field | Value |
|---|---|
| Principle Name | LoRA Adapter Acquisition |
| Workflow | LoRA_Adapter_Workflow |
| Step | 1 of 5 |
| Domain | Parameter-Efficient Fine-Tuning (PEFT) |
| Scope | Acquiring pre-trained LoRA adapter weights from external repositories |
Overview
Description
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the Transformer architecture. Instead of fine-tuning the full weight matrix W of dimension d x k, LoRA constrains the update to a low-rank decomposition W + delta_W = W + B * A, where B is a d x r matrix and A is an r x k matrix, with the rank r being much smaller than both d and k.
Acquiring LoRA adapters is the first step in the LoRA workflow within llama.cpp. Pre-trained LoRA adapters are typically distributed through model hubs such as HuggingFace, where they are stored in standard formats (safetensors or PyTorch .bin) alongside configuration metadata. These adapters encode the fine-tuning deltas that customize a base model for specific tasks such as instruction following, code generation, or domain-specific knowledge.
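As a concrete sketch of the acquisition step, the two standard adapter files can be fetched from a HuggingFace repository with the `huggingface_hub` client. The repository id shown in the usage comment is a placeholder, not a real adapter:

```python
from pathlib import Path

# The two files that make up a standard PEFT LoRA adapter distribution.
ADAPTER_FILES = ["adapter_config.json", "adapter_model.safetensors"]

def fetch_adapter(repo_id: str, dest: str = "adapters") -> list[Path]:
    """Download a LoRA adapter's config and weight files from the HuggingFace Hub."""
    # Requires `pip install huggingface_hub`; imported lazily so the module
    # loads even without the dependency installed.
    from huggingface_hub import hf_hub_download
    return [
        Path(hf_hub_download(repo_id=repo_id, filename=name, local_dir=dest))
        for name in ADAPTER_FILES
    ]

# Usage (placeholder repo id, for illustration only):
# paths = fetch_adapter("some-user/some-lora-adapter")
```

Adapters distributed with PyTorch weights ship `adapter_model.bin` instead of the safetensors file; the list above would need to be adjusted for such repositories.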
Usage
LoRA adapter acquisition is relevant when a user wants to:
- Apply a community-trained fine-tune to a base model without retraining from scratch
- Combine multiple specialized adapters for different capabilities
- Reduce storage and distribution costs by sharing small adapter files instead of full model weights
- Experiment with different fine-tuning configurations on the same base model
Theoretical Basis
The theoretical foundation of LoRA rests on the hypothesis that the weight updates during model adaptation have a low intrinsic rank. Given a pre-trained weight matrix W_0 in R^{d x k}, LoRA represents the update as:
W = W_0 + (alpha / r) * B * A
Where:
- W_0 is the frozen pre-trained weight matrix
- A in R^{r x k} is initialized from a random Gaussian distribution
- B in R^{d x r} is initialized to zero so that delta_W = B * A is zero at the start of training
- r is the rank of the decomposition (typically 4, 8, 16, 32, or 64)
- alpha is a scaling hyperparameter that controls the magnitude of the adaptation
The key insight is that during fine-tuning, the learned weight modifications tend to occupy a low-dimensional subspace. By constraining the update to rank r, LoRA achieves comparable performance to full fine-tuning while only adding a small number of trainable parameters (proportional to r * (d + k) instead of d * k).
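The parameter argument is easy to check numerically. The sketch below uses illustrative dimensions (a 4096 x 4096 projection adapted at rank 16, values not taken from any specific model) and NumPy to confirm both the parameter-count savings and that the zero-initialized B makes the initial update an exact no-op:

```python
import numpy as np

# Illustrative dimensions, not from a specific model.
d, k, r, alpha = 4096, 4096, 16, 32

full_params = d * k        # 16_777_216 trainable parameters for full fine-tuning
lora_params = r * (d + k)  # 131_072 for LoRA, under 1% of the full count
assert lora_params / full_params < 0.01

# Zero-initialized B makes delta_W = B @ A vanish, so training starts from W_0.
# Small matrices here purely to keep the demo cheap.
rng = np.random.default_rng(0)
d2, k2, r2 = 8, 6, 2
W0 = rng.standard_normal((d2, k2))
A = rng.standard_normal((r2, k2))   # Gaussian init
B = np.zeros((d2, r2))              # zero init
W = W0 + (alpha / r2) * (B @ A)
assert np.array_equal(W, W0)        # adaptation starts as an exact identity
```

The same arithmetic explains why adapter files are small enough to distribute casually: the A and B matrices for all adapted layers together are typically a few tens of megabytes, versus many gigabytes for full weights.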
A typical LoRA adapter distribution consists of:
- adapter_model.safetensors (or adapter_model.bin): Contains the learned A and B matrices for each adapted layer
- adapter_config.json: Contains metadata including the base model identifier, rank (r), alpha (lora_alpha), target modules, and other PEFT configuration
The adapter_config.json encodes critical parameters:
- "r": The rank of the low-rank decomposition
- "lora_alpha": The scaling hyperparameter; the update B * A is scaled by lora_alpha / r
- "base_model_name_or_path": Identifies the compatible base model
- "target_modules": Lists which weight matrices in the model have LoRA applied