Principle: ggml-org llama.cpp LoRA Adapter Acquisition
| Field | Value |
|---|---|
| Principle Name | LoRA Adapter Acquisition |
| Workflow | LoRA_Adapter_Workflow |
| Step | 1 of 5 |
| Domain | Parameter-Efficient Fine-Tuning (PEFT) |
| Scope | Acquiring pre-trained LoRA adapter weights from external repositories |
Overview
Description
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into each layer of the Transformer architecture. Instead of fine-tuning the full weight matrix W of dimension d x k, LoRA constrains the update to a low-rank decomposition W + delta_W = W + B * A, where B is a d x r matrix and A is an r x k matrix, with the rank r being much smaller than both d and k.
Acquiring LoRA adapters is the first step in the LoRA workflow within llama.cpp. Pre-trained LoRA adapters are typically distributed through model hubs such as HuggingFace, where they are stored in standard formats (safetensors or PyTorch .bin) alongside configuration metadata. These adapters encode the fine-tuning deltas that customize a base model for specific tasks such as instruction following, code generation, or domain-specific knowledge.
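As a concrete sketch of the acquisition step, the two standard adapter files can be fetched from a HuggingFace repository with the `huggingface_hub` client. The repository id shown in the usage comment is a placeholder, not a real adapter:

```python
from pathlib import Path

# The two files that make up a standard PEFT LoRA adapter distribution.
ADAPTER_FILES = ["adapter_config.json", "adapter_model.safetensors"]

def fetch_adapter(repo_id: str, dest: str = "adapters") -> list[Path]:
    """Download a LoRA adapter's config and weight files from the HuggingFace Hub."""
    # Requires `pip install huggingface_hub`; imported lazily so the module
    # loads even without the dependency installed.
    from huggingface_hub import hf_hub_download
    return [
        Path(hf_hub_download(repo_id=repo_id, filename=name, local_dir=dest))
        for name in ADAPTER_FILES
    ]

# Usage (placeholder repo id, for illustration only):
# paths = fetch_adapter("some-user/some-lora-adapter")
```

Adapters distributed with PyTorch weights ship `adapter_model.bin` instead of the safetensors file; the list above would need to be adjusted for such repositories.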
Usage
LoRA adapter acquisition is relevant when a user wants to:
- Apply a community-trained fine-tune to a base model without retraining from scratch
- Combine multiple specialized adapters for different capabilities
- Reduce storage and distribution costs by sharing small adapter files instead of full model weights
- Experiment with different fine-tuning configurations on the same base model
Theoretical Basis
The theoretical foundation of LoRA rests on the hypothesis that the weight updates during model adaptation have a low intrinsic rank. Given a pre-trained weight matrix W_0 in R^{d x k}, LoRA represents the update as:
W = W_0 + (alpha / r) * B * A
Where:
- W_0 is the frozen pre-trained weight matrix
- A in R^{r x k} is initialized from a random Gaussian distribution
- B in R^{d x r} is initialized to zero so that delta_W = B * A is zero at the start of training
- r is the rank of the decomposition (typically 4, 8, 16, 32, or 64)
- alpha is a scaling hyperparameter that controls the magnitude of the adaptation
The key insight is that during fine-tuning, the learned weight modifications tend to occupy a low-dimensional subspace. By constraining the update to rank r, LoRA achieves comparable performance to full fine-tuning while only adding a small number of trainable parameters (proportional to r * (d + k) instead of d * k).
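The parameter argument is easy to check numerically. The sketch below uses illustrative dimensions (a 4096 x 4096 projection adapted at rank 16, values not taken from any specific model) and NumPy to confirm both the parameter-count savings and that the zero-initialized B makes the initial update an exact no-op:

```python
import numpy as np

# Illustrative dimensions, not from a specific model.
d, k, r, alpha = 4096, 4096, 16, 32

full_params = d * k        # 16_777_216 trainable parameters for full fine-tuning
lora_params = r * (d + k)  # 131_072 for LoRA, under 1% of the full count
assert lora_params / full_params < 0.01

# Zero-initialized B makes delta_W = B @ A vanish, so training starts from W_0.
# Small matrices here purely to keep the demo cheap.
rng = np.random.default_rng(0)
d2, k2, r2 = 8, 6, 2
W0 = rng.standard_normal((d2, k2))
A = rng.standard_normal((r2, k2))   # Gaussian init
B = np.zeros((d2, r2))              # zero init
W = W0 + (alpha / r2) * (B @ A)
assert np.array_equal(W, W0)        # adaptation starts as an exact identity
```

The same arithmetic explains why adapter files are small enough to distribute casually: the A and B matrices for all adapted layers together are typically a few tens of megabytes, versus many gigabytes for full weights.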
A typical LoRA adapter distribution consists of:
- adapter_model.safetensors (or adapter_model.bin): Contains the learned A and B matrices for each adapted layer
- adapter_config.json: Contains metadata including the base model identifier, rank (r), alpha (lora_alpha), target modules, and other PEFT configuration
The adapter_config.json encodes critical parameters:
- "r": The rank of the low-rank decomposition
- "lora_alpha": The scaling hyperparameter; the update B * A is scaled by lora_alpha / r
- "base_model_name_or_path": Identifies the compatible base model
- "target_modules": Lists which weight matrices in the model have LoRA applied