Principle:Ollama Ollama Model Adaptation
| Knowledge Sources | |
|---|---|
| Domains | Fine-Tuning, Model Adaptation |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Model Adaptation enables the integration of LoRA (Low-Rank Adaptation) and other adapter-based fine-tuning methods into the Ollama inference pipeline, allowing users to apply lightweight parameter modifications on top of base models without duplicating the full model weights.
Core Concepts
Low-Rank Adaptation (LoRA)
LoRA decomposes weight updates into two low-rank matrices A and B such that the adapted weight is W' = W + (alpha / r) * (B @ A), where W is the original pretrained weight of shape (d_out, d_in), r is the adapter rank, and A (r x d_in) and B (d_out x r) are much smaller matrices whose scaled product approximates the fine-tuning delta. Because only A and B are trained and stored, the per-weight parameter count drops from d_out * d_in to r * (d_in + d_out), which sharply reduces trainable parameters and storage while preserving the ability to specialize model behavior for specific tasks or domains.
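The merge formula can be illustrated with a minimal sketch in plain Python. The helper names (`matmul`, `merge_lora`) and the toy matrices are assumptions made for illustration, and the alpha / rank scaling follows the convention described under Scale Factor; none of this is Ollama's actual code.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha):
    """Return W' = W + (alpha / r) * (B @ A), where r is the adapter rank.

    Hypothetical helper: A has shape (r, d_in), B has shape (d_out, r).
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # shape (d_out, d_in), same as W
    return [
        [wv + scale * dv for wv, dv in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# A 2x3 base weight with a rank-1 adapter (toy values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # shape (1, 3)
B = [[0.5], [-0.5]]    # shape (2, 1)

merged = merge_lora(W, A, B, alpha=2.0)

# The adapter stores r * (d_in + d_out) values instead of d_out * d_in.
adapter_params = len(A) * len(A[0]) + len(B) * len(B[0])
full_params = len(W) * len(W[0])
```

At this toy size the saving is small (5 values versus 6), but it grows quickly with realistic dimensions because the adapter size is linear in d_in + d_out rather than in their product.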
Adapter Stacking
Multiple adapters can be conceptually stacked on a single base model. During inference, the adapter weights are either merged into the base model weights at load time (weight merging) or applied dynamically during the forward pass. Weight merging is simpler and adds no per-token cost once the model is loaded, while dynamic application keeps the base weights untouched and therefore allows switching adapters without reloading the base model.
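The two strategies compute the same result, which a small sketch can make concrete. The functions below are hypothetical and use toy values; each adapter is represented as an assumed `(A, B, scale)` tuple, and stacking is modeled as summing the scaled deltas.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def merge(w, adapters):
    """Fold every adapter delta into a copy of W (done once, at load time)."""
    merged = [row[:] for row in w]
    for a, b, scale in adapters:
        for i in range(len(w)):
            for j in range(len(w[0])):
                merged[i][j] += scale * sum(b[i][k] * a[k][j] for k in range(len(a)))
    return merged


def forward_dynamic(w, adapters, x):
    """Leave W untouched and add each low-rank path on every call."""
    y = matvec(w, x)
    for a, b, scale in adapters:
        low_rank = matvec(b, matvec(a, x))  # B @ (A @ x), never forms B @ A
        y = [yi + scale * di for yi, di in zip(y, low_rank)]
    return y


W = [[1.0, 2.0], [3.0, 4.0]]
adapters = [
    ([[1.0, -1.0]], [[2.0], [0.5]], 0.5),  # (A, B, scale), rank 1
    ([[0.0, 1.0]], [[1.0], [1.0]], 1.0),   # a second stacked adapter
]
x = [1.0, 3.0]

y_merged = matvec(merge(W, adapters), x)
y_dynamic = forward_dynamic(W, adapters, x)
```

In the dynamic variant, dropping or swapping an entry in `adapters` changes the output immediately, whereas the merged variant would need to rebuild the merged matrix from the original `W`.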
Adapter Conversion
Adapters trained in external frameworks (e.g., HuggingFace PEFT) must be converted to Ollama's internal format. The conversion process maps adapter tensor names to the base model's tensor naming convention, verifies rank and dimension compatibility, and packages the adapter weights alongside the necessary metadata for correct application during inference.
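The name-mapping and compatibility checks might look roughly like the sketch below. The source names follow the pattern PEFT commonly emits for Llama-style attention projections; the target naming scheme, the `PROJ_NAMES` table, and both helper functions are assumptions for illustration and are not taken from Ollama's converter.

```python
import re

# Illustrative mapping only; not Ollama's actual conversion table.
PROJ_NAMES = {
    "q_proj": "attn_q",
    "k_proj": "attn_k",
    "v_proj": "attn_v",
    "o_proj": "attn_output",
}
PATTERN = re.compile(r"layers\.(\d+)\.self_attn\.(\w+)\.lora_([AB])\.weight$")


def map_name(src):
    """Translate a PEFT-style tensor name into a hypothetical target name."""
    m = PATTERN.search(src)
    if m is None or m.group(2) not in PROJ_NAMES:
        raise ValueError(f"unrecognized adapter tensor: {src}")
    layer, proj, side = m.groups()
    return f"blk.{layer}.{PROJ_NAMES[proj]}.weight.lora_{side.lower()}"


def check_shapes(a_shape, b_shape, base_shape):
    """Verify A is (r, d_in), B is (d_out, r), base is (d_out, d_in); return r."""
    r, d_in = a_shape
    d_out, r_b = b_shape
    if r != r_b:
        raise ValueError(f"rank mismatch: A has {r}, B has {r_b}")
    if (d_out, d_in) != tuple(base_shape):
        raise ValueError(f"adapter targets {(d_out, d_in)}, base is {tuple(base_shape)}")
    return r


mapped = map_name("base_model.model.model.layers.3.self_attn.q_proj.lora_A.weight")
rank = check_shapes((8, 4096), (4096, 8), (4096, 4096))
```

Failing fast on an unrecognized name or a shape mismatch is a deliberate choice in this sketch: a silently skipped tensor would yield an adapter that loads but only partially applies.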
Scale Factor
The adapter's effect is controlled by a scale factor (alpha / rank) that modulates the magnitude of the low-rank update. This allows users to tune the strength of the adaptation without retraining, providing a continuous spectrum between the base model's behavior and the fully adapted behavior.
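As a rough illustration of that spectrum, the sketch below applies the same low-rank delta at three strengths. `lora_scale` and `blend` are hypothetical helpers, and `delta` stands in for a few entries of the product B @ A.

```python
def lora_scale(alpha, rank):
    """Scale factor applied to the low-rank update."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return alpha / rank


def blend(base, delta, alpha, rank):
    """Add the scaled delta to the base weights, element by element."""
    s = lora_scale(alpha, rank)
    return [w + s * d for w, d in zip(base, delta)]


base = [1.0, -2.0]
delta = [0.5, 0.25]  # stand-in for entries of B @ A

off = blend(base, delta, alpha=0, rank=8)      # scale 0.0: base model behavior
full = blend(base, delta, alpha=8, rank=8)     # scale 1.0: delta applied as trained
strong = blend(base, delta, alpha=16, rank=8)  # scale 2.0: delta amplified
```

Because the scale multiplies the whole delta, changing alpha after training only rescales the adaptation; it does not change the direction of the update that A and B encode.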
Implementation Notes
Adapter support in the Ollama codebase spans conversion and runtime. The converters for Llama-family and Gemma 2 adapters are in convert/convert_llama_adapter.go and convert/convert_gemma2_adapter.go, respectively. At runtime, adapter weights are loaded alongside the base model and merged into the model's weight tensors during the loading phase. The Modelfile ADAPTER directive specifies the adapter to apply, and the creation pipeline packages references to both the base model and the adapter into the resulting manifest.
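A minimal Modelfile using the ADAPTER directive might look like the following. The base model name and the adapter path are placeholders chosen for illustration; the adapter should have been trained against the same base model that FROM names.

```
# Base model the adapter was fine-tuned from (placeholder name)
FROM llama3.2

# Path to the adapter to apply (placeholder path)
ADAPTER ./my-lora-adapter
```

Building it with `ollama create my-adapted-model -f Modelfile` (where `my-adapted-model` is likewise a placeholder) would then produce a model whose manifest references both the base model and the adapter.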