Principle:Unslothai Unsloth Vision LoRA Injection
| Knowledge Sources | |
|---|---|
| Domains | Vision, NLP, Parameter_Efficient_Finetuning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A parameter-efficient fine-tuning technique that selectively injects LoRA adapters into the vision encoder, the language decoder, or both, in vision-language models (VLMs).
Description
Vision LoRA injection extends the standard LoRA principle to multimodal architectures. The key distinction is selective layer targeting: VLMs have separate vision and language towers, and the practitioner must choose which components to adapt:
- Vision Layers: Applying LoRA to the vision encoder (e.g., ViT attention/MLP) for learning new visual representations.
- Language Layers: Applying LoRA to the language decoder for learning new text generation behaviors.
- Attention vs. MLP: Fine-grained control over whether LoRA targets attention projections, MLP layers, or both.
The get_peft_regex utility automatically detects which layers belong to vision vs. language towers and generates the appropriate PEFT target module regex based on the user's preferences.
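The selection logic described above can be sketched as a small regex builder. This is an illustrative sketch only, not the actual get_peft_regex implementation: the function name build_target_regex, the tower name fragments (vision, visual, language, text), and the projection names are all assumptions, and real checkpoints may name their modules differently. The sketch relies on PEFT accepting a regex string as target_modules, which it matches against full module names.

```python
import re

def build_target_regex(
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
):
    """Hypothetical helper: build a regex that selects LoRA target modules."""
    # Assumed name fragments that identify each tower in a module path.
    towers = []
    if finetune_vision_layers:
        towers += ["vision", "visual"]
    if finetune_language_layers:
        towers += ["language", "text"]

    # Assumed name fragments that identify attention vs. MLP sub-blocks.
    components = []
    if finetune_attention_modules:
        components += ["attn", "attention"]
    if finetune_mlp_modules:
        components += ["mlp", "feed_forward"]

    if not towers or not components:
        raise ValueError("Select at least one tower and one component type.")

    # Assumed names of the linear layers that receive adapters.
    projections = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj", "fc1", "fc2",
    ]
    tower_pat = "|".join(towers)
    comp_pat = "|".join(components)
    proj_pat = "|".join(projections)
    # Tower fragment, then component fragment, then a final projection name.
    return rf".*(?:{tower_pat}).*(?:{comp_pat}).*\.(?:{proj_pat})$"


# Example module names (made up for illustration).
module_names = [
    "model.vision_tower.layers.0.self_attn.q_proj",
    "model.vision_tower.layers.0.mlp.fc1",
    "model.language_model.layers.0.self_attn.v_proj",
    "model.language_model.layers.0.mlp.up_proj",
]

# Vision-only adaptation: language modules are filtered out.
vision_only = build_target_regex(finetune_language_layers=False)
selected = [n for n in module_names if re.fullmatch(vision_only, n)]
print(selected)
```

Expressing the selection as a single regex keeps the choice of tower and the choice of component type independent, so each of the four flags can be toggled without enumerating every module name by hand.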
Usage
Apply this principle when fine-tuning vision-language models. Set finetune_vision_layers=True to adapt the vision encoder, which is necessary for tasks that require new visual understanding, such as OCR on new fonts. Set finetune_language_layers=True to adapt text generation. Both can be enabled simultaneously.
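As a rough illustration of these settings, the flags can be collected into a keyword-argument dictionary. Only finetune_vision_layers and finetune_language_layers come from the text above. The flag names finetune_attention_modules and finetune_mlp_modules, the rank and alpha values, and the commented-out FastVisionModel.get_peft_model call are assumptions that should be verified against the installed Unsloth version.

```python
# Keyword arguments for a vision-language LoRA setup.
# The two tower flags are documented above; everything else is an assumption.
lora_kwargs = dict(
    finetune_vision_layers=True,      # adapt the vision encoder
    finetune_language_layers=True,    # adapt the language decoder
    finetune_attention_modules=True,  # assumed flag name
    finetune_mlp_modules=True,        # assumed flag name
    r=16,                             # illustrative LoRA rank
    lora_alpha=16,                    # illustrative scaling factor
)

# Variant for text-only adaptation: leave the vision encoder frozen.
text_only_kwargs = {**lora_kwargs, "finetune_vision_layers": False}

# Assumed call shape (not executed here):
# from unsloth import FastVisionModel
# model = FastVisionModel.get_peft_model(model, **lora_kwargs)
```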
Theoretical Basis
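The LoRA mathematics are identical to text-only LoRA (see LoRA_Adapter_Injection): each targeted weight W_frozen receives a low-rank update (alpha/r) * B @ A, where A and B are the trainable rank-r factors and alpha is a scaling hyperparameter. The only difference is that the update is applied selectively: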
# Abstract selective LoRA for VLMs (pseudocode)
target_modules = []
if finetune_vision_layers:
    target_modules += vision_encoder.attention_and_mlp_layers
if finetune_language_layers:
    target_modules += language_decoder.attention_and_mlp_layers
# Apply LoRA only to selected targets
for layer in target_modules:
    # Effective weight: W_frozen stays fixed; only A and B are trained
    layer.weight = W_frozen + (alpha/r) * B @ A
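The update rule can be checked numerically on a tiny example. The sketch below uses plain Python lists rather than a tensor library. The helper names (matmul, lora_effective_weight, lora_param_count) and all matrix values are made up for illustration. It also assumes the standard LoRA initialization in which B starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # frozen weight, shape 2 x 3
A = [[1.0, 2.0, 3.0]]          # shape r x d_in with r = 1
B_init = [[0.0], [0.0]]        # zero init: no change at the start of training
B_trained = [[0.5], [-1.0]]    # made-up values standing in for a trained B

print(lora_effective_weight(W, A, B_init, alpha=2))     # equals W
print(lora_effective_weight(W, A, B_trained, alpha=2))  # W plus rank-1 update

# Parameter savings for one 4096 x 4096 projection at rank 16.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 16)
print(full, lora, lora / full)
```

For the 4096 x 4096 case, the adapter trains 131,072 parameters against 16,777,216 in the full matrix, which is under 1% of the original. The same ratio applies to every selected layer, whether it sits in the vision encoder or the language decoder.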