Implementation: Unslothai Unsloth FastVisionModel From Pretrained
| Knowledge Sources | Details |
|---|---|
| Domains | Vision, NLP, Quantization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool from the Unsloth library for loading vision-language models with 4-bit quantization and optimized attention.
Description
FastVisionModel.from_pretrained loads VLMs (Qwen2-VL, Qwen2.5-VL, Llava, Pixtral, Gemma3) with BitsAndBytes 4-bit quantization. It returns both the patched multimodal model and an AutoProcessor for image/text preprocessing. Internally, FastVisionModel is a thin subclass of FastModel that delegates to FastBaseModel.from_pretrained in unsloth/models/vision.py. The vision encoder components are handled separately from the language decoder during quantization.
Usage
Call this as the first step in any vision-model fine-tuning workflow; it returns a processor (not a tokenizer) suitable for preparing multimodal datasets. A minimal end-to-end sketch follows.
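The sketch below shows that first step and what typically follows, assuming Unsloth's vision fine-tuning helpers FastVisionModel.get_peft_model and FastVisionModel.for_training (not documented on this page; verify the names and arguments against the installed version):
from unsloth import FastVisionModel

# Load the quantized VLM together with its multimodal processor
model, processor = FastVisionModel.from_pretrained(
    model_name="unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    load_in_4bit=True,
)

# Attach LoRA adapters; argument names assumed from Unsloth's vision API
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

# Assumed helper that enables gradient checkpointing / training mode
FastVisionModel.for_training(model)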
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/models/loader.py (L1373-1374) dispatches to unsloth/models/vision.py (L322-939)
Signature
class FastVisionModel(FastModel):
    pass  # Inherits from FastModel -> FastBaseModel

class FastBaseModel:
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        load_in_16bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        trust_remote_code = False,
        model_types = None,
        tokenizer_name = None,
        auto_model = AutoModelForVision2Seq,
        use_gradient_checkpointing = "unsloth",
        supports_sdpa = True,
        whisper_language = None,
        whisper_task = None,
        auto_config = None,
        offload_embedding = False,
        float32_mixed_precision = None,
        fast_inference = False,
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = False,
        unsloth_vllm_standby = False,
        **kwargs,
    ) -> Tuple[PreTrainedModel, AutoProcessor]:
        """
        Args:
            model_name: VLM model ID (e.g., "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit").
            load_in_4bit: Enable 4-bit quantization. Default True.
            auto_model: AutoModel class. Default AutoModelForVision2Seq.
            fast_inference: Enable vLLM (supported for Qwen2.5-VL, Gemma3).
        """
Import
from unsloth import FastVisionModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | VLM model ID from HuggingFace Hub |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| load_in_4bit | bool | No | Enable 4-bit quantization (default: True) |
| dtype | torch.dtype | No | Compute dtype (auto-selects if None) |
| fast_inference | bool | No | Enable vLLM for supported VLMs (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Patched VLM with optimized attention kernels |
| processor | AutoProcessor | Multimodal processor for image/text preprocessing |
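As a sketch of consuming these outputs, the returned processor builds a combined image/text batch; the chat-message schema below is the common Hugging Face multimodal format and the image path is illustrative, so details may differ per model family:
from PIL import Image

image = Image.open("example.jpg")  # illustrative input image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
# Render the chat template to a prompt string, then encode text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
batch = processor(images=image, text=prompt, return_tensors="pt")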
Usage Examples
Load Qwen2-VL for Fine-tuning
from unsloth import FastVisionModel

model, processor = FastVisionModel.from_pretrained(
    model_name="unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# processor handles both image and text preprocessing
# Use processor.tokenizer for text-only operations
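Switching the same model to generation is sketched below; FastVisionModel.for_inference is assumed from Unsloth's vision API (not documented on this page), and the prompt, image, and sampling settings are illustrative:
from PIL import Image

FastVisionModel.for_inference(model)  # assumed helper: puts the patched model in inference mode

image = Image.open("example.jpg")  # illustrative input image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])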