Implementation: Unslothai Unsloth FastVisionModel From Pretrained
| Knowledge Sources | Details |
|---|---|
| Domains | Vision, NLP, Quantization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool from the Unsloth library for loading vision-language models with 4-bit quantization and optimized attention.
Description
FastVisionModel.from_pretrained loads VLMs (Qwen2-VL, Qwen2.5-VL, Llava, Pixtral, Gemma3) with BitsAndBytes 4-bit quantization. It returns both the patched multimodal model and an AutoProcessor for image/text preprocessing. Internally, FastVisionModel is a thin subclass of FastModel that delegates to FastBaseModel.from_pretrained in unsloth/models/vision.py. The vision encoder components are handled separately from the language decoder during quantization.
Usage
Call this as the first step in any vision-model fine-tuning workflow; it returns a processor (not a tokenizer) suitable for preparing multimodal datasets. A minimal end-to-end sketch follows.
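The sketch below shows that first step and what typically follows, assuming Unsloth's vision fine-tuning helpers FastVisionModel.get_peft_model and FastVisionModel.for_training (not documented on this page; verify the names and arguments against the installed version):
from unsloth import FastVisionModel

# Load the quantized VLM together with its multimodal processor
model, processor = FastVisionModel.from_pretrained(
    model_name="unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    load_in_4bit=True,
)

# Attach LoRA adapters; argument names assumed from Unsloth's vision API
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    r=16,
    lora_alpha=16,
)

# Assumed helper that enables gradient checkpointing / training mode
FastVisionModel.for_training(model)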
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/models/loader.py (L1373-1374) dispatches to unsloth/models/vision.py (L322-939)
Signature
class FastVisionModel(FastModel):
    pass  # Inherits from FastModel -> FastBaseModel

class FastBaseModel:
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        load_in_16bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        trust_remote_code = False,
        model_types = None,
        tokenizer_name = None,
        auto_model = AutoModelForVision2Seq,
        use_gradient_checkpointing = "unsloth",
        supports_sdpa = True,
        whisper_language = None,
        whisper_task = None,
        auto_config = None,
        offload_embedding = False,
        float32_mixed_precision = None,
        fast_inference = False,
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = False,
        unsloth_vllm_standby = False,
        **kwargs,
    ) -> Tuple[PreTrainedModel, AutoProcessor]:
        """
        Args:
            model_name: VLM model ID (e.g., "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit").
            load_in_4bit: Enable 4-bit quantization. Default True.
            auto_model: AutoModel class. Default AutoModelForVision2Seq.
            fast_inference: Enable vLLM (supported for Qwen2.5-VL, Gemma3).
        """
Import
from unsloth import FastVisionModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | VLM model ID from HuggingFace Hub |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| load_in_4bit | bool | No | Enable 4-bit quantization (default: True) |
| dtype | torch.dtype | No | Compute dtype (auto-selects if None) |
| fast_inference | bool | No | Enable vLLM for supported VLMs (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Patched VLM with optimized attention kernels |
| processor | AutoProcessor | Multimodal processor for image/text preprocessing |
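As a sketch of consuming these outputs, the returned processor builds a combined image/text batch; the chat-message schema below is the common Hugging Face multimodal format and the image path is illustrative, so details may differ per model family:
from PIL import Image

image = Image.open("example.jpg")  # illustrative input image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
# Render the chat template to a prompt string, then encode text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
batch = processor(images=image, text=prompt, return_tensors="pt")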
Usage Examples
Load Qwen2-VL for Fine-tuning
from unsloth import FastVisionModel

model, processor = FastVisionModel.from_pretrained(
    model_name="unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# processor handles both image and text preprocessing
# Use processor.tokenizer for text-only operations
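Switching the same model to generation is sketched below; FastVisionModel.for_inference is assumed from Unsloth's vision API (not documented on this page), and the prompt, image, and sampling settings are illustrative:
from PIL import Image

FastVisionModel.for_inference(model)  # assumed helper: puts the patched model in inference mode

image = Image.open("example.jpg")  # illustrative input image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])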