Implementation:Hiyouga LLaMA Factory Visual Model Utils

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Multimodal Models
Last Updated	2026-02-06 19:00 GMT

Overview

Central registry and utility module for vision-language model (VLM) architectures in LLaMA-Factory.

Description

The visual module provides a unified framework for handling diverse multimodal model architectures within LLaMA-Factory. It defines the CompositeModel dataclass to describe VLM component structure (projector key, vision tower keys, language model keys, and LoRA conflict keys), and maintains a global COMPOSITE_MODELS registry populated via _register_composite_model. Over 25 VLM architectures are registered, including LLaVA, Qwen2-VL, Qwen3-VL, Gemma3, InternVL, MiniCPM-V, GLM-4V, Mistral3, and more. The module also provides a custom LlavaMultiModalProjectorForYiVL implementation for Yi-VL models and utility functions for freezing components, autocasting projector outputs, and filtering LoRA target modules.
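
The registry pattern described above can be sketched as follows. This is a minimal, torch-free illustration and not the module's actual code: the names CompositeModelSketch, REGISTRY, and register are hypothetical stand-ins, the language_model_keys fallback is an assumption, and the "visual.merger" projector key used for qwen2_vl is illustrative only. The projector and vision-tower fallbacks follow the defaults listed in the Inputs table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompositeModelSketch:
    """Hypothetical stand-in for CompositeModel: describes where a VLM's parts live."""

    model_type: str
    projector_key: str
    vision_model_keys: list[str]
    language_model_keys: list[str]
    lora_conflict_keys: list[str]


# Hypothetical stand-in for the global COMPOSITE_MODELS registry.
REGISTRY: dict[str, CompositeModelSketch] = {}


def register(
    model_type: str,
    projector_key: Optional[str] = None,
    vision_model_keys: Optional[list[str]] = None,
    language_model_keys: Optional[list[str]] = None,
    lora_conflict_keys: Optional[list[str]] = None,
) -> None:
    """Store one architecture description, filling in fallbacks for omitted fields."""
    REGISTRY[model_type] = CompositeModelSketch(
        model_type=model_type,
        projector_key=projector_key or "multi_modal_projector",
        vision_model_keys=vision_model_keys or ["vision_tower"],
        # Assumed fallback; the real default is not stated on this page.
        language_model_keys=language_model_keys or ["language_model"],
        lora_conflict_keys=lora_conflict_keys or [],
    )


# A LLaVA-style entry that relies entirely on the fallbacks.
register("llava")

# A Qwen2-VL-style entry; the projector key here is an illustrative assumption.
register(
    "qwen2_vl",
    projector_key="visual.merger",
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
)

print(sorted(REGISTRY))
```

Keeping one declarative entry per architecture means the freezing and LoRA-filtering helpers can stay generic and look up component names by config.model_type.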

Usage

This module is used internally by the adapter initialization code (init_adapter) and the model patcher. Each public function has a distinct role: get_forbidden_modules determines which components to freeze based on the finetuning arguments; patch_target_modules filters LoRA targets to exclude frozen VLM components; autocast_projector_dtype registers a forward hook on the projector for quantized VLM training; and configure_visual_model patches the model config and projector class before loading.
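
The interaction between freezing and LoRA target filtering can be illustrated with a small, torch-free sketch. It is written under stated assumptions and is not the module's implementation: forbidden_prefixes and filter_targets are hypothetical helpers, the substring matching rule is an assumption, and the module names are invented to resemble a Qwen2-VL-style layout.

```python
def forbidden_prefixes(
    vision_keys: list[str],
    projector_key: str,
    freeze_vision_tower: bool,
    freeze_projector: bool,
) -> set[str]:
    """Collect the component keys that must not receive trainable adapters."""
    forbidden: set[str] = set()
    if freeze_vision_tower:
        forbidden.update(vision_keys)
    if freeze_projector:
        forbidden.add(projector_key)
    return forbidden


def filter_targets(
    module_names: list[str],
    target_modules: list[str],
    forbidden: set[str],
) -> list[str]:
    """Keep fully-qualified names that match a target and avoid every forbidden key."""
    kept = []
    for name in module_names:
        matches_target = any(target in name for target in target_modules)
        is_forbidden = any(key in name for key in forbidden)
        if matches_target and not is_forbidden:
            kept.append(name)
    return kept


# Invented module names, for illustration only.
names = [
    "visual.blocks.0.attn.q_proj",
    "visual.merger.mlp.0",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.up_proj",
]

forbidden = forbidden_prefixes(
    vision_keys=["visual.patch_embed", "visual.blocks"],
    projector_key="visual.merger",
    freeze_vision_tower=True,
    freeze_projector=False,
)
filtered = filter_targets(names, ["q_proj", "v_proj"], forbidden)
print(filtered)
```

In this sketch the q_proj inside the vision tower is dropped because its name contains a forbidden key, while the language-model attention projections are kept.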

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/model/model_utils/visual.py
Lines: 1-391

Signature

@dataclass
class CompositeModel:
    model_type: str
    projector_key: str
    vision_model_keys: list[str]
    language_model_keys: list[str]
    lora_conflict_keys: list[str]
    def get_projector(self, module: "torch.nn.Module") -> "torch.nn.Module": ...

COMPOSITE_MODELS: dict[str, "CompositeModel"] = {}

def _register_composite_model(
    model_type: str,
    projector_key: Optional[str] = None,
    vision_model_keys: Optional[list[str]] = None,
    language_model_keys: Optional[list[str]] = None,
    lora_conflict_keys: Optional[list[str]] = None,
) -> None: ...

def autocast_projector_dtype(
    model: "PreTrainedModel",
    model_args: "ModelArguments",
) -> None: ...

def configure_visual_model(config: "PretrainedConfig") -> None: ...

def get_forbidden_modules(
    config: "PretrainedConfig",
    finetuning_args: "FinetuningArguments",
) -> set[str]: ...

def patch_target_modules(
    model: "PreTrainedModel",
    finetuning_args: "FinetuningArguments",
    target_modules: list[str],
) -> list[str]: ...

class LlavaMultiModalProjectorForYiVL(torch.nn.Module):
    def __init__(self, config: "LlavaConfig") -> None: ...
    def forward(self, image_features: "torch.Tensor") -> "torch.Tensor": ...

Import

from llamafactory.model.model_utils.visual import (
    COMPOSITE_MODELS,
    get_forbidden_modules,
    patch_target_modules,
    autocast_projector_dtype,
    configure_visual_model,
)

I/O Contract

Inputs

Name	Type	Required	Description
model_type	str	Yes (registration)	Model type identifier matching config.model_type (e.g., "llava", "qwen2_vl")
projector_key	str	No (default: "multi_modal_projector")	Dot-separated path to the multimodal projector module
vision_model_keys	list[str]	No (default: ["vision_tower"])	Module name prefixes for vision components
config	PretrainedConfig	Yes (functions)	Model configuration for detecting VLM type
finetuning_args	FinetuningArguments	Yes (functions)	Controls which components to freeze
target_modules	list[str]	Yes (patch_target_modules)	Candidate LoRA target module names to filter
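
Because projector_key is a dot-separated path, resolving it amounts to walking attributes from the model root. The sketch below shows one way to do that; resolve_dotted is a hypothetical helper, and the SimpleNamespace objects stand in for torch.nn.Module instances so the example runs without any dependencies.

```python
from types import SimpleNamespace


def resolve_dotted(root: object, dotted_key: str) -> object:
    """Follow a dot-separated attribute path, e.g. "visual.merger", from root."""
    obj = root
    for part in dotted_key.split("."):
        obj = getattr(obj, part)  # raises AttributeError if the path is wrong
    return obj


# Stand-in object tree; a real model would expose nested torch.nn.Module attributes.
merger = SimpleNamespace(name="merger")
toy_model = SimpleNamespace(visual=SimpleNamespace(merger=merger))

projector = resolve_dotted(toy_model, "visual.merger")
print(projector.name)
```

A single-segment key such as "multi_modal_projector" is handled by the same loop, since splitting it yields one attribute name.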

Outputs

Name	Type	Description
get_forbidden_modules	set[str]	Set of module name prefixes that should not be trained
patch_target_modules	list[str]	Filtered list of fully-qualified module names safe for LoRA targeting
autocast_projector_dtype	None	Side effect: registers forward hook on projector to cast output dtype
configure_visual_model	None	Side effect: patches model config and projector class before loading
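
The hook-based dtype cast listed for autocast_projector_dtype can be sketched with PyTorch's standard forward-hook mechanism. This is an illustration under assumptions rather than the module's code: make_cast_hook is a hypothetical helper, a plain torch.nn.Linear stands in for the projector, and float16 is an arbitrary target dtype chosen so the example runs on CPU.

```python
import torch


def make_cast_hook(target_dtype: torch.dtype):
    """Build a forward hook that casts a module's output tensor to target_dtype."""

    def hook(module: torch.nn.Module, args: tuple, output: torch.Tensor) -> torch.Tensor:
        # Returning a value from a forward hook replaces the module's output.
        return output.to(target_dtype)

    return hook


# Stand-in for a multimodal projector; weights are float32 by default.
projector = torch.nn.Linear(8, 8)
handle = projector.register_forward_hook(make_cast_hook(torch.float16))

features = torch.randn(2, 8)
cast_output = projector(features)  # hook is active, so the output is cast

handle.remove()
plain_output = projector(features)  # hook removed, output keeps the weight dtype

print(cast_output.dtype, plain_output.dtype)
```

A hook keeps the projector's own weights and forward code untouched, which is why it is a convenient way to align the projector output dtype with what downstream layers expect.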

Usage Examples

from llamafactory.model.model_utils.visual import (
    COMPOSITE_MODELS,
    get_forbidden_modules,
    patch_target_modules,
)
from llamafactory.hparams import FinetuningArguments

# Check if a model type is a registered composite VLM
model_type = "qwen2_vl"
if model_type in COMPOSITE_MODELS:
    composite = COMPOSITE_MODELS[model_type]
    print(f"Projector: {composite.projector_key}")
    print(f"Vision keys: {composite.vision_model_keys}")

# Get modules to freeze for VLM training.
# NOTE: `model` is assumed to be an already-loaded PreTrainedModel (e.g., Qwen2-VL).
finetuning_args = FinetuningArguments(
    freeze_vision_tower=True,
    freeze_multi_modal_projector=False,
)
forbidden = get_forbidden_modules(model.config, finetuning_args)
# For qwen2_vl with only the vision tower frozen, expect: {"visual.patch_embed", "visual.blocks"}

# Filter LoRA targets to exclude frozen modules; returns fully-qualified module names
filtered_targets = patch_target_modules(model, finetuning_args, ["q_proj", "v_proj"])
