Implementation:Hiyouga LLaMA Factory Visual Model Utils
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Multimodal Models |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Central registry and utility module for vision-language multimodal model (VLM) architectures in LLaMA-Factory.
Description
The visual module provides a unified framework for handling diverse multimodal model architectures within LLaMA-Factory. It defines the CompositeModel dataclass to describe VLM component structure (projector key, vision tower keys, language model keys, and LoRA conflict keys), and maintains a global COMPOSITE_MODELS registry populated via _register_composite_model. Over 25 VLM architectures are registered, including LLaVA, Qwen2-VL, Qwen3-VL, Gemma3, InternVL, MiniCPM-V, GLM-4V, Mistral3, and more. The module also provides a custom LlavaMultiModalProjectorForYiVL implementation for Yi-VL models and utility functions for freezing components, autocasting projector outputs, and filtering LoRA target modules.
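The registry pattern described above can be illustrated with a minimal, standalone sketch. It is not the module's actual code: the names CompositeModelSketch, REGISTRY_SKETCH, register_sketch, and the "toy_vl" model type are hypothetical, and the fallback values for language_model_keys and lora_conflict_keys are assumptions. The defaults for projector_key and vision_model_keys follow the I/O Contract table on this page.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class CompositeModelSketch:
    """Hypothetical stand-in for CompositeModel (no torch dependency)."""

    model_type: str
    projector_key: str
    vision_model_keys: list[str]
    language_model_keys: list[str]
    lora_conflict_keys: list[str]

    def get_projector(self, module):
        # Walk the dot-separated projector path one attribute at a time.
        for key in self.projector_key.split("."):
            module = getattr(module, key)
        return module


# Hypothetical global registry, keyed by model type.
REGISTRY_SKETCH: dict[str, CompositeModelSketch] = {}


def register_sketch(
    model_type: str,
    projector_key: Optional[str] = None,
    vision_model_keys: Optional[list[str]] = None,
    language_model_keys: Optional[list[str]] = None,
    lora_conflict_keys: Optional[list[str]] = None,
) -> None:
    REGISTRY_SKETCH[model_type] = CompositeModelSketch(
        model_type=model_type,
        projector_key=projector_key or "multi_modal_projector",
        vision_model_keys=vision_model_keys or ["vision_tower"],
        # The two fallbacks below are assumptions made for this sketch.
        language_model_keys=language_model_keys or ["language_model"],
        lora_conflict_keys=lora_conflict_keys or [],
    )


# One entry that relies on the defaults, one with explicit (illustrative) keys.
register_sketch("llava")
register_sketch(
    "toy_vl",
    projector_key="visual.merger",
    vision_model_keys=["visual.patch_embed", "visual.blocks"],
)

# A plain namespace stands in for a torch.nn.Module tree.
toy_model = SimpleNamespace(visual=SimpleNamespace(merger="merger-module"))
found = REGISTRY_SKETCH["toy_vl"].get_projector(toy_model)
```

Keeping the per-architecture details in one registry means the freezing and LoRA-filtering helpers can stay generic and only look up keys by config.model_type.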
Usage
This module is used internally by the adapter initialization code (init_adapter) and the model patcher. It exposes four entry points: get_forbidden_modules returns the components to freeze according to the finetuning arguments; patch_target_modules filters LoRA targets so that frozen VLM components are excluded; autocast_projector_dtype registers a forward hook that casts projector outputs during quantized VLM training; and configure_visual_model patches the model config and projector class before the model is loaded.
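How the freeze flags might translate into a set of forbidden module prefixes can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: FreezeFlags is a hypothetical stand-in for FinetuningArguments that carries only the two flags used in the usage example on this page, and TOY_REGISTRY with its "toy_vl" entry is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FreezeFlags:
    """Hypothetical stand-in for FinetuningArguments (two flags only)."""

    freeze_vision_tower: bool
    freeze_multi_modal_projector: bool


# Hypothetical registry entry; the key values are illustrative.
TOY_REGISTRY = {
    "toy_vl": {
        "projector_key": "visual.merger",
        "vision_model_keys": ["visual.patch_embed", "visual.blocks"],
    }
}


def forbidden_modules_sketch(model_type: str, flags: FreezeFlags) -> set[str]:
    """Collect module name prefixes that should not be trained."""
    forbidden: set[str] = set()
    entry = TOY_REGISTRY.get(model_type)
    if entry is None:
        # Unregistered (non-composite) models have nothing to forbid.
        return forbidden
    if flags.freeze_vision_tower:
        forbidden.update(entry["vision_model_keys"])
    if flags.freeze_multi_modal_projector:
        forbidden.add(entry["projector_key"])
    return forbidden


vision_only = forbidden_modules_sketch("toy_vl", FreezeFlags(True, False))
both = forbidden_modules_sketch("toy_vl", FreezeFlags(True, True))
```

Returning a set keeps the result free of duplicates when a key appears under more than one component.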
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/model/model_utils/visual.py
- Lines: 1-391
Signature
@dataclass
class CompositeModel:
    model_type: str
    projector_key: str
    vision_model_keys: list[str]
    language_model_keys: list[str]
    lora_conflict_keys: list[str]

    def get_projector(self, module: "torch.nn.Module") -> "torch.nn.Module": ...


COMPOSITE_MODELS: dict[str, "CompositeModel"] = {}


def _register_composite_model(
    model_type: str,
    projector_key: Optional[str] = None,
    vision_model_keys: Optional[list[str]] = None,
    language_model_keys: Optional[list[str]] = None,
    lora_conflict_keys: Optional[list[str]] = None,
) -> None: ...


def autocast_projector_dtype(
    model: "PreTrainedModel",
    model_args: "ModelArguments",
) -> None: ...


def configure_visual_model(config: "PretrainedConfig") -> None: ...


def get_forbidden_modules(
    config: "PretrainedConfig",
    finetuning_args: "FinetuningArguments",
) -> set[str]: ...


def patch_target_modules(
    model: "PreTrainedModel",
    finetuning_args: "FinetuningArguments",
    target_modules: list[str],
) -> list[str]: ...


class LlavaMultiModalProjectorForYiVL(torch.nn.Module):
    def __init__(self, config: "LlavaConfig") -> None: ...

    def forward(self, image_features: "torch.Tensor") -> "torch.Tensor": ...
Import
from llamafactory.model.model_utils.visual import (
    COMPOSITE_MODELS,
    get_forbidden_modules,
    patch_target_modules,
    autocast_projector_dtype,
    configure_visual_model,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_type | str | Yes (registration) | Model type identifier matching config.model_type (e.g., "llava", "qwen2_vl") |
| projector_key | str | No (default: "multi_modal_projector") | Dot-separated path to the multimodal projector module |
| vision_model_keys | list[str] | No (default: ["vision_tower"]) | Module name prefixes for vision components |
| language_model_keys | list[str] | No | Module name prefixes for language model components |
| lora_conflict_keys | list[str] | No | Module name keys that conflict with LoRA targeting |
| config | PretrainedConfig | Yes (functions) | Model configuration for detecting VLM type |
| finetuning_args | FinetuningArguments | Yes (functions) | Controls which components to freeze |
| target_modules | list[str] | Yes (patch_target_modules) | Candidate LoRA target module names to filter |
| model | PreTrainedModel | Yes (patch_target_modules, autocast_projector_dtype) | Loaded model whose modules are filtered or hooked |
| model_args | ModelArguments | Yes (autocast_projector_dtype) | Model arguments used when casting projector outputs |
Outputs
| Name | Type | Description |
|---|---|---|
| get_forbidden_modules | set[str] | Set of module name prefixes that should not be trained |
| patch_target_modules | list[str] | Filtered list of fully-qualified module names safe for LoRA targeting |
| autocast_projector_dtype | None | Side effect: registers forward hook on projector to cast output dtype |
| configure_visual_model | None | Side effect: patches model config and projector class before loading |
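The filtering that produces the patch_target_modules output can be sketched on plain module names. This is a simplified illustration, not the library's code: filter_lora_targets is a hypothetical helper, the module names are invented, and plain substring matching is an assumption about how names are compared.

```python
def filter_lora_targets(
    module_names: list[str],
    target_modules: list[str],
    forbidden_modules: set[str],
) -> list[str]:
    """Keep fully-qualified names that match a target and no forbidden key."""
    kept = []
    for name in module_names:
        is_target = any(target in name for target in target_modules)
        is_forbidden = any(blocked in name for blocked in forbidden_modules)
        if is_target and not is_forbidden:
            kept.append(name)
    return kept


# Invented module names, shaped like the output of model.named_modules().
names = [
    "visual.blocks.0.attn.q_proj",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.up_proj",
]

kept = filter_lora_targets(
    names,
    target_modules=["q_proj", "v_proj"],
    forbidden_modules={"visual.patch_embed", "visual.blocks"},
)
```

In this sketch the short target names expand to fully-qualified names, so a q_proj inside the frozen vision tower is dropped while the language model's q_proj and v_proj are kept.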
Usage Examples
from llamafactory.model.model_utils.visual import (
    COMPOSITE_MODELS,
    get_forbidden_modules,
    patch_target_modules,
)
from llamafactory.hparams import FinetuningArguments

# Check if a model type is a registered composite VLM
model_type = "qwen2_vl"
if model_type in COMPOSITE_MODELS:
    composite = COMPOSITE_MODELS[model_type]
    print(f"Projector: {composite.projector_key}")
    print(f"Vision keys: {composite.vision_model_keys}")

# Get modules to freeze for VLM training
finetuning_args = FinetuningArguments(
    freeze_vision_tower=True,
    freeze_multi_modal_projector=False,
)

# `model` is assumed to be an already loaded PreTrainedModel
# whose config.model_type is "qwen2_vl"; loading it is not shown here.
forbidden = get_forbidden_modules(model.config, finetuning_args)
# Returns: {"visual.patch_embed", "visual.blocks"}

# Filter LoRA targets to exclude frozen modules
filtered_targets = patch_target_modules(model, finetuning_args, ["q_proj", "v_proj"])