Implementation:Hiyouga LLaMA Factory Multimodal Plugin
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Processing, Vision-Language Models |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A concrete multimodal plugin system, provided by LLaMA Factory, that preprocesses images, videos, and audio for 20+ vision-language and audio-language model architectures.
Description
This module implements the multimodal abstraction layer that enables LLaMA Factory to support a wide range of vision-language and audio-language models through a unified preprocessing interface. The architecture consists of:
- MMPluginMixin -- Base mixin defining token constants (image_token, video_token, audio_token), input validation, and common media processing methods (image resizing, video frame sampling, audio resampling)
- BasePlugin -- Default plugin that handles message placeholder injection, token ID expansion, and multimodal input tensor generation using standard HuggingFace processor pipelines
- 20+ Model-Specific Plugins -- Subclasses that override specific methods to match each model's expected input format:
  - Qwen2VLPlugin, Qwen3VLPlugin -- Handle mRoPE grid-based position IDs and video grid THW tensors
  - LlavaPlugin, LlavaNextPlugin, LlavaNextVideoPlugin -- Handle LLaVA-family image/video token expansion
  - InternVLPlugin -- Custom image token replacement with dynamic patch counting
  - MiniCPMVPlugin -- Special image bound and slice processing
  - MllamaPlugin -- Cross-attention mask generation for LLaMA-3.2 Vision
  - PaliGemmaPlugin, Gemma3Plugin -- Token type ID generation for loss masking
  - PixtralPlugin -- Image break and end tokens for Pixtral architecture
  - GLM4VPlugin -- GLM-4 vision-specific placeholder handling
  - Qwen2AudioPlugin, Qwen2OmniPlugin -- Audio feature processing
  - And more: ErnieVL, KimiVL, Llama4, VideoLlava, LFMVL, YoutuVL, Gemma3n
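The common media processing mentioned for MMPluginMixin (image resizing and video frame sampling) can be illustrated with a minimal sketch. The function names `fit_to_pixel_budget` and `sample_frame_indices` are hypothetical and are not part of mm_plugin.py; the sketch only assumes that resizing preserves aspect ratio under a pixel budget and that frames are sampled at even spacing.

```python
import math


def fit_to_pixel_budget(width: int, height: int, max_pixels: int, min_pixels: int) -> tuple[int, int]:
    """Scale (width, height) toward the pixel budget while keeping the aspect ratio.

    Hypothetical helper: sizes are truncated to integers, so the resulting
    area is only approximately equal to the target budget.
    """
    area = width * height
    if area > max_pixels:
        # Scaling both sides by sqrt(ratio) scales the area by ratio.
        factor = math.sqrt(max_pixels / area)
        width, height = int(width * factor), int(height * factor)
    elif area < min_pixels:
        factor = math.sqrt(min_pixels / area)
        width, height = int(width * factor), int(height * factor)
    return width, height


def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices that include the first and last frame."""
    num_samples = min(num_samples, total_frames)
    if num_samples <= 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

For example, a 2000x1000 image with a budget of 500,000 pixels is halved on each side, and sampling 4 frames from a 10-frame clip selects indices 0, 3, 6, and 9.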
Plugins follow a registry pattern: each class is registered under a name with register_mm_plugin, and get_mm_plugin looks up the correct plugin by that name at template registration time.
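A name-keyed registry of this kind can be sketched as follows. This is an illustration of the pattern only, not the code in mm_plugin.py: `ToyPlugin`, `ToyGridPlugin`, `register_toy_plugin`, and `get_toy_plugin` are hypothetical names, and the error handling is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPlugin:
    """Hypothetical stand-in for a base plugin that carries media tokens."""
    image_token: Optional[str] = None
    video_token: Optional[str] = None
    audio_token: Optional[str] = None


@dataclass
class ToyGridPlugin(ToyPlugin):
    """Hypothetical stand-in for a model-specific subclass."""


# Maps a plugin name to the class that should be instantiated for it.
_REGISTRY: dict[str, type[ToyPlugin]] = {}


def register_toy_plugin(name: str, plugin_cls: type[ToyPlugin]) -> None:
    """Register a plugin class under a unique name."""
    if name in _REGISTRY:
        raise ValueError(f"Plugin {name!r} is already registered.")
    _REGISTRY[name] = plugin_cls


def get_toy_plugin(name: str, **tokens: Optional[str]) -> ToyPlugin:
    """Look up a plugin class by name and instantiate it with its tokens."""
    if name not in _REGISTRY:
        raise ValueError(f"Plugin {name!r} not found.")
    return _REGISTRY[name](**tokens)


register_toy_plugin("base", ToyPlugin)
register_toy_plugin("toy_grid", ToyGridPlugin)
```

Keeping the lookup behind one function means a template only needs to store a plugin name and its media tokens; the model-specific class is resolved when the template is built.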
Usage
Multimodal plugins are instantiated during template creation via get_mm_plugin. They are called at three stages: (1) process_messages during data conversion to inject model-specific placeholder tokens, (2) process_token_ids during tokenization to expand media tokens to the correct sequence length, and (3) get_mm_inputs during collation to generate pixel values, grid sizes, and other multimodal tensors.
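The three stages can be pictured with a toy plugin. This is a simplified sketch under stated assumptions, not the BasePlugin implementation: the class `ToyImagePlugin`, the `<image>` placeholder string, the fixed `tokens_per_image` count, the reduced method signatures, and the use of -100 as the ignored label value are all assumptions made for illustration.

```python
from typing import Optional

IMAGE_PLACEHOLDER = "<image>"  # assumed generic placeholder in dataset text
IGNORE_INDEX = -100  # assumed label value that excludes a position from the loss


class ToyImagePlugin:
    """Hypothetical plugin showing the three call stages on images only."""

    def __init__(self, image_token: str, image_token_id: int, tokens_per_image: int) -> None:
        self.image_token = image_token
        self.image_token_id = image_token_id
        self.tokens_per_image = tokens_per_image

    def process_messages(self, messages: list[dict[str, str]], images: list) -> list[dict[str, str]]:
        """Stage 1: swap each generic placeholder for the model-specific token."""
        seen = sum(m["content"].count(IMAGE_PLACEHOLDER) for m in messages)
        if seen != len(images):
            raise ValueError(f"Got {len(images)} images but {seen} placeholders.")
        return [
            {**m, "content": m["content"].replace(IMAGE_PLACEHOLDER, self.image_token)}
            for m in messages
        ]

    def process_token_ids(
        self, input_ids: list[int], labels: Optional[list[int]] = None
    ) -> tuple[list[int], Optional[list[int]]]:
        """Stage 2: expand each image token ID to the full per-image length."""
        new_ids: list[int] = []
        new_labels: Optional[list[int]] = None if labels is None else []
        for pos, token_id in enumerate(input_ids):
            is_image = token_id == self.image_token_id
            repeat = self.tokens_per_image if is_image else 1
            new_ids.extend([token_id] * repeat)
            if new_labels is not None and labels is not None:
                # Media positions carry no text target, so they are masked.
                label = IGNORE_INDEX if is_image else labels[pos]
                new_labels.extend([label] * repeat)
        return new_ids, new_labels

    def get_mm_inputs(self, images: list) -> dict[str, list]:
        """Stage 3: build the extra model inputs (here, a plain nested list)."""
        return {"pixel_values": [list(image) for image in images]}
```

In this toy, the text stage and the token stage are deliberately separate: the first keeps the chat text readable with one token per image, and the second makes the sequence length match what the vision encoder would emit.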
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/data/mm_plugin.py
- Lines: 1-2241
Signature
@dataclass
class MMPluginMixin:
    image_token: str | None
    video_token: str | None
    audio_token: str | None
    expand_mm_tokens: bool = True

    def _validate_input(self, processor, images, videos, audios) -> None: ...
    def _validate_messages(self, messages, images, videos, audios) -> None: ...
    def _preprocess_image(self, image, image_max_pixels, image_min_pixels) -> "ImageObject": ...
    def _regularize_images(self, images, **kwargs) -> "RegularizedImageOutput": ...
    def _regularize_videos(self, videos, **kwargs) -> "RegularizedVideoOutput": ...
    def _regularize_audios(self, audios, sampling_rate, **kwargs) -> "RegularizedAudioOutput": ...
    def _get_mm_inputs(self, images, videos, audios, processor) -> dict[str, "torch.Tensor"]: ...


@dataclass
class BasePlugin(MMPluginMixin):
    def process_messages(self, messages, images, videos, audios, processor) -> list[dict[str, str]]: ...
    def process_token_ids(self, input_ids, labels, images, videos, audios, tokenizer, processor) -> tuple[list[int], list[int] | None]: ...
    def get_mm_inputs(self, images, videos, audios, imglens, vidlens, audlens, batch_ids, processor) -> dict[str, Any]: ...


def get_mm_plugin(name: str, image_token: str | None = None, ...) -> "BasePlugin": ...
def register_mm_plugin(name: str, mm_plugin: type["BasePlugin"]) -> None: ...
Import
from llamafactory.data.mm_plugin import get_mm_plugin, BasePlugin
I/O Contract
Inputs (process_messages)
| Name | Type | Required | Description |
|---|---|---|---|
| messages | list[dict[str, str]] | Yes | Chat messages with placeholders to be processed |
| images | list[ImageInput] | Yes | Image inputs (paths, bytes, PIL Images, or encoded dicts) |
| videos | list[VideoInput] | Yes | Video inputs (paths, file objects, or nested frame lists) |
| audios | list[AudioInput] | Yes | Audio inputs (paths, file objects, or numpy arrays) |
| processor | ProcessorMixin | Yes | HuggingFace processor with image_processor and feature_extractor |
Outputs (get_mm_inputs)
| Name | Type | Description |
|---|---|---|
| pixel_values | torch.Tensor | Processed image/video pixel values with model-specific shape |
| image_grid_thw | torch.Tensor | Grid dimensions for Qwen2VL-family models |
| cross_attention_mask | torch.Tensor | Cross-attention mask for Mllama models |
| input_features | torch.Tensor | Audio features for audio-language models |
| token_type_ids | list[list[int]] | Token type IDs for PaliGemma/Gemma3 loss masking |
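For grid-based models, the image_grid_thw output is what ties pixel inputs to the expanded token count. The sketch below assumes the Qwen2-VL convention that a grid of (t, h, w) patches is reduced by merging merge_size x merge_size neighboring patches into one token; the function name `expanded_image_tokens` is hypothetical, and the default merge_size of 2 is an assumption that should be checked against the actual processor configuration.

```python
def expanded_image_tokens(grid_thw: tuple[int, int, int], merge_size: int = 2) -> int:
    """Return how many placeholder tokens one image occupies in the sequence.

    grid_thw is (temporal, height, width) measured in patches. Each block of
    merge_size x merge_size spatial patches is assumed to become one token.
    """
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)
```

Under these assumptions, a single image with a 16x24 patch grid, `expanded_image_tokens((1, 16, 24))`, would occupy 96 positions in the token sequence.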
Supported Model Plugins
| Plugin Name | Target Models | Modalities |
|---|---|---|
| base | Default/generic VLMs | Image, Video, Audio |
| qwen2_vl | Qwen2-VL | Image, Video |
| qwen3_vl | Qwen3-VL | Image, Video |
| llava | LLaVA 1.5 | Image |
| llava_next | LLaVA-NeXT | Image |
| llava_next_video | LLaVA-NeXT-Video | Image, Video |
| internvl | InternVL | Image, Video |
| minicpmv | MiniCPM-V | Image, Video, Audio |
| mllama | LLaMA-3.2-Vision | Image |
| paligemma | PaliGemma | Image |
| gemma3 | Gemma-3 | Image |
| pixtral | Pixtral | Image |
| glm4v | GLM-4V | Image, Video |
| qwen2_audio | Qwen2-Audio | Audio |
| qwen2_omni | Qwen2.5-Omni | Image, Video, Audio |
Usage Examples
from llamafactory.data.mm_plugin import get_mm_plugin

# Assumes messages, images, videos, audios, input_ids, labels, tokenizer,
# and processor have already been prepared by the surrounding data pipeline.

# Get the plugin for Qwen2-VL
plugin = get_mm_plugin(
    name="qwen2_vl",
    image_token="<|image_pad|>",
    video_token="<|video_pad|>",
)

# Process messages to inject model-specific tokens
messages = plugin.process_messages(messages, images, videos, audios, processor)

# Expand token IDs for media placeholders
input_ids, labels = plugin.process_token_ids(
    input_ids, labels, images, videos, audios, tokenizer, processor
)

# Generate multimodal input tensors for the model
# (one sample in the batch, containing one image and no video or audio)
mm_inputs = plugin.get_mm_inputs(
    images, videos, audios,
    imglens=[1], vidlens=[0], audlens=[0],
    batch_ids=[input_ids], processor=processor,
)
Related Pages
- Hiyouga_LLaMA_Factory_Chat_Template - Template that holds a reference to the mm_plugin
- Hiyouga_LLaMA_Factory_Data_Collator - Collators that call get_mm_inputs during batching
- Hiyouga_LLaMA_Factory_HfChatEngine - Inference engine that uses mm_plugin for multimodal input preparation
- Hiyouga_LLaMA_Factory_Constants - IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER constants