Implementation:Hiyouga LLaMA Factory Multimodal Plugin

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Multimodal Processing, Vision-Language Models
Last Updated	2026-02-06 19:00 GMT

Overview

Concrete multimodal plugin system for processing images, videos, and audio across 20+ vision-language model architectures provided by LLaMA Factory.

Description

This module implements the multimodal abstraction layer that enables LLaMA Factory to support a wide range of vision-language and audio-language models through a unified preprocessing interface. The architecture consists of:

MMPluginMixin -- Base mixin defining token constants (image_token, video_token, audio_token), input validation, and common media processing methods (image resizing, video frame sampling, audio resampling)
BasePlugin -- Default plugin that handles message placeholder injection, token ID expansion, and multimodal input tensor generation using standard HuggingFace processor pipelines
20+ Model-Specific Plugins -- Subclasses that override specific methods to match each model's expected input format:
- Qwen2VLPlugin, Qwen3VLPlugin -- Handle mRoPE grid-based position IDs and video grid THW tensors
- LlavaPlugin, LlavaNextPlugin, LlavaNextVideoPlugin -- Handle LLaVA-family image/video token expansion
- InternVLPlugin -- Custom image token replacement with dynamic patch counting
- MiniCPMVPlugin -- Special image bound and slice processing
- MllamaPlugin -- Cross-attention mask generation for LLaMA-3.2 Vision
- PaliGemmaPlugin, Gemma3Plugin -- Token type ID generation for loss masking
- PixtralPlugin -- Image break and end tokens for Pixtral architecture
- GLM4VPlugin -- GLM-4 vision-specific placeholder handling
- Qwen2AudioPlugin, Qwen2OmniPlugin -- Audio feature processing
- And more: ErnieVL, KimiVL, Llama4, VideoLlava, LFMVL, YoutuVL, Gemma3n

A registry pattern with get_mm_plugin selects the correct plugin by name at template registration time.

Usage

Multimodal plugins are instantiated during template creation via get_mm_plugin. They are called at three stages: (1) process_messages during data conversion to inject model-specific placeholder tokens, (2) process_token_ids during tokenization to expand media tokens to the correct sequence length, and (3) get_mm_inputs during collation to generate pixel values, grid sizes, and other multimodal tensors.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/data/mm_plugin.py
Lines: 1-2241

Signature

@dataclass
class MMPluginMixin:
    image_token: str | None
    video_token: str | None
    audio_token: str | None
    expand_mm_tokens: bool = True

    def _validate_input(self, processor, images, videos, audios) -> None: ...
    def _validate_messages(self, messages, images, videos, audios) -> None: ...
    def _preprocess_image(self, image, image_max_pixels, image_min_pixels) -> "ImageObject": ...
    def _regularize_images(self, images, **kwargs) -> "RegularizedImageOutput": ...
    def _regularize_videos(self, videos, **kwargs) -> "RegularizedVideoOutput": ...
    def _regularize_audios(self, audios, sampling_rate, **kwargs) -> "RegularizedAudioOutput": ...
    def _get_mm_inputs(self, images, videos, audios, processor) -> dict[str, "torch.Tensor"]: ...

@dataclass
class BasePlugin(MMPluginMixin):
    def process_messages(self, messages, images, videos, audios, processor) -> list[dict[str, str]]: ...
    def process_token_ids(self, input_ids, labels, images, videos, audios, tokenizer, processor) -> tuple[list[int], list[int] | None]: ...
    def get_mm_inputs(self, images, videos, audios, imglens, vidlens, audlens, batch_ids, processor) -> dict[str, Any]: ...

def get_mm_plugin(name: str, image_token: str | None = None, ...) -> "BasePlugin": ...
def register_mm_plugin(name: str, mm_plugin: type["BasePlugin"]) -> None: ...

Import

from llamafactory.data.mm_plugin import get_mm_plugin, BasePlugin

I/O Contract

Inputs (process_messages)

Name	Type	Required	Description
messages	list[dict[str, str]]	Yes	Chat messages with placeholders to be processed
images	list[ImageInput]	Yes	Image inputs (paths, bytes, PIL Images, or encoded dicts)
videos	list[VideoInput]	Yes	Video inputs (paths, file objects, or nested frame lists)
audios	list[AudioInput]	Yes	Audio inputs (paths, file objects, or numpy arrays)
processor	ProcessorMixin	Yes	HuggingFace processor with image_processor and feature_extractor

Outputs (get_mm_inputs)

Name	Type	Description
pixel_values	torch.Tensor	Processed image/video pixel values with model-specific shape
image_grid_thw	torch.Tensor	Grid dimensions for Qwen2VL-family models
cross_attention_mask	torch.Tensor	Cross-attention mask for Mllama models
input_features	torch.Tensor	Audio features for audio-language models
token_type_ids	list[list[int]]	Token type IDs for PaliGemma/Gemma3 loss masking

Supported Model Plugins

Plugin Name	Target Models	Modalities
base	Default/generic VLMs	Image, Video, Audio
qwen2_vl	Qwen2-VL	Image, Video
qwen3_vl	Qwen3-VL	Image, Video
llava	LLaVA 1.5	Image
llava_next	LLaVA-NeXT	Image
llava_next_video	LLaVA-NeXT-Video	Image, Video
internvl	InternVL	Image, Video
minicpmv	MiniCPM-V	Image, Video, Audio
mllama	LLaMA-3.2-Vision	Image
paligemma	PaliGemma	Image
gemma3	Gemma-3	Image
pixtral	Pixtral	Image
glm4v	GLM-4V	Image, Video
qwen2_audio	Qwen2-Audio	Audio
qwen2_omni	Qwen2.5-Omni	Image, Video, Audio

Usage Examples

from llamafactory.data.mm_plugin import get_mm_plugin

# Get the plugin for Qwen2-VL
plugin = get_mm_plugin(
    name="qwen2_vl",
    image_token="<|image_pad|>",
    video_token="<|video_pad|>",
)

# Process messages to inject model-specific tokens
messages = plugin.process_messages(messages, images, videos, audios, processor)

# Expand token IDs for media placeholders
input_ids, labels = plugin.process_token_ids(
    input_ids, labels, images, videos, audios, tokenizer, processor
)

# Generate multimodal input tensors for the model
mm_inputs = plugin.get_mm_inputs(
    images, videos, audios,
    imglens=[1], vidlens=[0], audlens=[0],
    batch_ids=[input_ids], processor=processor,
)

Related Pages

Hiyouga_LLaMA_Factory_Chat_Template - Template that holds a reference to the mm_plugin
Hiyouga_LLaMA_Factory_Data_Collator - Collators that call get_mm_inputs during batching
Hiyouga_LLaMA_Factory_HfChatEngine - Inference engine that uses mm_plugin for multimodal input preparation
Hiyouga_LLaMA_Factory_Constants - IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER constants

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment