Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Hiyouga LLaMA Factory Multimodal Plugin

From Leeroopedia


Knowledge Sources
Domains Multimodal Processing, Vision-Language Models
Last Updated 2026-02-06 19:00 GMT

Overview

Concrete multimodal plugin system for processing images, videos, and audio across 20+ vision-language model architectures provided by LLaMA Factory.

Description

This module implements the multimodal abstraction layer that enables LLaMA Factory to support a wide range of vision-language and audio-language models through a unified preprocessing interface. The architecture consists of:

  • MMPluginMixin -- Base mixin defining token constants (image_token, video_token, audio_token), input validation, and common media processing methods (image resizing, video frame sampling, audio resampling)
  • BasePlugin -- Default plugin that handles message placeholder injection, token ID expansion, and multimodal input tensor generation using standard HuggingFace processor pipelines
  • 20+ Model-Specific Plugins -- Subclasses that override specific methods to match each model's expected input format:
    • Qwen2VLPlugin, Qwen3VLPlugin -- Handle mRoPE grid-based position IDs and video grid THW tensors
    • LlavaPlugin, LlavaNextPlugin, LlavaNextVideoPlugin -- Handle LLaVA-family image/video token expansion
    • InternVLPlugin -- Custom image token replacement with dynamic patch counting
    • MiniCPMVPlugin -- Special image bound and slice processing
    • MllamaPlugin -- Cross-attention mask generation for LLaMA-3.2 Vision
    • PaliGemmaPlugin, Gemma3Plugin -- Token type ID generation for loss masking
    • PixtralPlugin -- Image break and end tokens for Pixtral architecture
    • GLM4VPlugin -- GLM-4 vision-specific placeholder handling
    • Qwen2AudioPlugin, Qwen2OmniPlugin -- Audio feature processing
    • And more: ErnieVL, KimiVL, Llama4, VideoLlava, LFMVL, YoutuVL, Gemma3n

A registry pattern with get_mm_plugin selects the correct plugin by name at template registration time.

Usage

Multimodal plugins are instantiated during template creation via get_mm_plugin. They are called at three stages: (1) process_messages during data conversion to inject model-specific placeholder tokens, (2) process_token_ids during tokenization to expand media tokens to the correct sequence length, and (3) get_mm_inputs during collation to generate pixel values, grid sizes, and other multimodal tensors.

Code Reference

Source Location

Signature

@dataclass
class MMPluginMixin:
    image_token: str | None
    video_token: str | None
    audio_token: str | None
    expand_mm_tokens: bool = True

    def _validate_input(self, processor, images, videos, audios) -> None: ...
    def _validate_messages(self, messages, images, videos, audios) -> None: ...
    def _preprocess_image(self, image, image_max_pixels, image_min_pixels) -> "ImageObject": ...
    def _regularize_images(self, images, **kwargs) -> "RegularizedImageOutput": ...
    def _regularize_videos(self, videos, **kwargs) -> "RegularizedVideoOutput": ...
    def _regularize_audios(self, audios, sampling_rate, **kwargs) -> "RegularizedAudioOutput": ...
    def _get_mm_inputs(self, images, videos, audios, processor) -> dict[str, "torch.Tensor"]: ...

@dataclass
class BasePlugin(MMPluginMixin):
    def process_messages(self, messages, images, videos, audios, processor) -> list[dict[str, str]]: ...
    def process_token_ids(self, input_ids, labels, images, videos, audios, tokenizer, processor) -> tuple[list[int], list[int] | None]: ...
    def get_mm_inputs(self, images, videos, audios, imglens, vidlens, audlens, batch_ids, processor) -> dict[str, Any]: ...

def get_mm_plugin(name: str, image_token: str | None = None, ...) -> "BasePlugin": ...
def register_mm_plugin(name: str, mm_plugin: type["BasePlugin"]) -> None: ...

Import

from llamafactory.data.mm_plugin import get_mm_plugin, BasePlugin

I/O Contract

Inputs (process_messages)

Name Type Required Description
messages list[dict[str, str]] Yes Chat messages with placeholders to be processed
images list[ImageInput] Yes Image inputs (paths, bytes, PIL Images, or encoded dicts)
videos list[VideoInput] Yes Video inputs (paths, file objects, or nested frame lists)
audios list[AudioInput] Yes Audio inputs (paths, file objects, or numpy arrays)
processor ProcessorMixin Yes HuggingFace processor with image_processor and feature_extractor

Outputs (get_mm_inputs)

Name Type Description
pixel_values torch.Tensor Processed image/video pixel values with model-specific shape
image_grid_thw torch.Tensor Grid dimensions for Qwen2VL-family models
cross_attention_mask torch.Tensor Cross-attention mask for Mllama models
input_features torch.Tensor Audio features for audio-language models
token_type_ids list[list[int]] Token type IDs for PaliGemma/Gemma3 loss masking

Supported Model Plugins

Plugin Name Target Models Modalities
base Default/generic VLMs Image, Video, Audio
qwen2_vl Qwen2-VL Image, Video
qwen3_vl Qwen3-VL Image, Video
llava LLaVA 1.5 Image
llava_next LLaVA-NeXT Image
llava_next_video LLaVA-NeXT-Video Image, Video
internvl InternVL Image, Video
minicpmv MiniCPM-V Image, Video, Audio
mllama LLaMA-3.2-Vision Image
paligemma PaliGemma Image
gemma3 Gemma-3 Image
pixtral Pixtral Image
glm4v GLM-4V Image, Video
qwen2_audio Qwen2-Audio Audio
qwen2_omni Qwen2.5-Omni Image, Video, Audio

Usage Examples

from llamafactory.data.mm_plugin import get_mm_plugin

# Get the plugin for Qwen2-VL
plugin = get_mm_plugin(
    name="qwen2_vl",
    image_token="<|image_pad|>",
    video_token="<|video_pad|>",
)

# Process messages to inject model-specific tokens
messages = plugin.process_messages(messages, images, videos, audios, processor)

# Expand token IDs for media placeholders
input_ids, labels = plugin.process_token_ids(
    input_ids, labels, images, videos, audios, tokenizer, processor
)

# Generate multimodal input tensors for the model
mm_inputs = plugin.get_mm_inputs(
    images, videos, audios,
    imglens=[1], vidlens=[0], audlens=[0],
    batch_ids=[input_ids], processor=processor,
)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment