Principle:Hiyouga LLaMA Factory Multimodal Processing

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Multimodal Learning, Computer Vision, Natural Language Processing, Audio Processing
Last Updated	2026-02-06 19:00 GMT

Overview

A multimodal data processing framework that enables language models to accept and reason over images, videos, and audio alongside text by converting non-textual inputs into token-space representations compatible with the transformer architecture.

Description

Multimodal processing extends large language models beyond text to handle images, videos, and audio inputs. This is accomplished through a pipeline that: (1) identifies multimodal content in conversation messages, (2) preprocesses the raw media using model-specific processors, and (3) integrates the processed representations into the token sequence that the language model consumes.

The multimodal processing system addresses several challenges:

Diverse model architectures: Different vision-language models (LLaVA, Qwen2-VL, LLaVA-OneVision, Mllama, PaliGemma, etc.) integrate visual information in different ways: some place image tokens in the text sequence, some use cross-attention, and some use special grid representations.
Multiple modalities: Images, videos, and audio each require different preprocessing pipelines, from image resizing and normalization to video frame extraction and audio resampling.
Placeholder management: Multimodal placeholders (<image>, <video>, <audio>) in the conversation text must be replaced with the correct number of special tokens corresponding to the processed media representation.
Training considerations: During training, special attention must be paid to which tokens receive gradients (the image tokens in the input typically do not contribute to the loss), and vision tower parameters may optionally be frozen.
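
The placeholder and loss-masking points above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function names, the `<|img|>` token string, and the token id 99 used in the usage note are hypothetical. The only external fact relied on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def expand_placeholders(text, placeholder, token, counts):
    """Replace the i-th occurrence of `placeholder` with counts[i] copies of `token`.

    `counts` holds the number of tokens each media item expands to, in order.
    """
    parts = text.split(placeholder)
    if len(parts) - 1 != len(counts):
        raise ValueError("placeholder count does not match number of media items")
    out = parts[0]
    for count, rest in zip(counts, parts[1:]):
        out += token * count + rest
    return out


def mask_media_labels(input_ids, media_token_id):
    """Build labels in which media token positions are excluded from the loss."""
    return [IGNORE_INDEX if tok == media_token_id else tok for tok in input_ids]
```

For example, `expand_placeholders("A <image> B", "<image>", "<|img|>", [2])` yields a string with two image tokens in place of the single placeholder, and `mask_media_labels([5, 99, 99, 7], 99)` marks the two media positions as ignored.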

The plugin architecture provides a unified interface with model-specific implementations. Each plugin handles:

Message processing: Inserting multimodal placeholders into conversation messages.
Token processing: Replacing placeholder tokens with the correct number of image/video/audio tokens.
Batch processing: Loading and preprocessing the actual media files (images, video frames, audio waveforms) and producing the tensor inputs expected by the model.
Padding: Handling the variable-length nature of multimodal inputs within batched training.
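
As a rough illustration of the padding step, the sketch below pads token sequences whose lengths differ after placeholder expansion (one sample may carry 576 image tokens, another none). The function name and signature are hypothetical, and it covers only token ids and attention masks, not the media tensors themselves.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad variable-length token id lists to the longest length in the batch.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(pad)
        if side == "right":
            input_ids.append(list(seq) + pad)
            attention_mask.append([1] * len(seq) + mask_pad)
        else:
            input_ids.append(pad + list(seq))
            attention_mask.append(mask_pad + [1] * len(seq))
    return input_ids, attention_mask
```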

Usage

Use multimodal processing when you want to:

Fine-tune vision-language models (VLMs) on image-text or video-text datasets.
Train models that combine visual understanding with language generation.
Process datasets containing interleaved text, images, videos, and audio.
Adapt existing VLMs to new visual domains or tasks.

Multimodal processing is activated automatically when the model is a composite (vision-language) model and the dataset contains image, video, or audio fields.
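
A useful sanity check before training is that the number of placeholders in a record matches the number of media items it supplies. The sketch below assumes a record layout with a `messages` list and `images`, `videos`, and `audios` fields; these field names and the helper itself are assumptions for illustration and may not match a given dataset configuration.

```python
PLACEHOLDER_FIELDS = (("<image>", "images"), ("<video>", "videos"), ("<audio>", "audios"))


def count_mismatches(example):
    """Return {field: (placeholders_found, items_provided)} for every mismatched modality."""
    text = "".join(message["content"] for message in example.get("messages", []))
    mismatches = {}
    for placeholder, field in PLACEHOLDER_FIELDS:
        found = text.count(placeholder)
        provided = len(example.get(field) or [])
        if found != provided:
            mismatches[field] = (found, provided)
    return mismatches
```

An empty result means every placeholder has a corresponding media item and vice versa.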

Theoretical Basis

Vision-Language Model Architecture

Modern vision-language models typically follow a three-component architecture:

$VLM (x_{text}, x_{image}) = LLM (Embed (x_{text}) \oplus Proj (VE (x_{image})))$

where:

$VE$ is a vision encoder (e.g., CLIP ViT, SigLIP) that extracts visual features.
$Proj$ is a projector (linear layer or MLP) that maps visual features to the language model's embedding space.
$LLM$ is the language model backbone.
$\oplus$ denotes the interleaving of text and visual tokens in the sequence.
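
To make the $\oplus$ operation concrete, here is a small pure-Python sketch under simplifying assumptions: the vision encoder output is given as a list of feature vectors, the projector is a single linear map without bias, and every image token position in the sequence receives one projected visual vector. All names are hypothetical.

```python
def project(features, weight):
    """Linear projector: map each visual feature (length d_v) to d_model using weight (d_v x d_model)."""
    d_model = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d_model)]
        for f in features
    ]


def merge_embeddings(token_ids, text_embed, visual_tokens, image_token_id):
    """Interleave text and visual embeddings following the token sequence.

    Positions holding image_token_id take the next visual vector;
    all other positions take the text embedding of their token id.
    """
    visual_iter = iter(visual_tokens)
    merged = []
    for tok in token_ids:
        if tok == image_token_id:
            merged.append(next(visual_iter))
        else:
            merged.append(text_embed[tok])
    return merged
```

The merged list is what the language model backbone would consume in place of plain text embeddings.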

Image Token Representation

An image of spatial resolution $H \times W$ is processed by the vision encoder into a grid of features, which are then projected into $N$ visual tokens:

$N = \frac{H \times W}{P^{2}}$

where $P$ is the vision encoder's patch size. For example, a 336x336 image with patch size 14 produces $24 \times 24 = 576$ visual tokens. Some models apply additional spatial pooling or token merging to reduce this count.
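
The token count can be checked with a short helper. The `merge_size` parameter is an assumption standing in for the spatial pooling or token merging mentioned above (a value of 2 merges each 2x2 block of patches into one token); real models differ in how they handle sizes that do not divide evenly.

```python
def num_visual_tokens(height, width, patch_size, merge_size=1):
    """Number of visual tokens for an H x W image: (H / P) * (W / P), optionally reduced by merging."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    grid_h = height // patch_size
    grid_w = width // patch_size
    return (grid_h // merge_size) * (grid_w // merge_size)


print(num_visual_tokens(336, 336, 14))                # 576
print(num_visual_tokens(336, 336, 14, merge_size=2))  # 144
```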

Video Processing

Videos are processed by extracting a sequence of frames and treating each frame as an image:

$frames (V) = \{V_{t_{1}}, V_{t_{2}}, \dots, V_{t_{k}}\}$

where $k$ is the number of sampled frames determined by the model's configuration. Frames are sampled uniformly across the video duration. The total number of visual tokens for a video is:

$N_{video} = k \times N_{frame}$
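
A minimal sketch of uniform frame sampling and the resulting token count follows. It assumes evenly spaced indices that include the first and last frame; actual implementations may choose the endpoints or the number of frames differently (for example, based on a target frames-per-second setting). Function names are hypothetical.

```python
def sample_frame_indices(total_frames, k):
    """Pick up to k frame indices spread evenly over [0, total_frames - 1]."""
    k = min(k, total_frames)
    if k <= 0:
        return []
    if k == 1:
        return [0]
    step = (total_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


def num_video_tokens(num_frames, tokens_per_frame):
    """N_video = k * N_frame."""
    return num_frames * tokens_per_frame
```

With 5 sampled frames at 576 tokens each, a video contributes 2880 visual tokens, which is why frame counts and per-frame resolution are usually capped.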

Some models (e.g., Qwen2-VL) use temporal-spatial grid representations that encode frame position information through 3D positional embeddings, driven by grid metadata in (temporal, height, width) order such as image_grid_thw and video_grid_thw.

Audio Processing

Audio inputs are processed by:

Resampling to the model's expected sample rate (typically 16 kHz).
Extracting features via a feature extractor (e.g., Whisper encoder) that produces a sequence of audio tokens.
Projecting audio features into the language model's embedding space.

$N_{audio} = \lfloor \frac{duration \times sample\_rate}{hop\_length} \rfloor$
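
The audio length estimate, the floor of duration times sample rate divided by hop length, can be computed as below. The default hop length of 160 samples is an assumption matching the Whisper feature extractor's default at 16 kHz; the number of tokens the language model finally sees can be smaller if the encoder or projector downsamples further.

```python
def num_audio_frames(duration_s, sample_rate=16000, hop_length=160):
    """floor(duration * sample_rate / hop_length), computed on whole samples."""
    num_samples = int(round(duration_s * sample_rate))
    return num_samples // hop_length


print(num_audio_frames(30))   # 3000
print(num_audio_frames(2.5))  # 250
```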

Cross-Attention Integration

Some models (e.g., Mllama) use cross-attention rather than token interleaving. In this approach, text representations attend to visual features through dedicated cross-attention layers, with queries computed from the text and keys and values computed from the image:

$CrossAttn (Q_{text}, K_{image}, V_{image}) = softmax (\frac{Q_{text} K_{image}^{⊤}}{\sqrt{d}}) V_{image}$

This requires a cross-attention mask indicating which text tokens attend to which image features. The mask is built in a sparse form (token ranges per image) and then expanded into a dense attention mask.
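
The idea of a dense text-to-image mask can be sketched as follows. This is a simplified assumption, not the exact rule used by any particular model: each token attends to the most recent image that appears at or before its position, and tokens before the first image attend to none. The function name and token ids are hypothetical.

```python
def build_cross_attention_mask(input_ids, image_token_id):
    """Return a dense mask of shape [seq_len][num_images] with 1 where a token may attend to an image."""
    num_images = sum(1 for tok in input_ids if tok == image_token_id)
    mask = [[0] * num_images for _ in input_ids]
    current = -1  # index of the most recent image seen so far
    for pos, tok in enumerate(input_ids):
        if tok == image_token_id:
            current += 1
        if current >= 0:
            mask[pos][current] = 1
    return mask
```

For a sequence with two image tokens, the tokens between them attend only to the first image and the tokens after the second attend only to the second.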

Composite Model Training

During multimodal fine-tuning, different components may be trained or frozen independently:

Component	Typical Training Strategy
Vision Encoder	Frozen (optionally unfrozen for full fine-tuning)
Projector	Trainable (critical for alignment)
Language Model	LoRA adapters or full fine-tuning

The freeze_vision_tower parameter controls whether the vision encoder's parameters are included in the trainable set. When frozen, the vision encoder's modules are excluded from LoRA target module discovery.
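
The effect of freezing on LoRA target discovery can be illustrated with a name-based filter. This is a sketch under stated assumptions: the module names, the `vision_tower` key, and the helper function are hypothetical, and real models name their vision components differently.

```python
def find_lora_targets(module_names, freeze_vision_tower=True, vision_keys=("vision_tower",)):
    """Return the module names eligible for LoRA adapters.

    When the vision tower is frozen, any module whose dotted path contains
    one of vision_keys as a path component is skipped.
    """
    targets = []
    for name in module_names:
        components = name.split(".")
        is_vision = any(key in components for key in vision_keys)
        if freeze_vision_tower and is_vision:
            continue
        targets.append(name)
    return targets
```

Called with the vision tower frozen, only projector and language model modules remain; with `freeze_vision_tower=False`, the vision encoder's layers become adapter targets as well.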
