Principle:Hiyouga LLaMA Factory Multimodal Processing
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Learning, Computer Vision, Natural Language Processing, Audio Processing |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A multimodal data processing framework that enables language models to accept and reason over images, videos, and audio alongside text by converting non-textual inputs into token-space representations compatible with the transformer architecture.
Description
Multimodal processing extends large language models beyond text to handle images, videos, and audio inputs. This is accomplished through a pipeline that: (1) identifies multimodal content in conversation messages, (2) preprocesses the raw media using model-specific processors, and (3) integrates the processed representations into the token sequence that the language model consumes.
The multimodal processing system addresses several challenges:
- Diverse model architectures: Different vision-language models (LLaVA, Qwen2-VL, LLaVA-OneVision, Mllama, PaliGemma, etc.) use different strategies for integrating visual information: some place image tokens in the text sequence, others use cross-attention, and others use special grid representations.
- Multiple modalities: Images, videos, and audio each require different preprocessing pipelines, from image resizing and normalization to video frame extraction and audio resampling.
- Placeholder management: Multimodal placeholders (<image>, <video>, <audio>) in the conversation text must be replaced with the correct number of special tokens corresponding to the processed media representation.
- Training considerations: During training, special attention must be paid to which tokens receive gradients (the image tokens in the input typically do not contribute to the loss), and vision tower parameters may optionally be frozen.
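The placeholder and loss-masking points above can be illustrated with a minimal sketch. It is not LLaMA Factory's implementation: the function names and the `<|img|>` token string are hypothetical, and only the value -100 (the default `ignore_index` of PyTorch's cross-entropy loss) is a real convention.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
IMAGE_PLACEHOLDER = "<image>"


def expand_placeholders(text, tokens_per_image, image_token="<|img|>"):
    """Replace each <image> placeholder with the matching number of image tokens.

    tokens_per_image holds one count per image, in order of appearance.
    image_token is a hypothetical special-token string.
    """
    parts = text.split(IMAGE_PLACEHOLDER)
    if len(parts) - 1 != len(tokens_per_image):
        raise ValueError("number of placeholders does not match number of images")
    out = [parts[0]]
    for count, tail in zip(tokens_per_image, parts[1:]):
        out.append(image_token * count)
        out.append(tail)
    return "".join(out)


def mask_image_labels(input_ids, image_token_id):
    """Copy input_ids into labels, ignoring positions that hold image tokens."""
    return [IGNORE_INDEX if tok == image_token_id else tok for tok in input_ids]
```

For example, `expand_placeholders("a <image> b", [3])` yields the text with three consecutive `<|img|>` tokens in place of the placeholder, and `mask_image_labels` leaves text positions untouched while excluding image positions from the loss.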
The plugin architecture provides a unified interface with model-specific implementations. Each plugin handles:
- Message processing: Inserting multimodal placeholders into conversation messages.
- Token processing: Replacing placeholder tokens with the correct number of image/video/audio tokens.
- Batch processing: Loading and preprocessing the actual media files (images, video frames, audio waveforms) and producing the tensor inputs expected by the model.
- Padding: Handling the variable-length nature of multimodal inputs within batched training.
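The four responsibilities above could be organized roughly as follows. This is a hedged sketch: the class names, method names, and signatures are hypothetical and chosen for illustration, and the toy behaviors (prepending missing placeholders, right-padding) are assumptions rather than a description of the project's actual plugins.

```python
from abc import ABC, abstractmethod


class MultimodalPlugin(ABC):
    """Hypothetical interface; names are illustrative, not a real API."""

    @abstractmethod
    def process_messages(self, messages, num_images):
        """Return messages containing one placeholder per image."""

    @abstractmethod
    def process_token_ids(self, token_ids, tokens_per_image):
        """Return token ids with each placeholder id expanded."""

    @abstractmethod
    def get_model_inputs(self, images):
        """Return the extra model inputs built from the raw media."""

    def pad_batch(self, sequences, pad_id):
        """Right-pad variable-length sequences to a common length."""
        width = max(len(seq) for seq in sequences)
        return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


class ToyImagePlugin(MultimodalPlugin):
    def __init__(self, placeholder_id, image_token_id, placeholder="<image>"):
        self.placeholder_id = placeholder_id
        self.image_token_id = image_token_id
        self.placeholder = placeholder

    def process_messages(self, messages, num_images):
        # Toy policy: prepend any missing placeholders to the first message.
        present = sum(m["content"].count(self.placeholder) for m in messages)
        missing = num_images - present
        if missing < 0:
            raise ValueError("more placeholders than images")
        if missing == 0:
            return messages
        first = dict(messages[0])
        first["content"] = self.placeholder * missing + first["content"]
        return [first] + messages[1:]

    def process_token_ids(self, token_ids, tokens_per_image):
        counts = iter(tokens_per_image)
        out = []
        for tok in token_ids:
            if tok == self.placeholder_id:
                out.extend([self.image_token_id] * next(counts))
            else:
                out.append(tok)
        return out

    def get_model_inputs(self, images):
        # A real plugin would call the model-specific image processor here.
        return {"num_images": len(images)}
```

Keeping padding in the base class and the media-specific steps in subclasses is one way to share batching logic across models; a real design may split these responsibilities differently.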
Usage
Use multimodal processing when you want to:
- Fine-tune vision-language models (VLMs) on image-text or video-text datasets.
- Train models that combine visual understanding with language generation.
- Process datasets containing interleaved text, images, videos, and audio.
- Adapt existing VLMs to new visual domains or tasks.
Multimodal processing is activated automatically when the model is a composite (vision-language) model and the dataset contains image, video, or audio fields.
Theoretical Basis
Vision-Language Model Architecture
Modern vision-language models typically follow a three-component architecture:
$$ y = f_{\text{LLM}}\big(x_{\text{text}} \oplus g_{\text{proj}}(h_{\text{vis}}(x_{\text{image}}))\big) $$
where:
- $h_{\text{vis}}$ is a vision encoder (e.g., CLIP ViT, SigLIP) that extracts visual features.
- $g_{\text{proj}}$ is a projector (linear layer or MLP) that maps visual features to the language model's embedding space.
- $f_{\text{LLM}}$ is the language model backbone.
- $\oplus$ denotes the interleaving of text and visual tokens in the sequence.
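The encoder, projector, and interleaving steps described above can be sketched with plain Python lists. Everything here is a toy stand-in: the "encoder" averages pixel intensities per patch, the "projector" is a single linear map, and the function names are hypothetical.

```python
def encode(image, patch):
    """Toy vision encoder: one feature (mean intensity) per patch x patch tile."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    feats = []
    for r in range(rows):
        for c in range(cols):
            tile = [image[r * patch + i][c * patch + j]
                    for i in range(patch) for j in range(patch)]
            feats.append([sum(tile) / len(tile)])
    return feats  # shape: (rows * cols, 1)


def project(feats, weight):
    """Toy projector: linear map from vision width to LM embedding width.

    weight has shape (vision_width, lm_width).
    """
    return [[sum(f * w for f, w in zip(feat, col)) for col in zip(*weight)]
            for feat in feats]


def interleave(text_embeds, visual_embeds, position):
    """Splice visual embeddings into the text embedding sequence at position."""
    return text_embeds[:position] + visual_embeds + text_embeds[position:]


image = [[1.0] * 4 for _ in range(4)]   # 4x4 image, patch size 2 -> 4 patches
visual = project(encode(image, patch=2), weight=[[1.0, 2.0]])
sequence = interleave([[0.0, 0.0], [9.0, 9.0]], visual, position=1)
```

The resulting `sequence` is what the language model backbone would consume: text embeddings with the projected visual embeddings inserted at the placeholder position.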
Image Token Representation
An image of spatial resolution $H \times W$ is processed by the vision encoder into a grid of features, which are then projected into visual tokens:
$$ N_{\text{visual}} = \frac{H}{p} \times \frac{W}{p} $$
where $p$ is the vision encoder's patch size. For example, a 336x336 image with patch size 14 produces $24 \times 24 = 576$ visual tokens. Some models apply additional spatial pooling or token merging to reduce this count.
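A short sketch of the token-count arithmetic, assuming the image size is an exact multiple of the patch size. The `merge_size` parameter is a hypothetical way to model the optional spatial token merging mentioned above; it is not taken from any specific model's configuration.

```python
def num_visual_tokens(height, width, patch_size, merge_size=1):
    """Patch-grid token count; merge_size models optional spatial token merging."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    if grid_h % merge_size or grid_w % merge_size:
        raise ValueError("patch grid must be a multiple of the merge size")
    return (grid_h // merge_size) * (grid_w // merge_size)


print(num_visual_tokens(336, 336, 14))                # 576
print(num_visual_tokens(336, 336, 14, merge_size=2))  # 144
```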
Video Processing
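Videos are processed by extracting a sequence of frames and treating each frame as an image:
$$ V = \{f_1, f_2, \ldots, f_T\} $$
where $T$ is the number of sampled frames determined by the model's configuration. Frames are sampled uniformly across the video duration. The total number of visual tokens for a video is:
$$ N_{\text{video}} = T \times N_{\text{visual}} $$
where $N_{\text{visual}}$ is the number of visual tokens produced per frame.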
Some models (e.g., Qwen2-VL) use temporal-spatial grid representations that encode frame position information through 3D positional embeddings, driven by image_grid_thw and video_grid_thw (temporal, height, width) metadata.
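Uniform frame sampling and the resulting token count can be sketched as follows. The function names are hypothetical, and the choice to always include the first and last frame is an assumption; actual samplers may pick frame positions differently.

```python
def sample_frame_indices(total_frames, num_samples):
    """Evenly spaced frame indices that include the first and last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic (no float rounding).
    return [i * (total_frames - 1) // (num_samples - 1) for i in range(num_samples)]


def video_token_count(num_frames, tokens_per_frame):
    """Total visual tokens when every sampled frame is treated as an image."""
    return num_frames * tokens_per_frame


indices = sample_frame_indices(total_frames=100, num_samples=5)
print(indices)                                # [0, 24, 49, 74, 99]
print(video_token_count(len(indices), 576))   # 2880
```

The token count grows linearly with the number of sampled frames, which is why the frame budget is usually a configuration parameter rather than a property of the video.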
Audio Processing
Audio inputs are processed by:
- Resampling to the model's expected sample rate (typically 16 kHz).
- Extracting features via a feature extractor (e.g., Whisper encoder) that produces a sequence of audio tokens.
- Projecting audio features into the language model's embedding space.
$$ N_{\text{audio}} = \left\lfloor \frac{\text{duration} \times \text{sample\_rate}}{\text{hop\_length}} \right\rfloor $$
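The audio length arithmetic can be checked with a small sketch. The default `hop_length=160` is an assumption (it corresponds to 10 ms at 16 kHz); real feature extractors may use a different hop and may downsample further inside the encoder, so the result should be read as a frame count before any such reduction.

```python
import math


def resampled_length(num_samples, orig_rate, target_rate=16000):
    """Number of samples after resampling to target_rate (rounded down)."""
    return num_samples * target_rate // orig_rate


def num_audio_frames(duration_s, sample_rate=16000, hop_length=160):
    """floor(duration * sample_rate / hop_length) feature frames."""
    return math.floor(duration_s * sample_rate / hop_length)


print(resampled_length(44100, orig_rate=44100))  # 16000
print(num_audio_frames(30))                      # 3000
```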
Cross-Attention Integration
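Some models (e.g., Mllama) use cross-attention rather than token interleaving. In this approach, text representations attend to visual features through dedicated cross-attention layers, with queries taken from the text hidden states and keys and values taken from the visual features:
$$ \text{CrossAttn}(H_{\text{text}}, H_{\text{vis}}) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad Q = H_{\text{text}} W_Q,\; K = H_{\text{vis}} W_K,\; V = H_{\text{vis}} W_V $$
This requires a cross-attention mask that indicates which text tokens should attend to which image features. The mask is typically built in a sparse form (a token range per image) and then expanded into a dense attention mask.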
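One way to picture the cross-attention mask is as a per-image token range that is expanded into a 0/1 matrix. The sketch below is a simplification under a stated assumption: each token attends to the most recent image token at or before its position. The function names are hypothetical, and real processors handle additional cases (for example, consecutive images and image tiles).

```python
def sparse_image_ranges(token_ids, image_token_id):
    """One [start, end) token range per image: from its token to the next image token."""
    starts = [i for i, tok in enumerate(token_ids) if tok == image_token_id]
    ends = starts[1:] + [len(token_ids)]
    return list(zip(starts, ends))


def to_dense_mask(ranges, seq_len):
    """Expand ranges into a seq_len x num_images matrix of 0/1 flags."""
    mask = [[0] * len(ranges) for _ in range(seq_len)]
    for image_idx, (start, end) in enumerate(ranges):
        for pos in range(start, end):
            mask[pos][image_idx] = 1
    return mask


tokens = [7, 1, 1, 7, 1]          # 7 marks an image token (hypothetical id)
ranges = sparse_image_ranges(tokens, image_token_id=7)
dense = to_dense_mask(ranges, seq_len=len(tokens))
```

Here the first three positions attend only to the first image and the last two only to the second, so `dense` has one column per image and one row per token.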
Composite Model Training
During multimodal fine-tuning, different components may be trained or frozen independently:
| Component | Typical Training Strategy |
|---|---|
| Vision Encoder | Frozen (optionally unfrozen for full fine-tuning) |
| Projector | Trainable (critical for alignment) |
| Language Model | LoRA adapters or full fine-tuning |
The freeze_vision_tower parameter controls whether the vision encoder's parameters are included in the trainable set. When frozen, the vision encoder's modules are excluded from LoRA target module discovery.
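The effect of freezing the vision tower on LoRA target discovery can be sketched over plain module names. This is an illustration only: the function name, the `vision_prefixes` default, and the example module names are hypothetical, and a real implementation would walk the model's modules rather than a list of strings.

```python
def find_lora_targets(linear_module_names, freeze_vision_tower,
                      vision_prefixes=("vision_tower",)):
    """Return dotted module names eligible for LoRA adapters.

    When freeze_vision_tower is True, any module that has one of
    vision_prefixes as a path component is skipped.
    """
    targets = []
    for name in linear_module_names:
        components = name.split(".")
        in_vision_tower = any(prefix in components for prefix in vision_prefixes)
        if freeze_vision_tower and in_vision_tower:
            continue
        targets.append(name)
    return targets


names = [
    "vision_tower.blocks.0.attn.qkv",
    "multi_modal_projector.linear_1",
    "language_model.layers.0.self_attn.q_proj",
]
frozen = find_lora_targets(names, freeze_vision_tower=True)
unfrozen = find_lora_targets(names, freeze_vision_tower=False)
```

With the vision tower frozen, only the projector and language model modules remain as candidates; with it unfrozen, all three names are returned.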