Principle:Volcengine Verl VLM Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Model_Architecture, Configuration |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Configuration of vision-language models for RL training, including vision encoder freezing, multimodal input handling, and VLM-specific compute optimizations.
Description
VLM Model Configuration extends standard model configuration with settings specific to vision-language models. Key additions:
- Vision encoder freezing: Option to freeze the vision tower during RL training so only the language model receives gradient updates, preserving visual understanding
- Padding removal: Critical efficiency optimization that strips padding tokens so compute is spent only on meaningful tokens
- Fused kernels: VLM-specific fused attention kernels for multimodal inputs
- Module exclusion: Ability to exclude vision modules from LoRA adaptation (e.g., `exclude_modules=".*visual.*"`)
Supported VLMs include Qwen2.5-VL, Qwen3-VL, GLM-4.1V, and MiniCPM-o.
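As a sketch, the four options above can be collected into a single model section. The key names mirror the options named on this page, but the exact schema varies by verl release, so treat this as illustrative (verl configs are Hydra/OmegaConf):

```python
# Illustrative only -- key names follow the options above, not a guaranteed schema.
from omegaconf import OmegaConf

model_cfg = OmegaConf.create({
    "path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "freeze_vision_tower": True,      # vision encoder receives no gradient updates
    "use_remove_padding": True,       # drop padding tokens before the forward pass
    "use_fused_kernels": True,        # fused attention kernels for multimodal inputs
    "exclude_modules": ".*visual.*",  # keep LoRA adapters off the vision tower
})
print(OmegaConf.to_yaml(model_cfg))
```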
Usage
Use VLM model configuration when setting up RL training for vision-language models. Key decisions:
- Freeze the vision tower (`freeze_vision_tower=True`) to preserve visual capabilities
- Enable padding removal (`use_remove_padding=True`) for efficiency
- Exclude vision modules from LoRA (`exclude_modules=".*visual.*"`)
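At the PyTorch level, freezing the vision tower amounts to disabling gradients on every parameter under the vision submodule. A minimal sketch, assuming a Qwen2.5-VL-style model whose vision encoder lives under a module matching `.*visual.*` (the same regex used for LoRA exclusion above); this illustrates the effect, and is not verl's implementation:

```python
import re
import torch.nn as nn

def freeze_vision_tower(model: nn.Module, pattern: str = ".*visual.*") -> int:
    """Disable gradients on all parameters whose names match `pattern`.

    Illustrative helper, not verl's code. Returns the number of frozen
    parameters so the effect can be sanity-checked.
    """
    regex = re.compile(pattern)
    frozen = 0
    for name, param in model.named_parameters():
        if regex.match(name):
            param.requires_grad = False
            frozen += 1
    return frozen
```

After this, only language-model parameters remain trainable, which is what preserves visual understanding during RL updates.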
Theoretical Basis
VLM model configuration adds vision-specific settings to standard LLM configuration:
```python
# Abstract VLM configuration decisions
model_config = HFModelConfig(
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    use_remove_padding=True,       # critical for VLM efficiency
    use_fused_kernels=True,        # fused multimodal attention
    exclude_modules=".*visual.*",  # no LoRA on the vision tower
)

actor_config = ActorConfig(
    freeze_vision_tower=True,      # freeze the vision encoder
)
```