Principle:Volcengine Verl VLM Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Model_Architecture, Configuration |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Configuration of vision-language models for RL training, including vision encoder freezing, multimodal input handling, and VLM-specific compute optimizations.
Description
VLM Model Configuration extends standard model configuration with settings specific to vision-language models. Key additions:
- Vision encoder freezing: Option to freeze the vision tower during RL training so only the language model receives gradient updates, preserving visual understanding
- Padding removal: Critical efficiency optimization that strips padding tokens so compute is spent only on meaningful tokens
- Fused kernels: VLM-specific fused attention kernels for multimodal inputs
- Module exclusion: Ability to exclude vision modules from LoRA adaptation (e.g., `exclude_modules=".*visual.*"`)
Supported VLMs include Qwen2.5-VL, Qwen3-VL, GLM-4.1V, and MiniCPM-o.
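As a sketch, the four options above can be collected into a single model section. The key names mirror the options named on this page, but the exact schema varies by verl release, so treat this as illustrative (verl configs are Hydra/OmegaConf):

```python
# Illustrative only -- key names follow the options above, not a guaranteed schema.
from omegaconf import OmegaConf

model_cfg = OmegaConf.create({
    "path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "freeze_vision_tower": True,      # vision encoder receives no gradient updates
    "use_remove_padding": True,       # drop padding tokens before the forward pass
    "use_fused_kernels": True,        # fused attention kernels for multimodal inputs
    "exclude_modules": ".*visual.*",  # keep LoRA adapters off the vision tower
})
print(OmegaConf.to_yaml(model_cfg))
```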
Usage
Use VLM model configuration when setting up RL training for vision-language models. Key decisions:
- Freeze the vision tower (`freeze_vision_tower=True`) to preserve visual capabilities
- Enable padding removal (`use_remove_padding=True`) for efficiency
- Exclude vision modules from LoRA (`exclude_modules=".*visual.*"`)
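At the PyTorch level, freezing the vision tower amounts to disabling gradients on every parameter under the vision submodule. A minimal sketch, assuming a Qwen2.5-VL-style model whose vision encoder lives under a module matching `.*visual.*` (the same regex used for LoRA exclusion above); this illustrates the effect, and is not verl's implementation:

```python
import re
import torch.nn as nn

def freeze_vision_tower(model: nn.Module, pattern: str = ".*visual.*") -> int:
    """Disable gradients on all parameters whose names match `pattern`.

    Illustrative helper, not verl's code. Returns the number of frozen
    parameters so the effect can be sanity-checked.
    """
    regex = re.compile(pattern)
    frozen = 0
    for name, param in model.named_parameters():
        if regex.match(name):
            param.requires_grad = False
            frozen += 1
    return frozen
```

After this, only language-model parameters remain trainable, which is what preserves visual understanding during RL updates.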
Theoretical Basis
VLM model configuration adds vision-specific settings to standard LLM configuration:
```python
# Abstract VLM configuration decisions
model_config = HFModelConfig(
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    use_remove_padding=True,       # critical for VLM efficiency
    use_fused_kernels=True,        # fused multimodal attention
    exclude_modules=".*visual.*",  # no LoRA on the vision tower
)

actor_config = ActorConfig(
    freeze_vision_tower=True,      # freeze the vision encoder
)
```