Principle:Volcengine Verl VLM Model Configuration

From Leeroopedia


Knowledge Sources
Domains: Vision_Language_Models, Model_Architecture, Configuration
Last Updated: 2026-02-07 14:00 GMT

Overview

How to configure vision-language models (VLMs) for reinforcement learning (RL) training: freezing the vision encoder, handling multimodal inputs, and enabling VLM-specific compute optimizations.

Description

VLM Model Configuration extends standard model configuration with settings specific to vision-language models. Key additions:

  • Vision encoder freezing: an option to freeze the vision tower during RL training so that only the language model receives gradient updates, which preserves the pretrained visual understanding
  • Padding removal: a critical optimization for VLM efficiency that strips padding tokens so that only meaningful tokens are processed
  • Fused kernels: VLM-specific fused attention kernels for multimodal inputs
  • Module exclusion: the ability to exclude vision modules from LoRA adaptation (e.g., exclude_modules=".*visual.*")

Supported VLMs include Qwen2.5-VL, Qwen3-VL, GLM-4.1V, and MiniCPM-o.
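To make the padding-removal idea concrete, here is a minimal pure-Python sketch of packing a padded batch into a flat token list plus cumulative sequence lengths. It illustrates the general technique only and is not verl's implementation; the helper name remove_padding, the PAD value, and the toy batch are assumptions made for this example.

```python
from itertools import accumulate


def remove_padding(input_ids, attention_mask):
    """Pack a padded batch into one flat token list.

    Returns the packed tokens and the cumulative sequence lengths that
    mark where each original sequence starts and ends in the packed list.
    """
    packed = []
    seq_lens = []
    for ids, mask in zip(input_ids, attention_mask):
        # Keep only the positions the mask marks as real tokens.
        kept = [tok for tok, m in zip(ids, mask) if m == 1]
        packed.extend(kept)
        seq_lens.append(len(kept))
    cu_seqlens = [0] + list(accumulate(seq_lens))
    return packed, cu_seqlens


PAD = 0  # hypothetical padding token id
batch_ids = [
    [11, 12, 13, PAD, PAD],
    [21, 22, 23, 24, 25],
    [31, PAD, PAD, PAD, PAD],
]
batch_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
]

packed, cu_seqlens = remove_padding(batch_ids, batch_mask)
total_slots = sum(len(row) for row in batch_ids)
print(packed)
print(cu_seqlens)
print(f"{len(packed)} of {total_slots} token slots carry real tokens")
```

In this toy batch only 9 of the 15 slots hold real tokens, so a model that consumes the packed form skips the remaining 6 padded positions. The saving grows when sequence lengths in a batch vary widely, which is common when samples contain different numbers of image tokens.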

Usage

Use VLM model configuration when setting up RL training for vision-language models. Key decisions:

  • Freeze vision tower (freeze_vision_tower=True) to preserve visual capabilities
  • Enable padding removal (use_remove_padding=True) for efficiency
  • Exclude vision modules from LoRA (exclude_modules=".*visual.*")
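The effect of the exclusion pattern can be checked before training by matching it against module names. The sketch below assumes the pattern is applied as a full-match regular expression against each dotted module name; the module names are hypothetical examples, not taken from a real checkpoint, and split_by_exclusion is an illustrative helper rather than a library function.

```python
import re

EXCLUDE_PATTERN = ".*visual.*"  # pattern from the configuration above

# Hypothetical module names; real names depend on the model implementation.
module_names = [
    "model.visual.blocks.0.attn.qkv",
    "model.visual.merger.mlp.0",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "lm_head",
]


def split_by_exclusion(names, pattern):
    """Split names into (kept, excluded) using a full-match regex."""
    compiled = re.compile(pattern)
    kept = [n for n in names if not compiled.fullmatch(n)]
    excluded = [n for n in names if compiled.fullmatch(n)]
    return kept, excluded


kept, excluded = split_by_exclusion(module_names, EXCLUDE_PATTERN)
print("LoRA candidates:", kept)
print("Excluded from LoRA:", excluded)
```

Running a check like this against the actual module names of the chosen model is a cheap way to confirm that the pattern excludes the whole vision tower and nothing from the language model.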

Theoretical Basis

VLM model configuration adds vision-specific settings to standard LLM configuration:

# Abstract VLM configuration decisions (a sketch, not a complete config; other fields are omitted)
model_config = HFModelConfig(
    path="Qwen/Qwen2.5-VL-7B-Instruct",
    use_remove_padding=True,   # Critical for VLM efficiency
    use_fused_kernels=True,    # Fused multimodal attention
    exclude_modules=".*visual.*",  # No LoRA on vision tower
)
actor_config = ActorConfig(
    freeze_vision_tower=True,  # Freeze vision encoder
)
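What freezing the vision tower means for the parameter set can be sketched without any framework. The example below uses a stand-in Param class instead of real tensors; in PyTorch the equivalent flag is requires_grad on each parameter. The "visual." prefix, the parameter names, and the sizes are assumptions for illustration, and this is not how verl itself implements the option.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a model parameter (hypothetical, for illustration)."""
    name: str
    numel: int
    requires_grad: bool = True


def freeze_vision_tower(params, vision_prefix="visual."):
    """Disable gradients for every parameter under the vision prefix.

    Returns the number of parameters that were frozen.
    """
    frozen = 0
    for p in params:
        if p.name.startswith(vision_prefix):
            p.requires_grad = False
            frozen += 1
    return frozen


params = [
    Param("visual.patch_embed.proj.weight", 1_000),
    Param("visual.blocks.0.attn.qkv.weight", 4_000),
    Param("language_model.layers.0.self_attn.q_proj.weight", 16_000),
    Param("lm_head.weight", 8_000),
]

num_frozen = freeze_vision_tower(params)
trainable = sum(p.numel for p in params if p.requires_grad)
total = sum(p.numel for p in params)
print(f"froze {num_frozen} parameters; {trainable} of {total} elements remain trainable")
```

After freezing, only the language-model parameters receive gradient updates, which matches the intent stated above: the policy is optimized while the visual representation stays fixed.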

Related Pages

Implemented By
