Implementation:Volcengine Verl VLM Model Config

From Leeroopedia


Field Value
Knowledge Sources verl source code, model and actor configuration modules
Domains Vision-Language Models, Model Configuration, VLM Training
Last Updated 2026-02-07

Overview

Description

VLM (Vision-Language Model) training in verl is configured through the standard HFModelConfig dataclass, extended with VLM-specific fields that span the model and actor configuration layers. These fields control how the visual encoder is handled during RL training, how padding is managed for variable-length multi-modal inputs, and which model modules are excluded from LoRA adaptation.

The key VLM-specific configuration fields are:

  • freeze_vision_tower (ActorConfig) -- When True, the visual encoder (e.g., ViT) parameters are frozen during RL training. Only the language model layers receive gradient updates. This is the standard practice since the vision encoder is typically pre-trained and does not need further tuning during RLHF.
  • use_remove_padding (HFModelConfig and FSDPActorConfig) -- When True, padding tokens are removed from inputs before the forward pass, which is critical for VLMs where image token sequences vary in length across samples.
  • use_fused_kernels (HFModelConfig and ActorConfig) -- When True, enables custom fused kernels (e.g., FlashAttention, fused MLP) for memory-efficient computation with multi-modal inputs.
  • exclude_modules (HFModelConfig) -- A regex pattern such as ".*visual.*" used to exclude visual encoder modules from LoRA adaptation, ensuring only language model layers are adapted.
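The interaction between target_modules and exclude_modules can be illustrated with a minimal sketch. The helper below is hypothetical (it does not import verl) and simply shows how an exclude_modules regex such as ".*visual.*" would remove visual-encoder modules from the LoRA target set; the module names are illustrative Qwen2-VL-style names:

```python
import re

def select_lora_modules(module_names, exclude_pattern=None):
    """Return module names eligible for LoRA adaptation.

    Simplified stand-in for how an exclude_modules regex filters
    visual-encoder modules out of the LoRA target set.
    """
    if exclude_pattern is None:
        return list(module_names)
    pattern = re.compile(exclude_pattern)
    return [name for name in module_names if not pattern.match(name)]

# Hypothetical module names from a Qwen2-VL-style model
modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate_proj",
    "visual.blocks.0.attn.qkv",
    "visual.merger.mlp.0",
]

# Only the language-model projections survive the exclusion filter
print(select_lora_modules(modules, exclude_pattern=".*visual.*"))
```

With this filter in place, LoRA adapters are attached only to language-model layers, which matches the intent of freezing or skipping the vision tower during RLHF.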

Usage

Configure these fields in the YAML trainer config file under the actor_rollout_ref.model and actor_rollout_ref.actor sections. The configuration is validated at initialization time.

Code Reference

Field Value
HFModelConfig Source verl/workers/config/model.py, Lines 72-209 (class HFModelConfig)
ActorConfig Source verl/workers/config/actor.py, Lines 84-167 (class ActorConfig)
FSDPActorConfig Source verl/workers/config/actor.py, Lines 264-309 (class FSDPActorConfig)
Import from verl.workers.config.model import HFModelConfig

I/O Contract

Inputs (Configuration Fields)

Field Type Default Location Description
freeze_vision_tower bool False ActorConfig Freeze the visual encoder during training. Set to True for VLM RLHF.
use_remove_padding bool True HFModelConfig Remove padding from inputs for variable-length sequences. Essential for VLM batch processing.
use_remove_padding bool False FSDPActorConfig Remove padding during FSDP actor training. Must be True when using sequence parallelism.
use_fused_kernels bool False HFModelConfig Enable fused kernels for memory-efficient multi-modal forward pass.
use_fused_kernels bool False ActorConfig Enable fused kernels during actor training step.
exclude_modules Optional[str] None HFModelConfig Regex pattern for modules to exclude from LoRA (e.g., ".*visual.*").
lora_rank int 0 HFModelConfig LoRA rank. Set to >0 to enable LoRA adaptation.
target_modules Optional[str] "all-linear" HFModelConfig Target modules for LoRA. Combined with exclude_modules to skip visual layers.
enable_gradient_checkpointing bool True HFModelConfig Gradient checkpointing for memory savings with large VLMs.
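The constraint in the table that use_remove_padding must be True under sequence parallelism can be sketched as a small validation check. The dataclass below is a simplified, hypothetical mirror of the relevant FSDPActorConfig fields (it does not import verl; the field name ulysses_sequence_parallel_size follows verl's naming, but the validation logic here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class FSDPActorSettings:
    # Hypothetical simplified mirror of the relevant FSDPActorConfig fields
    use_remove_padding: bool = False
    ulysses_sequence_parallel_size: int = 1

def validate(cfg: FSDPActorSettings) -> None:
    # Sequence parallelism shards along the token dimension, which only
    # works on packed (padding-free) sequences.
    if cfg.ulysses_sequence_parallel_size > 1 and not cfg.use_remove_padding:
        raise ValueError(
            "use_remove_padding must be True when "
            f"ulysses_sequence_parallel_size="
            f"{cfg.ulysses_sequence_parallel_size} > 1"
        )

# Valid: padding removal enabled alongside sequence parallelism
validate(FSDPActorSettings(use_remove_padding=True,
                           ulysses_sequence_parallel_size=2))
```

Checks of this kind run at initialization time, so a misconfigured combination fails fast rather than producing shape errors mid-training.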

Outputs

Artifact Description
hf_config Loaded AutoConfig object with override settings applied, used to instantiate the model.
tokenizer Loaded tokenizer instance for text processing.
processor Loaded processor instance (e.g., Qwen2VLProcessor) for multi-modal input processing.

Usage Examples

YAML configuration for VLM RLHF training:

# Hydra/OmegaConf YAML config (actor_rollout_ref section):
actor_rollout_ref:
  model:
    path: "Qwen/Qwen2-VL-7B-Instruct"
    use_remove_padding: true
    use_fused_kernels: true
    enable_gradient_checkpointing: true
    exclude_modules: ".*visual.*"
    lora_rank: 16
    lora_alpha: 32
    target_modules: "all-linear"
  actor:
    strategy: fsdp
    freeze_vision_tower: true
    use_fused_kernels: true
    use_remove_padding: true

Programmatic configuration:

from verl.workers.config.model import HFModelConfig

# Initialize model config for a VLM
model_config = HFModelConfig(
    path="Qwen/Qwen2-VL-7B-Instruct",
    use_remove_padding=True,
    use_fused_kernels=True,
    enable_gradient_checkpointing=True,
    exclude_modules=".*visual.*",
    lora_rank=16,
    lora_alpha=32,
    target_modules="all-linear",
    trust_remote_code=True,
)

# The processor is auto-loaded and supports multi-modal inputs
processor = model_config.get_processor()
print(f"Processor type: {type(processor).__name__}")
print(f"Has image processor: {hasattr(processor, 'image_processor')}")

Checking VLM-specific config at runtime:

# In actor worker initialization
if actor_config.freeze_vision_tower:
    # Freeze all parameters matching visual encoder patterns
    for name, param in model.named_parameters():
        if "visual" in name or "vision" in name:
            param.requires_grad = False

if model_config.exclude_modules:
    # LoRA will skip modules matching this pattern
    print(f"Excluding modules matching: {model_config.exclude_modules}")
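The effect of the freeze loop above can be quantified without loading a real model. The sketch below uses (name, parameter count, requires_grad) triples as a lightweight stand-in for model.named_parameters(); the names and sizes are hypothetical, and the freezing rule matches the substring check shown above:

```python
# Lightweight stand-in for model.named_parameters(): (name, num_elements,
# requires_grad) triples for a hypothetical VLM.
params = [
    ("visual.blocks.0.attn.qkv.weight", 1_000_000, True),
    ("visual.merger.mlp.0.weight", 500_000, True),
    ("model.layers.0.self_attn.q_proj.weight", 2_000_000, True),
    ("model.layers.0.mlp.gate_proj.weight", 3_000_000, True),
]

freeze_vision_tower = True
if freeze_vision_tower:
    # Same matching rule as the freeze loop: any "visual"/"vision" name
    params = [
        (name, n, False if ("visual" in name or "vision" in name) else grad)
        for name, n, grad in params
    ]

trainable = sum(n for _, n, grad in params if grad)
total = sum(n for _, n, _ in params)
print(f"Trainable: {trainable}/{total} ({trainable / total:.1%})")
```

Logging this ratio at startup is a cheap sanity check that freeze_vision_tower actually took effect before any optimizer state is allocated.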
