# Implementation: Volcengine Verl VLM Model Config
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, model and actor configuration modules |
| Domains | Vision-Language Models, Model Configuration, VLM Training |
| Last Updated | 2026-02-07 |
## Overview

### Description
VLM (Vision-Language Model) training in verl is configured through the standard `HFModelConfig` dataclass, extended with VLM-specific fields spread across the model and actor configuration layers. These fields control how the visual encoder is handled during RL training, how padding is managed for variable-length multi-modal inputs, and which model modules are excluded from LoRA adaptation.
The key VLM-specific configuration fields are:
- `freeze_vision_tower` (`ActorConfig`) -- When `True`, the visual encoder (e.g., ViT) parameters are frozen during RL training. Only the language model layers receive gradient updates. This is standard practice, since the vision encoder is typically pre-trained and does not need further tuning during RLHF.
- `use_remove_padding` (`HFModelConfig` and `FSDPActorConfig`) -- When `True`, padding tokens are removed from inputs before the forward pass, which is critical for VLMs where image token sequences vary in length across samples.
- `use_fused_kernels` (`HFModelConfig` and `ActorConfig`) -- When `True`, enables custom fused kernels (e.g., FlashAttention, fused MLP) for memory-efficient computation with multi-modal inputs.
- `exclude_modules` (`HFModelConfig`) -- A regex pattern such as `".*visual.*"` used to exclude visual encoder modules from LoRA adaptation, ensuring only language model layers are adapted.
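To illustrate how `exclude_modules` combines with LoRA target selection, here is a minimal sketch. It is not verl's or PEFT's actual implementation; the function name and the "all candidates" handling of `all-linear` are illustrative assumptions.

```python
import re


def select_lora_modules(module_names, exclude_modules=None):
    """Pick module names for LoRA adaptation, dropping any that match the
    exclusion regex (e.g. ".*visual.*" keeps the vision tower un-adapted).

    Illustrative sketch only -- verl/PEFT resolve targets differently.
    """
    candidates = list(module_names)
    if exclude_modules is None:
        return candidates
    pattern = re.compile(exclude_modules)
    # Keep only names the exclusion pattern does not match.
    return [name for name in candidates if not pattern.match(name)]
```

With `exclude_modules=".*visual.*"`, a ViT submodule such as `model.visual.blocks.0.attn.qkv` is filtered out, while language-model projections remain LoRA targets.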
### Usage
Configure these fields in the YAML trainer config file under the `actor_rollout_ref.model` and `actor_rollout_ref.actor` sections. The configuration is validated at initialization time.
## Code Reference
| Field | Value |
|---|---|
| HFModelConfig Source | `verl/workers/config/model.py`, Lines 72-209 (class `HFModelConfig`) |
| ActorConfig Source | `verl/workers/config/actor.py`, Lines 84-167 (class `ActorConfig`) |
| FSDPActorConfig Source | `verl/workers/config/actor.py`, Lines 264-309 (class `FSDPActorConfig`) |
| Import | `from verl.workers.config.model import HFModelConfig` |
## I/O Contract

### Inputs (Configuration Fields)
| Field | Type | Default | Location | Description |
|---|---|---|---|---|
| `freeze_vision_tower` | `bool` | `False` | `ActorConfig` | Freeze the visual encoder during training. Set to `True` for VLM RLHF. |
| `use_remove_padding` | `bool` | `True` | `HFModelConfig` | Remove padding from inputs for variable-length sequences. Essential for VLM batch processing. |
| `use_remove_padding` | `bool` | `False` | `FSDPActorConfig` | Remove padding during FSDP actor training. Must be `True` when using sequence parallelism. |
| `use_fused_kernels` | `bool` | `False` | `HFModelConfig` | Enable fused kernels for memory-efficient multi-modal forward pass. |
| `use_fused_kernels` | `bool` | `False` | `ActorConfig` | Enable fused kernels during actor training step. |
| `exclude_modules` | `Optional[str]` | `None` | `HFModelConfig` | Regex pattern for modules to exclude from LoRA (e.g., `".*visual.*"`). |
| `lora_rank` | `int` | `0` | `HFModelConfig` | LoRA rank. Set to > 0 to enable LoRA adaptation. |
| `target_modules` | `Optional[str]` | `"all-linear"` | `HFModelConfig` | Target modules for LoRA. Combined with `exclude_modules` to skip visual layers. |
| `enable_gradient_checkpointing` | `bool` | `True` | `HFModelConfig` | Gradient checkpointing for memory savings with large VLMs. |
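The `FSDPActorConfig` constraint in the table (`use_remove_padding` must be `True` under sequence parallelism) can be sketched as a validation rule. The dataclass below is an illustrative stand-in, not verl's class; the field name `ulysses_sequence_parallel_size` and the `validate` method are assumptions for the sketch.

```python
from dataclasses import dataclass


@dataclass
class FSDPActorConfigSketch:
    """Illustrative stand-in for verl's FSDPActorConfig (not the real class)."""
    use_remove_padding: bool = False
    ulysses_sequence_parallel_size: int = 1  # assumed field name

    def validate(self):
        # Sequence parallelism shards packed token sequences across GPUs,
        # which only works once padding has been stripped from the batch.
        if self.ulysses_sequence_parallel_size > 1 and not self.use_remove_padding:
            raise ValueError(
                "use_remove_padding must be True when sequence parallelism is enabled"
            )
```

A config with `ulysses_sequence_parallel_size=2` and `use_remove_padding=False` would fail this check at initialization time.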
### Outputs
| Artifact | Description |
|---|---|
| `hf_config` | Loaded `AutoConfig` object with override settings applied, used to instantiate the model. |
| `tokenizer` | Loaded tokenizer instance for text processing. |
| `processor` | Loaded processor instance (e.g., `Qwen2VLProcessor`) for multi-modal input processing. |
## Usage Examples
YAML configuration for VLM RLHF training:
```yaml
# In a Hydra/OmegaConf YAML config file:
actor_rollout_ref:
  model:
    path: "Qwen/Qwen2-VL-7B-Instruct"
    use_remove_padding: true
    use_fused_kernels: true
    enable_gradient_checkpointing: true
    exclude_modules: ".*visual.*"
    lora_rank: 16
    lora_alpha: 32
    target_modules: "all-linear"
  actor:
    strategy: fsdp
    freeze_vision_tower: true
    use_fused_kernels: true
    use_remove_padding: true
```
Programmatic configuration:
```python
from verl.workers.config.model import HFModelConfig

# Initialize model config for a VLM
model_config = HFModelConfig(
    path="Qwen/Qwen2-VL-7B-Instruct",
    use_remove_padding=True,
    use_fused_kernels=True,
    enable_gradient_checkpointing=True,
    exclude_modules=".*visual.*",
    lora_rank=16,
    lora_alpha=32,
    target_modules="all-linear",
    trust_remote_code=True,
)

# The processor is auto-loaded and supports multi-modal inputs
processor = model_config.get_processor()
print(f"Processor type: {type(processor).__name__}")
print(f"Has image processor: {hasattr(processor, 'image_processor')}")
```
Checking VLM-specific config at runtime:
```python
# In actor worker initialization (snippet: `actor_config`, `model_config`,
# and `model` are assumed to already be in scope)
if actor_config.freeze_vision_tower:
    # Freeze all parameters matching visual encoder patterns
    for name, param in model.named_parameters():
        if "visual" in name or "vision" in name:
            param.requires_grad = False

if model_config.exclude_modules:
    # LoRA will skip modules matching this pattern
    print(f"Excluding modules matching: {model_config.exclude_modules}")
```
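After freezing, it is worth verifying that only language-model parameters remain trainable. A dependency-free sketch of that sanity check follows; the `Param` class is an illustrative stand-in for `torch.nn.Parameter`, and the helper names are assumptions, not verl APIs.

```python
class Param:
    """Minimal stand-in for torch.nn.Parameter (illustrative only)."""
    def __init__(self, numel, requires_grad=True):
        self._numel = numel
        self.requires_grad = requires_grad

    def numel(self):
        return self._numel


def freeze_vision_tower(named_params):
    # Same name-matching rule as the runtime check above.
    for name, param in named_params:
        if "visual" in name or "vision" in name:
            param.requires_grad = False


def count_trainable(named_params):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for _, p in named_params if p.requires_grad)
    total = sum(p.numel() for _, p in named_params)
    return trainable, total
```

With a real model, the same counts come from iterating `model.named_parameters()`; after freezing, the trainable count should drop by the size of the vision tower.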