# Implementation: Volcengine Verl VLM Model Config
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, model and actor configuration modules |
| Domains | Vision-Language Models, Model Configuration, VLM Training |
| Last Updated | 2026-02-07 |
## Overview

### Description
VLM (Vision-Language Model) training in verl is configured through the standard `HFModelConfig` dataclass, extended with VLM-specific fields spread across the model and actor configuration layers. These fields control how the visual encoder is handled during RL training, how padding is managed for variable-length multi-modal inputs, and which model modules are excluded from LoRA adaptation.
The key VLM-specific configuration fields are:
- `freeze_vision_tower` (`ActorConfig`) -- When `True`, the visual encoder (e.g., ViT) parameters are frozen during RL training. Only the language model layers receive gradient updates. This is standard practice, since the vision encoder is typically pre-trained and does not need further tuning during RLHF.
- `use_remove_padding` (`HFModelConfig` and `FSDPActorConfig`) -- When `True`, padding tokens are removed from inputs before the forward pass, which is critical for VLMs where image token sequences vary in length across samples.
- `use_fused_kernels` (`HFModelConfig` and `ActorConfig`) -- When `True`, enables custom fused kernels (e.g., FlashAttention, fused MLP) for memory-efficient computation with multi-modal inputs.
- `exclude_modules` (`HFModelConfig`) -- A regex pattern such as `".*visual.*"` used to exclude visual encoder modules from LoRA adaptation, ensuring only language model layers are adapted.
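To illustrate how `exclude_modules` combines with LoRA target selection, here is a minimal sketch. It is not verl's or PEFT's actual implementation; the function name and the "all candidates" handling of `all-linear` are illustrative assumptions.

```python
import re


def select_lora_modules(module_names, exclude_modules=None):
    """Pick module names for LoRA adaptation, dropping any that match the
    exclusion regex (e.g. ".*visual.*" keeps the vision tower un-adapted).

    Illustrative sketch only -- verl/PEFT resolve targets differently.
    """
    candidates = list(module_names)
    if exclude_modules is None:
        return candidates
    pattern = re.compile(exclude_modules)
    # Keep only names the exclusion pattern does not match.
    return [name for name in candidates if not pattern.match(name)]
```

With `exclude_modules=".*visual.*"`, a ViT submodule such as `model.visual.blocks.0.attn.qkv` is filtered out, while language-model projections remain LoRA targets.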
### Usage
Configure these fields in the YAML trainer config file under the `actor_rollout_ref.model` and `actor_rollout_ref.actor` sections. The configuration is validated at initialization time.
## Code Reference
| Field | Value |
|---|---|
| HFModelConfig Source | `verl/workers/config/model.py`, Lines 72-209 (class `HFModelConfig`) |
| ActorConfig Source | `verl/workers/config/actor.py`, Lines 84-167 (class `ActorConfig`) |
| FSDPActorConfig Source | `verl/workers/config/actor.py`, Lines 264-309 (class `FSDPActorConfig`) |
| Import | `from verl.workers.config.model import HFModelConfig` |
## I/O Contract

### Inputs (Configuration Fields)
| Field | Type | Default | Location | Description |
|---|---|---|---|---|
| `freeze_vision_tower` | `bool` | `False` | `ActorConfig` | Freeze the visual encoder during training. Set to `True` for VLM RLHF. |
| `use_remove_padding` | `bool` | `True` | `HFModelConfig` | Remove padding from inputs for variable-length sequences. Essential for VLM batch processing. |
| `use_remove_padding` | `bool` | `False` | `FSDPActorConfig` | Remove padding during FSDP actor training. Must be `True` when using sequence parallelism. |
| `use_fused_kernels` | `bool` | `False` | `HFModelConfig` | Enable fused kernels for memory-efficient multi-modal forward pass. |
| `use_fused_kernels` | `bool` | `False` | `ActorConfig` | Enable fused kernels during actor training step. |
| `exclude_modules` | `Optional[str]` | `None` | `HFModelConfig` | Regex pattern for modules to exclude from LoRA (e.g., `".*visual.*"`). |
| `lora_rank` | `int` | `0` | `HFModelConfig` | LoRA rank. Set to > 0 to enable LoRA adaptation. |
| `target_modules` | `Optional[str]` | `"all-linear"` | `HFModelConfig` | Target modules for LoRA. Combined with `exclude_modules` to skip visual layers. |
| `enable_gradient_checkpointing` | `bool` | `True` | `HFModelConfig` | Gradient checkpointing for memory savings with large VLMs. |
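The `FSDPActorConfig` constraint in the table (`use_remove_padding` must be `True` under sequence parallelism) can be sketched as a validation rule. The dataclass below is an illustrative stand-in, not verl's class; the field name `ulysses_sequence_parallel_size` and the `validate` method are assumptions for the sketch.

```python
from dataclasses import dataclass


@dataclass
class FSDPActorConfigSketch:
    """Illustrative stand-in for verl's FSDPActorConfig (not the real class)."""
    use_remove_padding: bool = False
    ulysses_sequence_parallel_size: int = 1  # assumed field name

    def validate(self):
        # Sequence parallelism shards packed token sequences across GPUs,
        # which only works once padding has been stripped from the batch.
        if self.ulysses_sequence_parallel_size > 1 and not self.use_remove_padding:
            raise ValueError(
                "use_remove_padding must be True when sequence parallelism is enabled"
            )
```

A config with `ulysses_sequence_parallel_size=2` and `use_remove_padding=False` would fail this check at initialization time.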
### Outputs
| Artifact | Description |
|---|---|
| `hf_config` | Loaded `AutoConfig` object with override settings applied, used to instantiate the model. |
| `tokenizer` | Loaded tokenizer instance for text processing. |
| `processor` | Loaded processor instance (e.g., `Qwen2VLProcessor`) for multi-modal input processing. |
## Usage Examples
YAML configuration for VLM RLHF training:
```yaml
# In a Hydra/OmegaConf YAML config file:
actor_rollout_ref:
  model:
    path: "Qwen/Qwen2-VL-7B-Instruct"
    use_remove_padding: true
    use_fused_kernels: true
    enable_gradient_checkpointing: true
    exclude_modules: ".*visual.*"
    lora_rank: 16
    lora_alpha: 32
    target_modules: "all-linear"
  actor:
    strategy: fsdp
    freeze_vision_tower: true
    use_fused_kernels: true
    use_remove_padding: true
```
Programmatic configuration:
```python
from verl.workers.config.model import HFModelConfig

# Initialize model config for a VLM
model_config = HFModelConfig(
    path="Qwen/Qwen2-VL-7B-Instruct",
    use_remove_padding=True,
    use_fused_kernels=True,
    enable_gradient_checkpointing=True,
    exclude_modules=".*visual.*",
    lora_rank=16,
    lora_alpha=32,
    target_modules="all-linear",
    trust_remote_code=True,
)

# The processor is auto-loaded and supports multi-modal inputs
processor = model_config.get_processor()
print(f"Processor type: {type(processor).__name__}")
print(f"Has image processor: {hasattr(processor, 'image_processor')}")
```
Checking VLM-specific config at runtime:
```python
# In actor worker initialization (snippet: `actor_config`, `model_config`,
# and `model` are assumed to already be in scope)
if actor_config.freeze_vision_tower:
    # Freeze all parameters matching visual encoder patterns
    for name, param in model.named_parameters():
        if "visual" in name or "vision" in name:
            param.requires_grad = False

if model_config.exclude_modules:
    # LoRA will skip modules matching this pattern
    print(f"Excluding modules matching: {model_config.exclude_modules}")
```
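After freezing, it is worth verifying that only language-model parameters remain trainable. A dependency-free sketch of that sanity check follows; the `Param` class is an illustrative stand-in for `torch.nn.Parameter`, and the helper names are assumptions, not verl APIs.

```python
class Param:
    """Minimal stand-in for torch.nn.Parameter (illustrative only)."""
    def __init__(self, numel, requires_grad=True):
        self._numel = numel
        self.requires_grad = requires_grad

    def numel(self):
        return self._numel


def freeze_vision_tower(named_params):
    # Same name-matching rule as the runtime check above.
    for name, param in named_params:
        if "visual" in name or "vision" in name:
            param.requires_grad = False


def count_trainable(named_params):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for _, p in named_params if p.requires_grad)
    total = sum(p.numel() for _, p in named_params)
    return trainable, total
```

With a real model, the same counts come from iterating `model.named_parameters()`; after freezing, the trainable count should drop by the size of the vision tower.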