Workflow:Volcengine Verl Vision Language Model RL Training

Knowledge Sources	verl verl Documentation Multi-Modal Example
Domains	LLMs, Reinforcement_Learning, Vision_Language_Models, Multimodal
Last Updated	2026-02-07 18:00 GMT

Overview

End-to-end process for training vision-language models (VLMs) using reinforcement learning with GRPO in verl, enabling multimodal reasoning over image and text inputs for tasks like geometry problem solving and visual question answering.

Description

This workflow extends the standard GRPO training pipeline to handle vision-language models that process both image and text inputs. Using models like Qwen2.5-VL or Qwen3-VL, the pipeline handles multimodal data loading, image preprocessing, and RL training with VLM-specific optimizations. The workflow supports freezing the vision encoder (training only the language model), LoRA for parameter-efficient training, and sequence balancing for variable-length multimodal inputs. Both FSDP and Megatron-LM backends are supported, with vLLM or SGLang handling multimodal rollout generation.

Usage

Execute this workflow when you want to improve a vision-language model's performance on tasks requiring visual reasoning through reinforcement learning. This is appropriate for geometry problem solving (Geo3K), visual question answering, image captioning with quality optimization, or any task where a VLM receives image+text input and produces text output that can be evaluated with a reward function.

Execution Steps

Step 1: Environment Setup for VLM

Install verl with VLM-compatible inference engine. Ensure the rollout engine supports multimodal inputs (vLLM with vision model support or SGLang with multimodal capabilities). Additional dependencies for image processing may be required.

Key considerations:

vLLM must be configured with VLM-specific flags (disable_mm_preprocessor_cache, enable_chunked_prefill handling)
SGLang provides native multimodal support with the Megatron backend
Image processing libraries (Pillow, transformers vision processors) must be available
GPU memory requirements are higher due to vision encoder overhead

Step 2: Multimodal Data Preparation

Prepare datasets containing both image references and text prompts in verl's parquet format. The image data is stored as a separate column (typically referenced by path or embedded), and prompts include multimodal content markers that indicate where images should be inserted.

Key considerations:

The image_key configuration parameter specifies which parquet column contains image data
Images can be stored as file paths, base64 encoded, or as binary data in the parquet
Prompts use the VLM's native image placeholder format (e.g., Qwen-VL uses vision tokens)
Dataset examples include Geo3K (geometry with diagrams) and Pokemon (image captioning)

Step 3: VLM Model Configuration

Configure the vision-language model with RL-specific settings. This includes deciding whether to freeze the vision encoder (training only language components), setting up LoRA for parameter-efficient training, and configuring fused kernels and remove-padding optimizations.

Key considerations:

Freezing the vision encoder reduces memory and prevents catastrophic forgetting of visual features
use_remove_padding=True optimizes computation for variable-length multimodal sequences
use_fused_kernels=True enables optimized attention and MLP operations
LoRA can target only language model layers while keeping vision components frozen
KL loss coefficient may need to be higher (0.01 vs 0.001) for VLM stability

Step 4: Multimodal Rollout Generation

Generate text responses conditioned on image+text inputs using the VLM through the rollout engine. The engine processes multimodal inputs, running the vision encoder on images and the language model for text generation.

Key considerations:

Multimodal preprocessing cache should be managed carefully for memory efficiency
Tensor parallel size affects both vision encoder and language model distribution
Response length may be longer for VLM tasks (2048 tokens vs typical 1024)
The rollout engine must handle mixed-modal batches with varying numbers of images

Step 5: Reward Computation for Visual Tasks

Evaluate generated responses against visual task criteria. For geometry problems, extract the answer and compare against ground truth. For visual QA, use domain-specific evaluation metrics. Both rule-based and model-based reward functions are supported.

Key considerations:

Geometry tasks use rule-based rewards with exact answer matching
Visual captioning may use model-based rewards or reference-based metrics
The reward function may need access to the original image for evaluation
Multi-image inputs require careful reward attribution across images

Step 6: GRPO Policy Update with VLM Optimizations

Compute advantages and update the VLM policy using GRPO. VLM-specific optimizations include sequence balancing across variable-length multimodal inputs and selective gradient computation (skipping vision encoder if frozen).

Key considerations:

Sequence balancing is important due to the high variance in multimodal sequence lengths
Frozen vision encoder gradients are not computed, saving memory
Batch sizes may need to be smaller than text-only training due to image memory overhead
The group size for GRPO can be smaller for VLM tasks (n=5 vs n=16)

Step 7: Evaluation and Export

Evaluate the VLM on multimodal test sets and export the trained model. The evaluation requires running the full multimodal inference pipeline with image processing.

Key considerations:

VLM evaluation needs the complete multimodal pipeline (images + text)
LoRA adapters can be merged for deployment
Megatron checkpoints can be converted to HuggingFace format for easy sharing
Track both overall accuracy and per-image-type performance

Execution Diagram

GitHub URL

Workflow Repository