Heuristic: Volcengine Verl GPU Memory Utilization Tuning
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Infrastructure
- Last Updated: 2026-02-07 17:00 GMT
Overview
Tuning the gpu_memory_utilization parameter for vLLM/SGLang rollout to balance KV cache allocation against training memory needs.
Description
The gpu_memory_utilization parameter controls what fraction of GPU memory is allocated to the vLLM/SGLang KV cache during rollout. This is critical because GPU memory is shared between training (FSDP/Megatron weights, gradients, optimizer states) and inference (KV cache). Setting it too high causes OOM during training; too low reduces rollout throughput.
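The split described above can be made concrete with a small sketch. The 80 GB total is an illustrative assumption (an A100/H100-class GPU), not a value taken from verl:

```python
# Illustrative memory-budget arithmetic for a single 80 GB GPU.
# The total and the helper names are assumptions for illustration,
# not measured values or verl APIs.
TOTAL_GB = 80.0

def rollout_reservation_gb(gpu_memory_utilization: float,
                           total_gb: float = TOTAL_GB) -> float:
    """Memory the inference engine claims (model weights + KV cache)."""
    return gpu_memory_utilization * total_gb

def training_headroom_gb(gpu_memory_utilization: float,
                         total_gb: float = TOTAL_GB) -> float:
    """What remains for FSDP/Megatron weights, gradients, optimizer states."""
    return total_gb - rollout_reservation_gb(gpu_memory_utilization, total_gb)

for u in (0.4, 0.5, 0.7, 0.9):
    print(f"utilization={u:.1f} -> rollout {rollout_reservation_gb(u):.0f} GB, "
          f"training headroom {training_headroom_gb(u):.0f} GB")
```

At 0.9 utilization only ~8 GB is left for training state on this hypothetical GPU, which is why high values are reserved for inference-only workloads.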
Usage
Use this when configuring the rollout engine's memory allocation, especially when hitting OOM errors during the training-rollout transition or when rollout throughput is low.
The Insight (Rule of Thumb)
- Action: Set rollout.gpu_memory_utilization in the training config.
- Value: Default is 0.5 (50%). Use 0.6-0.8 for most models. Use 0.4-0.5 for large models (30B+). Use 0.9-0.95 for inference-only VLA tasks.
- Trade-off: Higher values give more KV cache (faster rollout) but risk OOM during training. Lower values are safer but reduce rollout throughput.
- Free cache engine: Set free_cache_engine: True (the default) to release KV cache memory when transitioning to training.
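The value rules above can be encoded as a small starting-point picker. The function name and the exact thresholds are hypothetical, chosen only to mirror the ranges listed; they are not part of verl's API:

```python
# Hypothetical helper encoding the rule of thumb above; the name and
# thresholds are illustrative assumptions, not verl code.
def suggest_gpu_memory_utilization(model_params_b: float,
                                   inference_only: bool = False) -> float:
    """Suggest a starting gpu_memory_utilization for a model of the
    given size (in billions of parameters)."""
    if inference_only:        # e.g. inference-only VLA tasks: 0.9-0.95
        return 0.9
    if model_params_b >= 30:  # large models need more training headroom
        return 0.45
    return 0.7                # most models: the 0.6-0.8 range

print(suggest_gpu_memory_utilization(7))                        # most models
print(suggest_gpu_memory_utilization(70))                       # 30B+ model
print(suggest_gpu_memory_utilization(7, inference_only=True))   # VLA-style
```

Treat the returned value as a starting point and adjust empirically based on observed OOMs or idle memory.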
Reasoning
During RL training, GPU memory alternates between training (forward+backward) and rollout (KV cache for generation). The free_cache_engine flag releases the KV cache during training, but the vLLM/SGLang engine still reserves memory for model weights. Training scripts in the repo show empirical values: 0.5 for conservative, 0.7-0.8 for optimized, 0.95 for VLA tasks.
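To see why the KV cache dominates rollout memory, a back-of-envelope estimate using the standard dense-transformer formula helps. The 7B-style shape numbers below are assumptions for illustration, not values from any verl config:

```python
# Back-of-envelope KV-cache sizing using the standard transformer formula:
# 2 caches (K and V) x layers x KV heads x head_dim x seq_len x batch x bytes.
# The model shape below is an illustrative assumption.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, dtype_bytes: int = 2) -> float:
    """Size of the K and V caches for a dense transformer, in GB (fp16/bf16 default)."""
    bytes_total = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * dtype_bytes)
    return bytes_total / 1024**3

# A 7B-like model (32 layers, 32 KV heads, head_dim 128) at 4k context, batch 8:
print(f"{kv_cache_gb(32, 32, 128, 4096, 8):.1f} GB")  # -> 16.0 GB
```

Even a modest batch at 4k context consumes tens of GB, which is why freeing the cache between rollout and training (free_cache_engine) matters.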
Code Evidence
From trainer/config/rollout/rollout.yaml:
gpu_memory_utilization: 0.5 # Fraction of GPU memory used by vLLM/SGLang for KV cache
free_cache_engine: True # Release KV cache memory during training