Heuristic: Volcengine Verl GPU Memory Utilization Tuning
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Infrastructure
- Last Updated: 2026-02-07 17:00 GMT
Overview
Tuning the gpu_memory_utilization parameter for vLLM/SGLang rollout to balance KV cache allocation against training memory needs.
Description
The gpu_memory_utilization parameter controls what fraction of GPU memory is allocated to the vLLM/SGLang KV cache during rollout. This is critical because GPU memory is shared between training (FSDP/Megatron weights, gradients, optimizer states) and inference (KV cache). Setting it too high causes OOM during training; too low reduces rollout throughput.
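The split described above can be made concrete with a small sketch. The 80 GB total is an illustrative assumption (an A100/H100-class GPU), not a value taken from verl:

```python
# Illustrative memory-budget arithmetic for a single 80 GB GPU.
# The total and the helper names are assumptions for illustration,
# not measured values or verl APIs.
TOTAL_GB = 80.0

def rollout_reservation_gb(gpu_memory_utilization: float,
                           total_gb: float = TOTAL_GB) -> float:
    """Memory the inference engine claims (model weights + KV cache)."""
    return gpu_memory_utilization * total_gb

def training_headroom_gb(gpu_memory_utilization: float,
                         total_gb: float = TOTAL_GB) -> float:
    """What remains for FSDP/Megatron weights, gradients, optimizer states."""
    return total_gb - rollout_reservation_gb(gpu_memory_utilization, total_gb)

for u in (0.4, 0.5, 0.7, 0.9):
    print(f"utilization={u:.1f} -> rollout {rollout_reservation_gb(u):.0f} GB, "
          f"training headroom {training_headroom_gb(u):.0f} GB")
```

At 0.9 utilization only ~8 GB is left for training state on this hypothetical GPU, which is why high values are reserved for inference-only workloads.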
Usage
Use this when configuring the rollout engine's memory allocation, especially when hitting OOM errors during the training-rollout transition or when rollout throughput is low.
The Insight (Rule of Thumb)
- Action: Set rollout.gpu_memory_utilization in the training config.
- Value: Default is 0.5 (50%). Use 0.6-0.8 for most models. Use 0.4-0.5 for large models (30B+). Use 0.9-0.95 for inference-only VLA tasks.
- Trade-off: Higher values give more KV cache (faster rollout) but risk OOM during training. Lower values are safer but reduce rollout throughput.
- Free cache engine: Set free_cache_engine: True (the default) to release KV cache memory when transitioning to training.
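The value rules above can be encoded as a small starting-point picker. The function name and the exact thresholds are hypothetical, chosen only to mirror the ranges listed; they are not part of verl's API:

```python
# Hypothetical helper encoding the rule of thumb above; the name and
# thresholds are illustrative assumptions, not verl code.
def suggest_gpu_memory_utilization(model_params_b: float,
                                   inference_only: bool = False) -> float:
    """Suggest a starting gpu_memory_utilization for a model of the
    given size (in billions of parameters)."""
    if inference_only:        # e.g. inference-only VLA tasks: 0.9-0.95
        return 0.9
    if model_params_b >= 30:  # large models need more training headroom
        return 0.45
    return 0.7                # most models: the 0.6-0.8 range

print(suggest_gpu_memory_utilization(7))                        # most models
print(suggest_gpu_memory_utilization(70))                       # 30B+ model
print(suggest_gpu_memory_utilization(7, inference_only=True))   # VLA-style
```

Treat the returned value as a starting point and adjust empirically based on observed OOMs or idle memory.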
Reasoning
During RL training, GPU memory alternates between training (forward+backward) and rollout (KV cache for generation). The free_cache_engine flag releases the KV cache during training, but the vLLM/SGLang engine still reserves memory for model weights. Training scripts in the repo show empirical values: 0.5 for conservative, 0.7-0.8 for optimized, 0.95 for VLA tasks.
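To see why the KV cache dominates rollout memory, a back-of-envelope estimate using the standard dense-transformer formula helps. The 7B-style shape numbers below are assumptions for illustration, not values from any verl config:

```python
# Back-of-envelope KV-cache sizing using the standard transformer formula:
# 2 caches (K and V) x layers x KV heads x head_dim x seq_len x batch x bytes.
# The model shape below is an illustrative assumption.
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, dtype_bytes: int = 2) -> float:
    """Size of the K and V caches for a dense transformer, in GB (fp16/bf16 default)."""
    bytes_total = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * dtype_bytes)
    return bytes_total / 1024**3

# A 7B-like model (32 layers, 32 KV heads, head_dim 128) at 4k context, batch 8:
print(f"{kv_cache_gb(32, 32, 128, 4096, 8):.1f} GB")  # -> 16.0 GB
```

Even a modest batch at 4k context consumes tens of GB, which is why freeing the cache between rollout and training (free_cache_engine) matters.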
Code Evidence
From trainer/config/rollout/rollout.yaml:
gpu_memory_utilization: 0.5 # Fraction of GPU memory used by vLLM/SGLang for KV cache
free_cache_engine: True # Release KV cache memory during training