Heuristic: Volcengine Verl Layered Summon Memory Tradeoff
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Infrastructure
- Last Updated: 2026-02-07 17:00 GMT
Overview
Use layered_summon for huge models to prevent OOM at the cost of increased synchronization latency.
Description
The layered_summon feature controls how model weights are synchronized between training and rollout engines. When disabled (default), all weights are synchronized at once, which is fast but requires enough memory to hold both copies. When enabled, weights are synchronized layer by layer, dramatically reducing peak memory but adding latency for the layer-by-layer transfer.
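To make the tradeoff concrete, here is a rough back-of-the-envelope sketch. All numbers (bf16 precision, parameter and layer counts) are illustrative assumptions, not measurements from verl:

```python
BYTES_PER_PARAM = 2  # bf16 (illustrative assumption)
PARAMS = 70e9        # hypothetical 70B-parameter model
LAYERS = 80          # hypothetical layer count

full_copy_gb = PARAMS * BYTES_PER_PARAM / 1e9        # one full weight copy
peak_all_at_once = 2 * full_copy_gb                  # training + rollout copies resident together
peak_layered = full_copy_gb + full_copy_gb / LAYERS  # one copy + a single layer in flight

print(f"all-at-once peak: {peak_all_at_once:.0f} GB")  # 280 GB
print(f"layered peak:     {peak_layered:.0f} GB")      # 142 GB
```

Under these assumptions, all-at-once sync needs roughly double the weight footprint, while layered sync only adds one layer's worth on top of a single copy.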
Usage
Enable when training huge models (70B+) that cause OOM during the weight synchronization phase between training and rollout.
The Insight
- Action: Set trainer.layered_summon: True in config
- Value: Default is False. Enable only for huge models where OOM occurs during weight sync.
- Trade-off: Saves significant memory (prevents OOM) but makes the training-to-rollout transition slower.
- Interaction: When layered_summon is enabled, vLLM sleep_level falls back to 1, which may also cause OOM in some cases.
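The action above corresponds to a one-line config override. The field name comes from the repo's ppo_trainer.yaml quoted below; the surrounding structure is shown only for context:

```yaml
trainer:
  layered_summon: True  # default is False; enable for huge models hitting OOM during weight sync
```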
Reasoning
During RL training, model weights must be transferred from the training engine (FSDP/Megatron) to the rollout engine (vLLM/SGLang). Without layered_summon, this requires temporarily holding both copies in GPU memory. For huge models, this exceeds available VRAM. Layered summon instead transfers one layer at a time, so only a single layer's extra copy is resident at any moment, at the cost of serialized per-layer transfers that slow the training-to-rollout transition.
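A minimal sketch of the layer-by-layer pattern described above. The helper names and data shapes here are hypothetical stand-ins, not verl's actual API (which lives in its FSDP/Megatron worker code):

```python
def sync_layered(train_layers, load_into_rollout):
    """Transfer weights one layer at a time so that only one
    layer's gathered copy is resident at any moment."""
    peak_extra = 0
    for name, layer_weights in train_layers:
        gathered = dict(layer_weights)       # stand-in for an all-gather of one layer
        peak_extra = max(peak_extra, sum(gathered.values()))
        load_into_rollout(name, gathered)    # hand the layer to the rollout engine
        del gathered                         # free before gathering the next layer
    return peak_extra

# Toy model: 4 layers of 10 memory "units" each.
layers = [(f"layer{i}", {"w": 10}) for i in range(4)]
received = {}
peak = sync_layered(layers, lambda n, w: received.update({n: sum(w.values())}))
print(peak)  # 10: one layer in flight at a time, not 40 for the whole model
```

The all-at-once variant would gather every layer before loading any of them, so its peak extra memory would be the full model size instead of one layer.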
Code Evidence
From trainer/config/ppo_trainer.yaml:55:
# for huge model, layered summon can save memory (prevent OOM) but make it slower
layered_summon: False
And from verl/workers/rollout/vllm_rollout/vllm_rollout.py:90-91:
if config.layered_summon or (config.expert_parallel_size > 1 and not _check_vllm_version_for_sleep_level()):
    logger.warning("Setting the sleep level to 1 may cause a memory overflow.")