Heuristic: Alibaba ROLL GPU Memory Offload Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
GPU time-division multiplexing via state offload/reload, allowing colocated workers to share GPU memory; CPU offload is always enabled for distributed checkpointing (DCP) to prevent OOM.
Description
ROLL implements a GPU time-division multiplexing strategy where multiple worker roles (actor_train, actor_infer, reference) can share the same GPUs by offloading unused model states to CPU. When a worker is not actively computing, its states are moved to CPU RAM; when needed, states are reloaded to GPU. The framework also spreads CPU workers across nodes to avoid memory concentration and peak usage spikes. For distributed checkpointing (DCP), CPU offload is always enabled to prevent OOM during save/load operations. DeepSpeed optimizer state offloading defaults to True during train_step, though the offload implementation currently only supports optimizer parameters (not gradients).
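The time-division mechanism can be sketched as follows. This is a hypothetical illustration, not the ROLL API: `WorkerStates`, `GpuTimeSlicer`, and the device strings stand in for real tensor moves between GPU and CPU RAM.

```python
# Hypothetical sketch of GPU time-division multiplexing: only the active
# worker's states live on the GPU; every other colocated worker's states
# are offloaded to CPU RAM. Class and method names are illustrative.

class WorkerStates:
    def __init__(self, name):
        self.name = name
        self.device = "cpu"   # states start offloaded

    def reload_to_gpu(self):
        self.device = "gpu"

    def offload_to_cpu(self):
        self.device = "cpu"


class GpuTimeSlicer:
    """Ensure at most one colocated worker holds GPU memory at a time."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}

    def activate(self, name):
        # Free GPU memory held by other roles before reloading this one.
        for w in self.workers.values():
            if w.name != name and w.device == "gpu":
                w.offload_to_cpu()
        self.workers[name].reload_to_gpu()


workers = [WorkerStates(n) for n in ("actor_train", "actor_infer", "reference")]
slicer = GpuTimeSlicer(workers)
slicer.activate("actor_infer")   # rollout phase
slicer.activate("actor_train")   # train phase: actor_infer is offloaded first
```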
Usage
Apply this strategy when colocating multiple worker roles on the same GPUs (e.g., actor_train and actor_infer sharing devices). Also use CPU offloading when training large models that approach GPU memory limits. For separate-device deployments, offloading is less critical but still helpful for checkpointing.
The Insight (Rule of Thumb)
- Action: Enable colocated mode with offload/reload for multi-role GPU sharing. Always use `cpu_offload=True` for DCP checkpointing.
- Value: DeepSpeed optimizer state offload defaults to `True`. FSDP2 DCP always uses CPU offload.
- Trade-off: Colocated mode saves GPU resources but adds offload/reload overhead (several seconds per transition). Separate mode avoids this overhead but requires more GPUs.
- Memory distribution: Spread CPU workers across nodes to avoid OOM from memory concentration.
Reasoning
LLM RL training uses multiple model roles (policy, reference, critic, reward) that each require significant GPU memory. Running all roles simultaneously on separate GPUs requires 4-6x the GPU count. Time-division multiplexing reduces this by allowing roles to take turns on the same hardware. The critical insight for DCP is that saving/loading model states temporarily requires double the memory (old state + new state), making CPU offload mandatory to prevent OOM.
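The double-memory hazard during checkpointing is easy to quantify. The numbers below are illustrative back-of-the-envelope arithmetic, not measurements from ROLL:

```python
# Illustrative arithmetic: why DCP save/load can double memory demand.
# A 7B-parameter model in bf16 (2 bytes/param) holds ~14 GB of weights.
params = 7e9
bytes_per_param = 2                            # bf16
weights_gb = params * bytes_per_param / 1e9    # 14.0 GB

# During a DCP load, old and new state can coexist transiently:
peak_gb = 2 * weights_gb                       # 28.0 GB, over a 24 GB GPU

# Staging the second copy in CPU RAM keeps GPU usage at one copy:
gpu_peak_with_offload_gb = weights_gb          # back to 14.0 GB
```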
CPU offload for DCP from `roll/distributed/strategy/fsdp2_strategy.py:153`:

```python
# Always use cpu_offload=True for DCP to avoid OOM during load/save
```
Worker spreading from `roll/distributed/scheduler/resource_manager.py:133-134`:

```python
# Try to spread the CPU workers across various nodes to avoid the
# out-of-memory (OOM) situation caused by the concentration of CPU
# workers in one place and the resulting peak memory usage.
```
Optimizer state offload default from `roll/distributed/strategy/deepspeed_strategy.py:456`:

```python
is_offload_optimizer_states_in_train_step = data.meta_info.get(
    "is_offload_optimizer_states_in_train_step", True
)
```
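The flag-reading pattern can be exercised with a stand-in for `data.meta_info`. The helper function below is hypothetical; only the key name and the `True` default come from the source:

```python
def should_offload_optimizer(meta_info):
    """Mirror the DeepSpeed strategy default: optimizer states are
    offloaded during train_step unless the caller explicitly opts out."""
    return meta_info.get("is_offload_optimizer_states_in_train_step", True)


# Default path: key absent, so offload is enabled.
default_on = should_offload_optimizer({})

# A caller disables offload for a latency-sensitive step.
opted_out = should_offload_optimizer(
    {"is_offload_optimizer_states_in_train_step": False}
)
```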