Heuristic: Unslothai Unsloth vLLM Memory Utilization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reinforcement_Learning, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
Unsloth defaults to `gpu_memory_utilization=0.5` in vLLM colocate mode, reserving half of VRAM for training while vLLM uses the other half for its KV cache and inference.
Description
In GRPO reinforcement learning, vLLM runs on the same GPU as the training process (colocate mode). The `gpu_memory_utilization` parameter controls what fraction of total VRAM vLLM allocates for its KV cache. The default of 0.5 splits VRAM equally between training (model weights + optimizer states + activations) and inference (vLLM KV cache). With TRL >= 0.23.0, sleep mode is available: vLLM releases its memory during training steps and reclaims it during generation steps, effectively allowing higher utilization of the same GPU.
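The split can be sketched with simple arithmetic (illustrative only; `vram_split` is a hypothetical helper, and real allocation also covers the CUDA context, vLLM's copy of the model weights, and allocator fragmentation):

```python
def vram_split(total_gb: float, gpu_memory_utilization: float = 0.5):
    """Partition total VRAM between vLLM (KV cache + inference) and training.

    Illustrative sketch only: actual usage depends on model size,
    sequence lengths, and allocator overheads.
    """
    vllm_gb = total_gb * gpu_memory_utilization
    training_gb = total_gb - vllm_gb
    return vllm_gb, training_gb

# On a 24 GB card with the 0.5 default: 12 GB for vLLM, 12 GB for training.
print(vram_split(24.0))  # (12.0, 12.0)
```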
Usage
Use `gpu_memory_utilization=0.5` (default) for colocated GRPO training on a single GPU. Lower to `0.3-0.4` if you experience CUDA OOM during training steps. Increase to `0.6-0.7` if you need longer generation sequences (more KV cache) and have sufficient VRAM. Enable sleep mode (`UNSLOTH_VLLM_STANDBY=1`) with TRL >= 0.23.0 for dynamic VRAM sharing.
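The tuning advice above can be encoded as a tiny decision helper (`pick_gpu_memory_utilization` is a hypothetical function, not an Unsloth API; the returned values are midpoints of the suggested ranges):

```python
def pick_gpu_memory_utilization(oom_during_training: bool = False,
                                need_long_generations: bool = False) -> float:
    """Encode the rule of thumb: start at 0.5, lower on training OOM,
    raise when longer generations need more KV cache (and VRAM allows)."""
    if oom_during_training:
        return 0.35  # midpoint of the suggested 0.3-0.4 range
    if need_long_generations:
        return 0.65  # midpoint of the suggested 0.6-0.7 range
    return 0.5       # Unsloth's colocate default
```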
The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization=0.5` for colocate mode (default). Enable sleep mode for dynamic sharing.
- Value: Default: 0.5. RL training defaults: `per_device_train_batch_size=4`, `gradient_accumulation_steps=2`, `num_generations=8`.
- Trade-off: Higher utilization = more KV cache for longer generations but less VRAM for training. Sleep mode eliminates this trade-off at the cost of vLLM startup/shutdown latency.
- Compatibility: Colocate mode requires TRL >= 0.18.0. Sleep mode requires TRL >= 0.23.0.
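A minimal sketch of the compatibility gates, assuming plain `major.minor.patch` version strings (the helper names are hypothetical; Unsloth itself compares `packaging.version.Version` objects, as shown in the excerpt below):

```python
def _parse(version: str) -> tuple:
    """Turn 'major.minor.patch' into a comparable tuple of ints."""
    return tuple(int(p) for p in version.split(".")[:3])

def supports_colocate(trl_version: str) -> bool:
    # Colocate mode landed in TRL 0.18.0
    return _parse(trl_version) >= (0, 18, 0)

def supports_sleep_mode(trl_version: str) -> bool:
    # Sleep mode requires TRL 0.23.0 or newer
    return _parse(trl_version) >= (0, 23, 0)
```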
Reasoning
In colocate mode, the training process and vLLM inference engine share a single GPU. Unlike server mode (where vLLM runs on a separate GPU), colocate mode requires careful VRAM partitioning. The 0.5 default was chosen because: (1) 4-bit quantized model weights + 8-bit optimizer states typically consume 40-50% of VRAM for a model that fits on one GPU, and (2) vLLM's KV cache needs the remaining VRAM for generation. The RL training defaults (`batch_size=4, grad_accum=2, num_generations=8`) are tuned to work within this budget.
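The 40-50% figure can be sanity-checked with back-of-envelope arithmetic (a hypothetical helper under stated, illustrative assumptions; not Unsloth code):

```python
def training_vram_gb(base_params_b: float, lora_params_m: float = 50.0) -> float:
    """Rough training-side VRAM estimate under illustrative assumptions:
    4-bit base weights (0.5 bytes/param), plus fp16 LoRA weights, fp16
    gradients, and 8-bit Adam states (~2 bytes/param) on LoRA params only.
    Excludes activations, CUDA context, and allocator fragmentation.
    """
    base = base_params_b * 1e9 * 0.5          # 4-bit quantized weights
    lora = lora_params_m * 1e6 * (2 + 2 + 2)  # weights + grads + optimizer
    return (base + lora) / 1e9

# A 7B model: ~3.5 GB of 4-bit weights + ~0.3 GB of LoRA state,
# before activations eat into the remaining training budget.
```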
Default gpu_memory_utilization from `models/loader.py:143`:
gpu_memory_utilization = 0.5,
RL training defaults from `models/rl.py:832-865`:
replacements = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "num_generations": 8,
    "vllm_mode": "colocate",
    "optim": "adamw_8bit",
    "learning_rate": 5e-05,
    "torch_empty_cache_steps": 250,
    "auto_find_batch_size": False,  # Too many people complained so removing
    ...
}
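The pattern behind this table is "fill in whatever the user didn't set"; a simplified sketch of that merge (not Unsloth's actual mechanism, which patches TRL source at import time):

```python
def apply_rl_defaults(user_kwargs: dict, defaults: dict) -> dict:
    """Return a config dict where user-supplied values win over defaults."""
    merged = dict(defaults)     # start from the defaults table
    merged.update(user_kwargs)  # user-provided settings take precedence
    return merged

defaults = {"per_device_train_batch_size": 4, "num_generations": 8,
            "optim": "adamw_8bit"}
cfg = apply_rl_defaults({"num_generations": 16}, defaults)
# cfg["num_generations"] == 16; untouched keys keep their defaults
```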
Sleep mode enablement from `models/rl.py:1354-1361`:
if trl_version >= Version("0.23.0"):
    vllm_setter += (
        " " * 12
        + "if os.environ.get('UNSLOTH_VLLM_STANDBY', '0') == '1':\n"
        + " " * 16
        + "args.vllm_enable_sleep_mode=True\n"
    )
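The generated guard boils down to a single environment check; a standalone runnable equivalent (a sketch, not the injected code itself):

```python
import os

def sleep_mode_requested() -> bool:
    """Standby/sleep mode is opt-in: the env var must be exactly '1'."""
    return os.environ.get("UNSLOTH_VLLM_STANDBY", "0") == "1"

# Unset or any other value means sleep mode stays off.
os.environ.pop("UNSLOTH_VLLM_STANDBY", None)
assert not sleep_mode_requested()

os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
assert sleep_mode_requested()
```

Note the variable must be set before Unsloth generates the trainer code, since the check is baked into the patched setup path.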
Batch size auto-adjustment from `models/rl.py:986-1000`:
check_num_generations = (
    "if steps_per_generation is None and generation_batch_size is None:\n"
    "    ga = gradient_accumulation_steps\n"
    "    world_size = int(os.environ.get('WORLD_SIZE', '1'))\n"
    "    if (ga * world_size * per_device_train_batch_size) % num_generations != 0:\n"
    "        per_device_train_batch_size = num_generations\n"
)
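Unfolded into plain Python, the injected check reads as follows (`effective_batch_size` is a hypothetical wrapper around the same logic):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_generations: int,
                         world_size: int = 1) -> int:
    """The global batch (grad_accum * world_size * per-device batch) must be
    divisible by num_generations; otherwise the per-device batch size is
    bumped up to num_generations."""
    global_batch = (gradient_accumulation_steps * world_size
                    * per_device_train_batch_size)
    if global_batch % num_generations != 0:
        per_device_train_batch_size = num_generations
    return per_device_train_batch_size

# With the defaults (bs=4, ga=2, num_generations=8): 2*1*4 = 8, divisible
# by 8, so the batch size is left unchanged.
print(effective_batch_size(4, 2, 8))  # 4
# bs=3 gives a global batch of 6, not divisible by 8, so bs is bumped to 8.
print(effective_batch_size(3, 2, 8))  # 8
```

This explains why the RL defaults (`batch_size=4, grad_accum=2, num_generations=8`) fit together: their product is already a multiple of `num_generations`.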