Heuristic: Unslothai Unsloth vLLM Memory Utilization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reinforcement_Learning, Memory_Management |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
Unsloth defaults to `gpu_memory_utilization=0.5` in vLLM colocate mode, reserving half of VRAM for training while vLLM uses the other half for its KV cache and inference.
Description
In GRPO reinforcement learning, vLLM runs on the same GPU as the training process (colocate mode). The `gpu_memory_utilization` parameter controls what fraction of total VRAM vLLM allocates for its KV cache. The default of 0.5 splits VRAM equally between training (model weights + optimizer states + activations) and inference (vLLM KV cache). With TRL >= 0.23.0, sleep mode is available: vLLM releases its memory during training steps and reclaims it during generation steps, effectively allowing higher utilization of the same GPU.
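The split can be sketched with simple arithmetic (illustrative only; `vram_split` is a hypothetical helper, and real allocation also covers the CUDA context, vLLM's copy of the model weights, and allocator fragmentation):

```python
def vram_split(total_gb: float, gpu_memory_utilization: float = 0.5):
    """Partition total VRAM between vLLM (KV cache + inference) and training.

    Illustrative sketch only: actual usage depends on model size,
    sequence lengths, and allocator overheads.
    """
    vllm_gb = total_gb * gpu_memory_utilization
    training_gb = total_gb - vllm_gb
    return vllm_gb, training_gb

# On a 24 GB card with the 0.5 default: 12 GB for vLLM, 12 GB for training.
print(vram_split(24.0))  # (12.0, 12.0)
```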
Usage
Use `gpu_memory_utilization=0.5` (default) for colocated GRPO training on a single GPU. Lower to `0.3-0.4` if you experience CUDA OOM during training steps. Increase to `0.6-0.7` if you need longer generation sequences (more KV cache) and have sufficient VRAM. Enable sleep mode (`UNSLOTH_VLLM_STANDBY=1`) with TRL >= 0.23.0 for dynamic VRAM sharing.
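The tuning advice above can be encoded as a tiny decision helper (`pick_gpu_memory_utilization` is a hypothetical function, not an Unsloth API; the returned values are midpoints of the suggested ranges):

```python
def pick_gpu_memory_utilization(oom_during_training: bool = False,
                                need_long_generations: bool = False) -> float:
    """Encode the rule of thumb: start at 0.5, lower on training OOM,
    raise when longer generations need more KV cache (and VRAM allows)."""
    if oom_during_training:
        return 0.35  # midpoint of the suggested 0.3-0.4 range
    if need_long_generations:
        return 0.65  # midpoint of the suggested 0.6-0.7 range
    return 0.5       # Unsloth's colocate default
```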
The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization=0.5` for colocate mode (default). Enable sleep mode for dynamic sharing.
- Value: Default: 0.5. RL training defaults: `per_device_train_batch_size=4`, `gradient_accumulation_steps=2`, `num_generations=8`.
- Trade-off: Higher utilization = more KV cache for longer generations but less VRAM for training. Sleep mode eliminates this trade-off at the cost of vLLM startup/shutdown latency.
- Compatibility: Colocate mode requires TRL >= 0.18.0. Sleep mode requires TRL >= 0.23.0.
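A minimal sketch of the compatibility gates, assuming plain `major.minor.patch` version strings (the helper names are hypothetical; Unsloth itself compares `packaging.version.Version` objects, as shown in the excerpt below):

```python
def _parse(version: str) -> tuple:
    """Turn 'major.minor.patch' into a comparable tuple of ints."""
    return tuple(int(p) for p in version.split(".")[:3])

def supports_colocate(trl_version: str) -> bool:
    # Colocate mode landed in TRL 0.18.0
    return _parse(trl_version) >= (0, 18, 0)

def supports_sleep_mode(trl_version: str) -> bool:
    # Sleep mode requires TRL 0.23.0 or newer
    return _parse(trl_version) >= (0, 23, 0)
```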
Reasoning
In colocate mode, the training process and vLLM inference engine share a single GPU. Unlike server mode (where vLLM runs on a separate GPU), colocate mode requires careful VRAM partitioning. The 0.5 default was chosen because: (1) 4-bit quantized model weights + 8-bit optimizer states typically consume 40-50% of VRAM for a model that fits on one GPU, and (2) vLLM's KV cache needs the remaining VRAM for generation. The RL training defaults (`batch_size=4, grad_accum=2, num_generations=8`) are tuned to work within this budget.
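The 40-50% figure can be sanity-checked with back-of-envelope arithmetic (a hypothetical helper under stated, illustrative assumptions; not Unsloth code):

```python
def training_vram_gb(base_params_b: float, lora_params_m: float = 50.0) -> float:
    """Rough training-side VRAM estimate under illustrative assumptions:
    4-bit base weights (0.5 bytes/param), plus fp16 LoRA weights, fp16
    gradients, and 8-bit Adam states (~2 bytes/param) on LoRA params only.
    Excludes activations, CUDA context, and allocator fragmentation.
    """
    base = base_params_b * 1e9 * 0.5          # 4-bit quantized weights
    lora = lora_params_m * 1e6 * (2 + 2 + 2)  # weights + grads + optimizer
    return (base + lora) / 1e9

# A 7B model: ~3.5 GB of 4-bit weights + ~0.3 GB of LoRA state,
# before activations eat into the remaining training budget.
```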
Default gpu_memory_utilization from `models/loader.py:143`:
gpu_memory_utilization = 0.5,
RL training defaults from `models/rl.py:832-865`:
replacements = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "num_generations": 8,
    "vllm_mode": "colocate",
    "optim": "adamw_8bit",
    "learning_rate": 5e-05,
    "torch_empty_cache_steps": 250,
    "auto_find_batch_size": False,  # Too many people complained so removing
    ...
}
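The pattern behind this table is "fill in whatever the user didn't set"; a simplified sketch of that merge (not Unsloth's actual mechanism, which patches TRL source at import time):

```python
def apply_rl_defaults(user_kwargs: dict, defaults: dict) -> dict:
    """Return a config dict where user-supplied values win over defaults."""
    merged = dict(defaults)     # start from the defaults table
    merged.update(user_kwargs)  # user-provided settings take precedence
    return merged

defaults = {"per_device_train_batch_size": 4, "num_generations": 8,
            "optim": "adamw_8bit"}
cfg = apply_rl_defaults({"num_generations": 16}, defaults)
# cfg["num_generations"] == 16; untouched keys keep their defaults
```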
Sleep mode enablement from `models/rl.py:1354-1361`:
if trl_version >= Version("0.23.0"):
    vllm_setter += (
        " " * 12
        + "if os.environ.get('UNSLOTH_VLLM_STANDBY', '0') == '1':\n"
        + " " * 16
        + "args.vllm_enable_sleep_mode=True\n"
    )
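The generated guard boils down to a single environment check; a standalone runnable equivalent (a sketch, not the injected code itself):

```python
import os

def sleep_mode_requested() -> bool:
    """Standby/sleep mode is opt-in: the env var must be exactly '1'."""
    return os.environ.get("UNSLOTH_VLLM_STANDBY", "0") == "1"

# Unset or any other value means sleep mode stays off.
os.environ.pop("UNSLOTH_VLLM_STANDBY", None)
assert not sleep_mode_requested()

os.environ["UNSLOTH_VLLM_STANDBY"] = "1"
assert sleep_mode_requested()
```

Note the variable must be set before Unsloth generates the trainer code, since the check is baked into the patched setup path.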
Batch size auto-adjustment from `models/rl.py:986-1000`:
check_num_generations = (
    "if steps_per_generation is None and generation_batch_size is None:\n"
    "    ga = gradient_accumulation_steps\n"
    "    world_size = int(os.environ.get('WORLD_SIZE', '1'))\n"
    "    if (ga * world_size * per_device_train_batch_size) % num_generations != 0:\n"
    "        per_device_train_batch_size = num_generations\n"
)
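Unfolded into plain Python, the injected check reads as follows (`effective_batch_size` is a hypothetical wrapper around the same logic):

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_generations: int,
                         world_size: int = 1) -> int:
    """The global batch (grad_accum * world_size * per-device batch) must be
    divisible by num_generations; otherwise the per-device batch size is
    bumped up to num_generations."""
    global_batch = (gradient_accumulation_steps * world_size
                    * per_device_train_batch_size)
    if global_batch % num_generations != 0:
        per_device_train_batch_size = num_generations
    return per_device_train_batch_size

# With the defaults (bs=4, ga=2, num_generations=8): 2*1*4 = 8, divisible
# by 8, so the batch size is left unchanged.
print(effective_batch_size(4, 2, 8))  # 4
# bs=3 gives a global batch of 6, not divisible by 8, so bs is bumped to 8.
print(effective_batch_size(3, 2, 8))  # 8
```

This explains why the RL defaults (`batch_size=4, grad_accum=2, num_generations=8`) fit together: their product is already a multiple of `num_generations`.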