
Environment:Unslothai Unsloth CUDA VLLM

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Reinforcement_Learning, Inference
Last Updated 2026-02-07 09:00 GMT

Overview

A CUDA GPU environment with the vLLM inference engine, used for fast generation during GRPO reinforcement learning, AIME benchmark evaluation, and colocated training-and-inference workflows.

Description

This environment extends CUDA_BitsAndBytes with the vLLM inference engine for high-throughput generation during RL training. vLLM runs in colocate mode alongside the training process on the same GPU, using sleep mode to share VRAM between training and inference phases. The environment patches TRL's GRPO trainer to use a pre-loaded vLLM engine instead of spawning a separate server. Multiple vLLM version compatibility shims are applied to handle API renames and Blackwell GPU issues.

Usage

Use this environment for any workflow that sets `fast_inference=True` in `FastLanguageModel.from_pretrained()`, including GRPO reinforcement learning and AIME math benchmark evaluation. Required when vLLM-based generation is needed for RL rollouts or model evaluation.
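The `fast_inference=True` flow above can be sketched as a small helper that assembles the loading arguments. This is an illustrative sketch, not Unsloth code: the helper name is made up, the model name is a placeholder, and only the keyword names (`model_name`, `max_seq_length`, `load_in_4bit`, `fast_inference`, `gpu_memory_utilization`) follow Unsloth's documented `from_pretrained` signature.

```python
# Sketch: arguments for enabling vLLM fast inference in Unsloth.
# Assumes a CUDA GPU with unsloth + vllm installed; helper name is illustrative.
def vllm_load_kwargs(model_name, max_seq_length=2048, gpu_memory_utilization=0.5):
    """Build the keyword arguments for FastLanguageModel.from_pretrained."""
    return dict(
        model_name             = model_name,
        max_seq_length         = max_seq_length,
        load_in_4bit           = True,                   # BitsAndBytes 4-bit base weights
        fast_inference         = True,                   # enable the colocated vLLM engine
        gpu_memory_utilization = gpu_memory_utilization, # vLLM's share of VRAM
    )

# Usage (requires a GPU; model name is a placeholder):
# from unsloth import FastLanguageModel
# model, tokenizer = FastLanguageModel.from_pretrained(
#     **vllm_load_kwargs("unsloth/Qwen2.5-3B-Instruct")
# )
```

Keeping `gpu_memory_utilization` at the 0.5 default leaves the other half of VRAM free for the training phase, matching the colocate-mode behavior described below.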

System Requirements

  • OS: Linux (vLLM does not support Windows or macOS)
  • Hardware: NVIDIA GPU with compute capability >= 7.0 (Blackwell/SM100 requires torch >= 2.9.0 with vLLM)
  • VRAM: 16GB minimum (colocate mode shares VRAM between training and inference; `gpu_memory_utilization=0.5` by default)
  • CUDA: 11.8, 12.1, 12.4, 12.6, 12.8, or 13.0 (must match the PyTorch CUDA version)
  • Disk: 50GB+ SSD (for model weights and the vLLM KV cache)
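The hardware rules above can be checked programmatically. The following is an illustrative helper, not part of Unsloth's API; it only mirrors the requirements listed above:

```python
def check_gpu_support(major, minor, torch_version=(2, 9, 0)):
    """Return a verdict for a (major, minor) CUDA compute capability,
    per the requirements above (illustrative helper, not Unsloth API)."""
    if (major, minor) < (7, 0):
        return "unsupported"              # vLLM needs compute capability >= 7.0
    if major == 10 and torch_version < (2, 9, 0):
        return "needs torch >= 2.9.0"     # Blackwell (SM100) restriction
    return "ok"

# e.g. an A100 is capability (8, 0); a B200 (Blackwell) is (10, 0)
```

On a live system the capability tuple would come from `torch.cuda.get_device_capability()`, as in the code evidence below.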

Dependencies

System Packages

  • All packages from CUDA_BitsAndBytes environment
  • `cuda-toolkit` matching torch version

Python Packages

  • `vllm` (latest compatible version; < 0.13.2 for PDL fix workaround)
  • `torch` >= 2.1.0 (>= 2.9.0 for Blackwell GPUs)
  • `trl` >= 0.18.2 (>= 0.18.0 for colocate mode, >= 0.23.0 for sleep mode)
  • All packages from CUDA_BitsAndBytes and Python_Transformers environments
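The TRL version thresholds above decide which vLLM integration features Unsloth can enable. A sketch of that gating (the helper is illustrative, not Unsloth code; it assumes a plain `X.Y.Z` version string):

```python
def trl_vllm_features(trl_version):
    """Map a TRL version string to the vLLM features available,
    per the thresholds listed above (illustrative, not Unsloth API)."""
    parts = tuple(int(p) for p in trl_version.split(".")[:3])
    return {
        "colocate_mode": parts >= (0, 18, 0),  # vLLM on the training GPU
        "sleep_mode":    parts >= (0, 23, 0),  # free vLLM VRAM between rollouts
    }
```

In real code a library such as `packaging.version.Version` handles pre-release suffixes that this simple tuple parse would reject.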

Credentials

  • `HF_TOKEN`: HuggingFace API token (for gated model access).

Quick Install

# Install vLLM (in addition to base Unsloth install)
pip install vllm

# For Blackwell GPUs, ensure torch >= 2.9.0
pip install "torch>=2.9.0" vllm
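After installing, the same presence check Unsloth performs at load time can be run by hand. This sketch uses only the standard library and does not import vLLM itself:

```python
import importlib.util

def vllm_installed():
    """True if the vllm package is importable (mirrors Unsloth's load-time check)."""
    return importlib.util.find_spec("vllm") is not None

if not vllm_installed():
    print("vLLM missing - run: pip install vllm")
```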

Code Evidence

vLLM import check from `models/loader.py:234-239`:

if fast_inference:
    if importlib.util.find_spec("vllm") is None:
        raise ImportError(
            "Unsloth: Please install vLLM before enabling `fast_inference`!\n"
            "You can do this in a terminal via `pip install vllm`"
        )

vLLM colocate mode setup from `models/rl.py:1351-1361`:

if "grpo" in trainer_file and trl_version >= Version("0.18.0"):
    vllm_setter += " " * 12 + "args.vllm_mode='colocate'\n"
    if trl_version >= Version("0.23.0"):
        vllm_setter += (
            " " * 12
            + "if os.environ.get('UNSLOTH_VLLM_STANDBY', '0') == '1':\n"
            + " " * 16
            + "args.vllm_enable_sleep_mode=True\n"
        )

Blackwell GPU compatibility check from `import_fixes.py:835-862`:

def check_vllm_torch_sm100_compatibility():
    if torch_version >= Version("2.9.0"):
        return  # torch >= 2.9.0 is compatible
    has_sm100 = False
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        if major == 10:  # SM100 = Blackwell
            has_sm100 = True
    if not has_sm100:
        return
    raise RuntimeError(
        "vLLM's distributed module crashes with std::bad_alloc on SM100 GPUs..."
    )

vLLM + transformers 5.0 mismatch from `import_fixes.py:350-367`:

def _maybe_raise_vllm_transformers_mismatch(error):
    error_text = str(error)
    if "ALLOWED_LAYER_TYPES" in error_text or "transformers.configuration_utils" in error_text:
        raise RuntimeError(
            f"Unsloth: vLLM with version {vllm_version} does not yet support transformers>=5.0.0. "
            "Please downgrade to transformers==4.57.3..."
        )

Common Errors

  • `ImportError: Please install vLLM before enabling fast_inference`: vLLM is not installed. Fix: `pip install vllm`.
  • `vLLM's distributed module crashes with std::bad_alloc on SM100 GPUs`: Blackwell GPU with torch < 2.9.0. Fix: upgrade to `torch>=2.9.0`.
  • `vLLM does not yet support transformers>=5.0.0`: transformers is too new for the installed vLLM. Fix: `pip install transformers==4.57.3`.
  • `GuidedDecodingParams` import error: vLLM renamed the class to `StructuredOutputsParams`. Fix: Unsloth auto-patches this; update vLLM if the error persists.
  • `DGX Spark detected - fast_inference=True is currently broken`: DGX Spark GPU (GB10). Fix: set `fast_inference=False`.
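The error table above can be condensed into a substring-to-remedy lookup. This mapping is only a summary of the table, not Unsloth code:

```python
# Summary of the error table as a lookup (illustrative, not Unsloth code).
ERROR_REMEDIES = {
    "Please install vLLM":                        "pip install vllm",
    "std::bad_alloc on SM100":                    "upgrade to torch>=2.9.0",
    "does not yet support transformers>=5.0.0":   "pip install transformers==4.57.3",
    "GuidedDecodingParams":                       "update vLLM (Unsloth auto-patches the rename)",
    "DGX Spark detected":                         "set fast_inference=False",
}

def suggest_fix(error_message):
    """Return a remedy string for a known error message, else None."""
    for needle, remedy in ERROR_REMEDIES.items():
        if needle in error_message:
            return remedy
    return None
```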

Compatibility Notes

  • Colocate mode: vLLM runs on the same GPU as training (no separate server). Requires TRL >= 0.18.0. Sleep mode (TRL >= 0.23.0) auto-frees vLLM VRAM during training steps.
  • Blackwell GPUs (SM100): Requires torch >= 2.9.0. Triton PDL is auto-disabled (`TRITON_DISABLE_PDL=1`).
  • DGX Spark (GB10): `fast_inference=True` is auto-disabled.
  • Notebooks (Colab/Kaggle): vLLM >= 0.12.0 requires stdout patching for notebook compatibility (`sys.stdout.fileno = lambda: 1`).
  • Vision models: Only a subset of VLMs support vLLM inference: Qwen2.5-VL, Gemma 3, Mistral 3, Qwen3-VL, Qwen3-VL-MoE.
  • gpu_memory_utilization: Defaults to 0.5 in colocate mode, reserving half of VRAM for training. Adjustable via `gpu_memory_utilization` parameter.
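Sleep mode in the notes above is opted into through an environment variable set before the trainer is built; the `UNSLOTH_VLLM_STANDBY` name comes from the colocate-mode code evidence above. A minimal sketch (the checker function is illustrative):

```python
import os

# Opt in to vLLM sleep mode, which frees vLLM's VRAM during training steps.
# Unsloth's patched GRPO trainer reads this flag when TRL >= 0.23.0.
os.environ["UNSLOTH_VLLM_STANDBY"] = "1"

def sleep_mode_requested():
    """Mirror the check Unsloth injects into the trainer setup."""
    return os.environ.get("UNSLOTH_VLLM_STANDBY", "0") == "1"
```

Set the variable before importing `unsloth` and constructing the trainer so the patched setup code sees it.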
