
Environment:Alibaba ROLL SGLang Inference Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, LLM_Inference
Last Updated 2026-02-07 19:00 GMT

Overview

SGLang inference backend environment supporting versions 0.4.6.post4, 0.5.2, and 0.5.4.post2, with strict version enforcement and version-specific patches.

Description

This environment provides the SGLang inference backend for ROLL. Unlike vLLM, which only warns about unsupported versions, SGLang raises `NotImplementedError` for any untested version, so version compliance is strictly mandatory. The framework applies version-specific patches for engine initialization, NCCL configuration, and CUDA settings. SGLang removes `PYTORCH_CUDA_ALLOC_CONF` to avoid allocator conflicts and sets `CUDA_DEVICE_MAX_CONNECTIONS=4` and `CUDA_MODULE_LOADING=AUTO`.
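The CUDA environment hygiene described above can be sketched as follows. This is an illustrative helper, not the actual ROLL patch code; the variable names are taken from this page, but `configure_cuda_env` and its signature are assumptions:

```python
import os

def configure_cuda_env(enable_symm_mem: bool = False) -> None:
    """Illustrative sketch of the CUDA env setup described above."""
    # SGLang removes PYTORCH_CUDA_ALLOC_CONF to avoid allocator conflicts.
    os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)
    # Cap concurrent hardware connections per device.
    os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
    # Let the driver pick the module loading mode.
    os.environ["CUDA_MODULE_LOADING"] = "AUTO"
    # NCCL cuMem allocation follows the symmetric-memory setting.
    os.environ["NCCL_CUMEM_ENABLE"] = str(int(enable_symm_mem))

configure_cuda_env()
print(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])  # prints 4
```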

Usage

Use this environment when configuring `actor_infer` workers with the SGLang backend. SGLang performs continuous batching with automatic batch management, so the `infer_batch_size` setting has no effect. The number of inference engines is `len(device_mapping) // num_gpus_per_worker`.
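The engine-count formula above can be expressed as a small helper; the function name is illustrative, only the arithmetic comes from this page:

```python
def num_inference_engines(device_mapping, num_gpus_per_worker: int) -> int:
    """Engines = number of mapped devices divided by GPUs per worker."""
    return len(device_mapping) // num_gpus_per_worker

# e.g. 8 GPUs with 2 GPUs per worker -> 4 engines
print(num_inference_engines(list(range(8)), 2))  # prints 4
```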

System Requirements

Category | Requirement | Notes
Hardware | NVIDIA GPU with CUDA support | ROCm not tested for SGLang
VRAM | Controlled by `mem_fraction_static` | Default 0.7 (70% of GPU memory)

Dependencies

Python Packages

  • `sglang[srt,torch-memory-saver]` == 0.4.6.post4 (torch 2.6.0) or 0.5.2 (torch 2.8.0)
  • `cuda-bindings` == 12.9.0 (torch 2.6.0 setup only)
  • `transformers` == 4.51.1 (torch 2.6.0 setup)
  • `flashinfer` (installed as SGLang dependency)

Environment Variables

  • `CUDA_DEVICE_MAX_CONNECTIONS`: Set to `4` (configured internally)
  • `CUDA_MODULE_LOADING`: Set to `AUTO` (configured internally)
  • `TRTLLM_ENABLE_PDL`: TensorRT-LLM PDL flag (default `1`)
  • `NCCL_CUMEM_ENABLE`: Controlled by `enable_symm_mem` setting
  • `NCCL_NVLS_ENABLE`: Controlled by `enable_nccl_nvls` setting
  • `FLASHINFER_WORKSPACE_BASE`: Per-worker workspace directory
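A minimal sketch of setting these variables per worker. In ROLL they are configured internally; the `worker_rank` index and the workspace path layout are assumptions for illustration:

```python
import os
import tempfile

worker_rank = 0  # hypothetical worker index

os.environ["NCCL_CUMEM_ENABLE"] = "1"   # mirrors enable_symm_mem: true
os.environ["NCCL_NVLS_ENABLE"] = "0"    # mirrors enable_nccl_nvls: false
# Per-worker FlashInfer workspace directory (path layout is illustrative).
os.environ["FLASHINFER_WORKSPACE_BASE"] = os.path.join(
    tempfile.gettempdir(), f"flashinfer_worker_{worker_rank}"
)
```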

Quick Install

# For torch 2.6.0 setup
pip install "sglang[srt,torch-memory-saver]==0.4.6.post4"
pip install cuda-bindings==12.9.0
pip install transformers==4.51.1

# For torch 2.8.0 setup
pip install "sglang[srt,torch-memory-saver]==0.5.2"

# Or use the combined requirements file for your setup
pip install -r requirements_torch260_sglang.txt   # torch 2.6.0
pip install -r requirements_torch280_sglang.txt   # torch 2.8.0

Code Evidence

Strict version enforcement from `roll/third_party/sglang/__init__.py:4-23`:

if sgl.__version__ == '0.4.6.post4':
    from roll.third_party.sglang import v046post4_patch
    patch = v046post4_patch
elif sgl.__version__ == '0.5.2':
    from roll.third_party.sglang import v052_patch
    patch = v052_patch
elif sgl.__version__ == '0.5.4.post2':
    from roll.third_party.sglang import v054_patch
    patch = v054_patch
else:
    raise NotImplementedError(
        f"Scale aligner version sglang:{sgl.__version__} is not supported."
    )

CUDA environment configuration from SGLang patch `roll/third_party/sglang/v054_patch/engine.py:18-25`:

os.environ["NCCL_CUMEM_ENABLE"] = str(int(server_args.enable_symm_mem))
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
os.environ["CUDA_MODULE_LOADING"] = "AUTO"

Common Errors

Error Message | Cause | Solution
`NotImplementedError: Scale aligner version sglang:X is not supported` | SGLang version not in supported list | Install an exact supported version: 0.4.6.post4, 0.5.2, or 0.5.4.post2
KV cache building failure | `mem_fraction_static` too low | Increase `mem_fraction_static` (default 0.7)
CUDA memory insufficient | `mem_fraction_static` too high | Decrease `mem_fraction_static`
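A pre-flight check can surface the version error before workers launch. This mirrors the dispatch shown in Code Evidence; `check_sglang_version` itself is an illustrative helper:

```python
# Supported versions, per the dispatch in roll/third_party/sglang/__init__.py.
SUPPORTED_SGLANG_VERSIONS = {"0.4.6.post4", "0.5.2", "0.5.4.post2"}

def check_sglang_version(version: str) -> None:
    """Raise early with the same error ROLL would raise at import time."""
    if version not in SUPPORTED_SGLANG_VERSIONS:
        raise NotImplementedError(
            f"Scale aligner version sglang:{version} is not supported."
        )

check_sglang_version("0.5.2")  # passes silently
```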

Compatibility Notes

  • Strict version enforcement: Unlike vLLM, SGLang raises an error for unsupported versions.
  • ROCm: Not tested; NVIDIA CUDA only.
  • Ascend NPU: Not supported.
  • Continuous batching: `infer_batch_size` setting has no effect (automatic batching).
  • Triton bug: SGLang v0.4.6 includes a workaround for a Triton compiler bug (triton-lang/triton#4295).
