Environment: Alibaba ROLL SGLang Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
SGLang inference backend environment supporting versions 0.4.6.post4, 0.5.2, and 0.5.4.post2 with strict version enforcement and version-specific patches.
Description
This environment provides the SGLang inference backend for ROLL. Unlike vLLM, which only warns on unsupported versions, SGLang raises `NotImplementedError` for any untested version, so version compliance is strictly mandatory. The framework applies version-specific patches for engine initialization, NCCL configuration, and CUDA settings. SGLang removes `PYTORCH_CUDA_ALLOC_CONF` to avoid allocator conflicts and sets `CUDA_DEVICE_MAX_CONNECTIONS=4` and `CUDA_MODULE_LOADING=AUTO`.
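The environment-variable handling described above can be sketched in isolation. This is a hedged, pure-Python mimic of the patch behavior (no SGLang import); the stale `PYTORCH_CUDA_ALLOC_CONF` value is illustrative:

```python
import os

# Illustrative stale value left over from a previous training process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# The patch drops PYTORCH_CUDA_ALLOC_CONF to avoid allocator conflicts...
os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)

# ...and pins the CUDA settings named in the description.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
os.environ["CUDA_MODULE_LOADING"] = "AUTO"

print("PYTORCH_CUDA_ALLOC_CONF" in os.environ)  # False
```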
Usage
Use this environment when configuring `actor_infer` workers with the SGLang backend. SGLang performs continuous batching and manages batch sizes automatically, so the `infer_batch_size` setting has no effect. The number of inference engines is determined by `len(device_mapping) // num_gpus_per_worker`.
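The engine-count formula can be sketched concretely. The values below are illustrative (`device_mapping` and `num_gpus_per_worker` are the setting names from the text; the exact config schema may differ):

```python
# Hedged sketch: how the number of SGLang engines is derived.
device_mapping = [0, 1, 2, 3, 4, 5, 6, 7]  # GPUs assigned to actor_infer
num_gpus_per_worker = 2                    # e.g. tensor-parallel degree 2

num_engines = len(device_mapping) // num_gpus_per_worker
print(num_engines)  # 4 engines, each spanning 2 GPUs
```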
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA support | ROCm not tested for SGLang |
| VRAM | Controlled by `mem_fraction_static` | Default 0.7 (70% of GPU memory) |
Dependencies
Python Packages
- `sglang[srt,torch-memory-saver]` == 0.4.6.post4 (torch 2.6.0) or 0.5.2 (torch 2.8.0)
- `cuda-bindings` == 12.9.0 (torch 2.6.0 setup only)
- `transformers` == 4.51.1 (torch 2.6.0 setup)
- `flashinfer` (installed as SGLang dependency)
Environment Variables
- `CUDA_DEVICE_MAX_CONNECTIONS`: Set to `4` (configured internally)
- `CUDA_MODULE_LOADING`: Set to `AUTO` (configured internally)
- `TRTLLM_ENABLE_PDL`: TensorRT-LLM PDL flag (default `1`)
- `NCCL_CUMEM_ENABLE`: Controlled by `enable_symm_mem` setting
- `NCCL_NVLS_ENABLE`: Controlled by `enable_nccl_nvls` setting
- `FLASHINFER_WORKSPACE_BASE`: Per-worker workspace directory
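The two NCCL variables above are serialized from boolean engine settings. A hedged sketch of that wiring (`enable_symm_mem` and `enable_nccl_nvls` are the setting names from the list; the function name is illustrative):

```python
import os

# Hedged sketch: booleans become "0"/"1" strings, mirroring the patches.
def apply_nccl_env(enable_symm_mem: bool, enable_nccl_nvls: bool) -> None:
    os.environ["NCCL_CUMEM_ENABLE"] = str(int(enable_symm_mem))
    os.environ["NCCL_NVLS_ENABLE"] = str(int(enable_nccl_nvls))

apply_nccl_env(enable_symm_mem=False, enable_nccl_nvls=True)
print(os.environ["NCCL_CUMEM_ENABLE"], os.environ["NCCL_NVLS_ENABLE"])  # 0 1
```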
Quick Install
```shell
# For the torch 2.6.0 setup
pip install "sglang[srt,torch-memory-saver]==0.4.6.post4"
pip install cuda-bindings==12.9.0
pip install transformers==4.51.1

# For the torch 2.8.0 setup
pip install "sglang[srt,torch-memory-saver]==0.5.2"

# Or install the combined requirements file matching your torch version
pip install -r requirements_torch260_sglang.txt   # torch 2.6.0
pip install -r requirements_torch280_sglang.txt   # torch 2.8.0
```
Code Evidence
Strict version enforcement from `roll/third_party/sglang/__init__.py:4-23`:
```python
import sglang as sgl

if sgl.__version__ == '0.4.6.post4':
    from roll.third_party.sglang import v046post4_patch
    patch = v046post4_patch
elif sgl.__version__ == '0.5.2':
    from roll.third_party.sglang import v052_patch
    patch = v052_patch
elif sgl.__version__ == '0.5.4.post2':
    from roll.third_party.sglang import v054_patch
    patch = v054_patch
else:
    raise NotImplementedError(
        f"Scale aligner version sglang:{sgl.__version__} is not supported."
    )
```
CUDA environment configuration from SGLang patch `roll/third_party/sglang/v054_patch/engine.py:18-25`:
```python
os.environ["NCCL_CUMEM_ENABLE"] = str(int(server_args.enable_symm_mem))
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
os.environ["CUDA_MODULE_LOADING"] = "AUTO"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NotImplementedError: Scale aligner version sglang:X is not supported` | SGLang version not in supported list | Install exact version: 0.4.6.post4, 0.5.2, or 0.5.4.post2 |
| KV cache building failure | `mem_fraction_static` too low | Increase `mem_fraction_static` (default 0.7) |
| CUDA memory insufficient | `mem_fraction_static` too high | Decrease `mem_fraction_static` |
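The two memory errors above pull `mem_fraction_static` in opposite directions. A hedged nudge helper to make that explicit (purely illustrative; real tuning happens in the worker config, and the error keys here are invented labels):

```python
# Hedged sketch: raise the fraction when the KV cache cannot be built,
# lower it when CUDA runs out of memory; clamp to a sane range.
def adjust_mem_fraction(current: float, error: str) -> float:
    if error == "kv_cache_build_failed":   # static pool too small
        return round(min(current + 0.05, 0.95), 2)
    if error == "cuda_out_of_memory":      # static pool too large
        return round(max(current - 0.05, 0.3), 2)
    return current

print(adjust_mem_fraction(0.7, "kv_cache_build_failed"))  # 0.75
```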
Compatibility Notes
- Strict version enforcement: Unlike vLLM, SGLang raises an error for unsupported versions.
- ROCm: Not tested; NVIDIA CUDA only.
- Ascend NPU: Not supported.
- Continuous batching: `infer_batch_size` setting has no effect (automatic batching).
- Triton bug: SGLang v0.4.6 includes a workaround for a Triton compiler bug (triton-lang/triton#4295).
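Because of the strict enforcement noted above, it can be useful to validate the installed version before launching workers. A hedged, pure-Python mimic of the gate in `roll/third_party/sglang/__init__.py` (the helper name is illustrative; no SGLang import needed):

```python
# Hedged sketch: replicate the strict version check without sglang.
SUPPORTED = {"0.4.6.post4", "0.5.2", "0.5.4.post2"}

def check_sglang_version(version: str) -> str:
    if version not in SUPPORTED:
        raise NotImplementedError(
            f"Scale aligner version sglang:{version} is not supported."
        )
    return version

print(check_sglang_version("0.5.2"))  # passes: 0.5.2
```

In practice one would pass `sglang.__version__` to such a check; anything outside the supported set fails fast, exactly as the framework does.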