Environment: Alibaba ROLL SGLang Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Inference |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
SGLang inference backend environment supporting versions 0.4.6.post4, 0.5.2, and 0.5.4.post2 with strict version enforcement and version-specific patches.
Description
This environment provides the SGLang inference backend for ROLL. Unlike vLLM, which only warns on unsupported versions, SGLang raises `NotImplementedError` for any untested version, so version compliance is strictly mandatory. The framework applies version-specific patches for engine initialization, NCCL configuration, and CUDA settings. SGLang removes `PYTORCH_CUDA_ALLOC_CONF` to avoid allocator conflicts and sets `CUDA_DEVICE_MAX_CONNECTIONS=4` and `CUDA_MODULE_LOADING=AUTO`.
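The environment-variable handling described above can be sketched in isolation. This is a hedged, pure-Python mimic of the patch behavior (no SGLang import); the stale `PYTORCH_CUDA_ALLOC_CONF` value is illustrative:

```python
import os

# Illustrative stale value left over from a previous training process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# The patch drops PYTORCH_CUDA_ALLOC_CONF to avoid allocator conflicts...
os.environ.pop("PYTORCH_CUDA_ALLOC_CONF", None)

# ...and pins the CUDA settings named in the description.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
os.environ["CUDA_MODULE_LOADING"] = "AUTO"

print("PYTORCH_CUDA_ALLOC_CONF" in os.environ)  # False
```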
Usage
Use this environment when configuring `actor_infer` workers with the SGLang backend. SGLang performs continuous batching and manages batch sizes automatically, so the `infer_batch_size` setting has no effect. The number of inference engines is determined by `len(device_mapping) // num_gpus_per_worker`.
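The engine-count formula can be sketched concretely. The values below are illustrative (`device_mapping` and `num_gpus_per_worker` are the setting names from the text; the exact config schema may differ):

```python
# Hedged sketch: how the number of SGLang engines is derived.
device_mapping = [0, 1, 2, 3, 4, 5, 6, 7]  # GPUs assigned to actor_infer
num_gpus_per_worker = 2                    # e.g. tensor-parallel degree 2

num_engines = len(device_mapping) // num_gpus_per_worker
print(num_engines)  # 4 engines, each spanning 2 GPUs
```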
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA support | ROCm not tested for SGLang |
| VRAM | Controlled by `mem_fraction_static` | Default 0.7 (70% of GPU memory) |
Dependencies
Python Packages
- `sglang[srt,torch-memory-saver]` == 0.4.6.post4 (torch 2.6.0) or 0.5.2 (torch 2.8.0)
- `cuda-bindings` == 12.9.0 (torch 2.6.0 setup only)
- `transformers` == 4.51.1 (torch 2.6.0 setup)
- `flashinfer` (installed as SGLang dependency)
Environment Variables
- `CUDA_DEVICE_MAX_CONNECTIONS`: Set to `4` (configured internally)
- `CUDA_MODULE_LOADING`: Set to `AUTO` (configured internally)
- `TRTLLM_ENABLE_PDL`: TensorRT-LLM PDL flag (default `1`)
- `NCCL_CUMEM_ENABLE`: Controlled by `enable_symm_mem` setting
- `NCCL_NVLS_ENABLE`: Controlled by `enable_nccl_nvls` setting
- `FLASHINFER_WORKSPACE_BASE`: Per-worker workspace directory
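The two NCCL variables above are serialized from boolean engine settings. A hedged sketch of that wiring (`enable_symm_mem` and `enable_nccl_nvls` are the setting names from the list; the function name is illustrative):

```python
import os

# Hedged sketch: booleans become "0"/"1" strings, mirroring the patches.
def apply_nccl_env(enable_symm_mem: bool, enable_nccl_nvls: bool) -> None:
    os.environ["NCCL_CUMEM_ENABLE"] = str(int(enable_symm_mem))
    os.environ["NCCL_NVLS_ENABLE"] = str(int(enable_nccl_nvls))

apply_nccl_env(enable_symm_mem=False, enable_nccl_nvls=True)
print(os.environ["NCCL_CUMEM_ENABLE"], os.environ["NCCL_NVLS_ENABLE"])  # 0 1
```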
Quick Install
```shell
# For the torch 2.6.0 setup
pip install "sglang[srt,torch-memory-saver]==0.4.6.post4"
pip install cuda-bindings==12.9.0
pip install transformers==4.51.1

# For the torch 2.8.0 setup
pip install "sglang[srt,torch-memory-saver]==0.5.2"

# Or install the combined requirements file matching your torch version
pip install -r requirements_torch260_sglang.txt   # torch 2.6.0
pip install -r requirements_torch280_sglang.txt   # torch 2.8.0
```
Code Evidence
Strict version enforcement from `roll/third_party/sglang/__init__.py:4-23`:
```python
import sglang as sgl

if sgl.__version__ == '0.4.6.post4':
    from roll.third_party.sglang import v046post4_patch
    patch = v046post4_patch
elif sgl.__version__ == '0.5.2':
    from roll.third_party.sglang import v052_patch
    patch = v052_patch
elif sgl.__version__ == '0.5.4.post2':
    from roll.third_party.sglang import v054_patch
    patch = v054_patch
else:
    raise NotImplementedError(
        f"Scale aligner version sglang:{sgl.__version__} is not supported."
    )
```
CUDA environment configuration from SGLang patch `roll/third_party/sglang/v054_patch/engine.py:18-25`:
```python
os.environ["NCCL_CUMEM_ENABLE"] = str(int(server_args.enable_symm_mem))
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "4"
os.environ["CUDA_MODULE_LOADING"] = "AUTO"
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NotImplementedError: Scale aligner version sglang:X is not supported` | SGLang version not in supported list | Install exact version: 0.4.6.post4, 0.5.2, or 0.5.4.post2 |
| KV cache building failure | `mem_fraction_static` too low | Increase `mem_fraction_static` (default 0.7) |
| CUDA memory insufficient | `mem_fraction_static` too high | Decrease `mem_fraction_static` |
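The two memory errors above pull `mem_fraction_static` in opposite directions. A hedged nudge helper to make that explicit (purely illustrative; real tuning happens in the worker config, and the error keys here are invented labels):

```python
# Hedged sketch: raise the fraction when the KV cache cannot be built,
# lower it when CUDA runs out of memory; clamp to a sane range.
def adjust_mem_fraction(current: float, error: str) -> float:
    if error == "kv_cache_build_failed":   # static pool too small
        return round(min(current + 0.05, 0.95), 2)
    if error == "cuda_out_of_memory":      # static pool too large
        return round(max(current - 0.05, 0.3), 2)
    return current

print(adjust_mem_fraction(0.7, "kv_cache_build_failed"))  # 0.75
```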
Compatibility Notes
- Strict version enforcement: Unlike vLLM, SGLang raises an error for unsupported versions.
- ROCm: Not tested; NVIDIA CUDA only.
- Ascend NPU: Not supported.
- Continuous batching: `infer_batch_size` setting has no effect (automatic batching).
- Triton bug: SGLang v0.4.6 includes a workaround for a Triton compiler bug (triton-lang/triton#4295).
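Because of the strict enforcement noted above, it can be useful to validate the installed version before launching workers. A hedged, pure-Python mimic of the gate in `roll/third_party/sglang/__init__.py` (the helper name is illustrative; no SGLang import needed):

```python
# Hedged sketch: replicate the strict version check without sglang.
SUPPORTED = {"0.4.6.post4", "0.5.2", "0.5.4.post2"}

def check_sglang_version(version: str) -> str:
    if version not in SUPPORTED:
        raise NotImplementedError(
            f"Scale aligner version sglang:{version} is not supported."
        )
    return version

print(check_sglang_version("0.5.2"))  # passes: 0.5.2
```

In practice one would pass `sglang.__version__` to such a check; anything outside the supported set fails fast, exactly as the framework does.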