Environment: LMSYS FastChat GPU CUDA Inference
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
CUDA GPU environment for running FastChat model workers with HuggingFace Transformers, supporting multi-GPU inference, 8-bit quantization, and multiple accelerator backends (CUDA, XPU, NPU, MPS).
Description
This environment provides GPU-accelerated inference for FastChat's model serving layer. It supports five device types: CUDA (NVIDIA), XPU (Intel), NPU (Huawei Ascend), MPS (Apple Silicon), and CPU. The primary path uses CUDA with float16 precision, automatic multi-GPU memory distribution at 85% utilization, and optional 8-bit quantization via BitsAndBytes. Specialized inference backends (vLLM, SGLang, LightLLM, ExLlamaV2, AWQ, GPTQ, xFasterTransformer, MLX, DashInfer) each require their own additional packages.
Usage
Use this environment for any Model Serving workflow that loads and runs models through FastChat workers. It is the mandatory prerequisite for running `model_worker.py`, `vllm_worker.py`, `sglang_worker.py`, and all other inference worker backends.
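A typical launch sequence, as a sketch: the module paths are FastChat's real entry points, but the model path and flag combination are illustrative.

```shell
# Start the controller, then attach a CUDA model worker to it.
python3 -m fastchat.serve.controller &

# Illustrative: a 7B model on two GPUs with float16 (the CUDA default)
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --device cuda \
    --num-gpus 2

# Single-GPU 8-bit variant (8-bit is not supported across multiple GPUs):
# python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 --load-8bit
```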
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04+ (Linux) | Docker image uses `nvidia/cuda:12.2.0-runtime-ubuntu20.04` |
| Hardware | NVIDIA GPU with 16GB+ VRAM | A100 preferred; consumer GPUs (RTX 3090/4090) work for smaller models |
| CUDA | 12.2 (Docker default) | Other CUDA versions may work depending on PyTorch build |
| Disk | 50GB+ SSD | For model weight caching at `~/.cache/huggingface` |
Dependencies
System Packages
- `nvidia-driver` — NVIDIA GPU driver
- `cuda-toolkit` — CUDA runtime (12.2 in Docker)
Python Packages (model_worker extra)
- `torch` — PyTorch with CUDA support
- `transformers` >= 4.31.0 — Model loading
- `accelerate` >= 0.21 — Multi-GPU distribution and device mapping
- `peft` — LoRA adapter loading (lazy import)
- `sentencepiece` — LLaMA tokenizer support
- `protobuf` — Protocol buffers
Optional Backend Packages
- `vllm` — vLLM high-throughput inference (unguarded top-level import in `vllm_worker.py`)
- `sglang` — SGLang inference (unguarded top-level import in `sglang_worker.py`)
- `lightllm` — LightLLM inference (unguarded top-level import in `lightllm_worker.py`)
- `exllamav2` — ExLlamaV2 quantized inference (sys.exit on import failure)
- `tinychat` (from AWQ) — AWQ quantized inference (sys.exit on import failure)
- `GPTQ-for-LLaMa` — GPTQ quantization (cloned repo at `../repositories/GPTQ-for-LLaMa`)
- `xfastertransformer` — Intel xFasterTransformer CPU inference
- `mlx`, `mlx_lm` — Apple MLX inference (Apple Silicon only)
- `dashinfer` — DashInfer CPU inference
- `intel_extension_for_pytorch` — Required for XPU, optional for CPU with AVX-512/AMX
- `torch_npu` — Required for Huawei Ascend NPU
Credentials
The following environment variables must be set for specific features:
- `CUDA_VISIBLE_DEVICES`: Set via `--gpus` argument to restrict GPU visibility
- `XPU_VISIBLE_DEVICES`: Set via `--gpus` argument for Intel XPU
- `FASTCHAT_USE_MODELSCOPE`: Set to `true` to download from ModelScope instead of HuggingFace
- `PEFT_SHARE_BASE_WEIGHTS`: Set to `true` to share base weights across PEFT adapters
- `CPU_ISA`: Set to `avx512_bf16` or `amx` for Intel CPU optimization
- `SSL_KEYFILE` / `SSL_CERTFILE`: Required when `--ssl` flag is used
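As a sketch, these variables can be read the same way FastChat does, via `os.environ`; the helper names below are illustrative, not FastChat's own.

```python
import os
from typing import Optional


def modelscope_enabled() -> bool:
    # FASTCHAT_USE_MODELSCOPE switches the model download source from
    # HuggingFace Hub to ModelScope when set to "true".
    return os.environ.get("FASTCHAT_USE_MODELSCOPE", "false").lower() == "true"


def cpu_isa() -> Optional[str]:
    # CPU_ISA selects the Intel CPU code path ("avx512_bf16" or "amx");
    # None means the plain float32 CPU path.
    return os.environ.get("CPU_ISA")


os.environ["FASTCHAT_USE_MODELSCOPE"] = "true"
print(modelscope_enabled())  # True
```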
Quick Install
```shell
# Standard CUDA inference
pip install "fschat[model_worker]"

# With vLLM backend
pip install "fschat[model_worker]" vllm

# With ExLlamaV2 backend
pip install "fschat[model_worker]" exllamav2

# Docker (recommended for production)
docker compose -f docker/docker-compose.yml up
```
Code Evidence
Device selection and dtype mapping from `fastchat/model/model_adapter.py:226-281`:
```python
if device == "cpu":
    kwargs = {"torch_dtype": torch.float32}
    if CPU_ISA in ["avx512_bf16", "amx"]:
        import intel_extension_for_pytorch as ipex

        kwargs = {"torch_dtype": torch.bfloat16}
elif device == "cuda":
    kwargs = {"torch_dtype": torch.float16}
elif device == "mps":
    kwargs = {"torch_dtype": torch.float16}
elif device == "xpu":
    kwargs = {"torch_dtype": torch.bfloat16}
elif device == "npu":
    kwargs = {"torch_dtype": torch.float16}
```
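The selection logic above reduces to a device→dtype table plus one CPU special case. A torch-free sketch (names are illustrative):

```python
# Default dtype per device, mirroring the selection logic above.
# "cpu" shows its float32 default; bfloat16 applies only when
# CPU_ISA is "avx512_bf16" or "amx".
DEFAULT_DTYPE = {
    "cpu": "float32",
    "cuda": "float16",
    "mps": "float16",
    "xpu": "bfloat16",
    "npu": "float16",
}


def dtype_for(device, cpu_isa=None):
    if device == "cpu" and cpu_isa in ("avx512_bf16", "amx"):
        return "bfloat16"
    return DEFAULT_DTYPE[device]


print(dtype_for("cuda"))        # float16
print(dtype_for("cpu", "amx"))  # bfloat16
```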
Multi-GPU memory allocation at 85% from `fastchat/model/model_adapter.py:241-249`:
```python
if num_gpus != 1:
    kwargs["device_map"] = "auto"
    if max_gpu_memory is None:
        kwargs["device_map"] = "sequential"
        available_gpu_memory = get_gpu_memory(num_gpus)
        kwargs["max_memory"] = {
            i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
            for i in range(num_gpus)
        }
```
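The 85% headroom computation is a pure function of the per-GPU free memory (FastChat's `get_gpu_memory` returns it in GiB per device). A standalone sketch, with an illustrative function name:

```python
def max_memory_map(available_gpu_memory, utilization=0.85):
    # Build the accelerate-style max_memory dict, device index -> "NGiB",
    # keeping ~15% headroom for activations and CUDA overhead.
    return {
        i: f"{int(mem * utilization)}GiB"
        for i, mem in enumerate(available_gpu_memory)
    }


print(max_memory_map([24.0, 24.0]))  # {0: '20GiB', 1: '20GiB'}
```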
CUDA OOM error handling from `fastchat/serve/model_worker.py:133-138`:
```python
except torch.cuda.OutOfMemoryError as e:
    ret = {
        "text": f"{SERVER_ERROR_MSG}\n\n({e})",
        "error_code": ErrorCode.CUDA_OUT_OF_MEMORY,
    }
```
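The pattern is to turn an OOM into a structured error payload rather than crash the worker. A self-contained sketch using a stand-in exception class (the real code catches `torch.cuda.OutOfMemoryError`; the error-code value here is illustrative):

```python
SERVER_ERROR_MSG = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
CUDA_OUT_OF_MEMORY = 50002  # illustrative value


class FakeCudaOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""


def generate_safely(generate_fn):
    try:
        return {"text": generate_fn(), "error_code": 0}
    except FakeCudaOOM as e:
        # Return a structured error so the controller can surface it
        # to the client instead of the worker process dying.
        return {
            "text": f"{SERVER_ERROR_MSG}\n\n({e})",
            "error_code": CUDA_OUT_OF_MEMORY,
        }


def boom():
    raise FakeCudaOOM("CUDA out of memory")


print(generate_safely(boom)["error_code"])  # 50002
```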
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `torch.cuda.OutOfMemoryError` | Model too large for GPU VRAM | Use `--max-gpu-memory` to limit allocation, or use quantization (`--load-8bit`) |
| `8-bit quantization is not supported for multi-gpu inference` | BitsAndBytes 8-bit with multiple GPUs | Use single GPU for 8-bit, or use AWQ/GPTQ for multi-GPU quantized inference |
| `Intel Extension for PyTorch is not installed, but is required for xpu inference` | IPEX missing on XPU device | `pip install intel_extension_for_pytorch` |
| `Error: Failed to load Exllamav2` | ExLlamaV2 package not installed | `pip install exllamav2` |
| `Error: Failed to load GPTQ-for-LLaMa` | GPTQ repo not cloned | Clone `GPTQ-for-LLaMa` into `../repositories/` |
Compatibility Notes
- Multi-GPU: Uses `device_map="sequential"` (not `"auto"`) when GPU memory sizes differ, to correctly handle heterogeneous setups.
- MPS (Apple Silicon): Requires transformers >= 4.35.0 to avoid in-place operation bugs; older versions get a monkey patch applied automatically.
- XPU (Intel GPU): Uses bfloat16 dtype and requires `intel_extension_for_pytorch`. Model is optimized via `torch.xpu.optimize()`.
- NPU (Huawei Ascend): Requires `torch_npu` package. Sets device via `torch_npu.npu.set_device("npu:0")`.
- CPU offloading: Linux + CUDA only. Uses half of available system RAM (`psutil.virtual_memory().available / 2`).
- vLLM/SGLang/LightLLM: These perform unguarded top-level imports, so they must be installed before the worker script can even be imported.
- Docker: The `docker-compose.yml` reserves 1 NVIDIA GPU per model worker and mounts a named volume for the HuggingFace cache.
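The CPU-offload budget in the notes above is simply half of currently-available RAM. A sketch of that arithmetic as a pure function (the function name is illustrative; FastChat feeds `psutil.virtual_memory().available` in here):

```python
def cpu_offload_budget(available_bytes):
    # Half of available system RAM, formatted as an accelerate-style
    # "<N>GiB" budget string.
    gib = available_bytes / 2 / (1024 ** 3)
    return f"{int(gib)}GiB"


# With 64 GiB of free RAM, the offload budget is 32 GiB:
print(cpu_offload_budget(64 * 1024 ** 3))  # 32GiB
```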