Environment: LMSYS FastChat GPU CUDA Inference
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
CUDA GPU environment for running FastChat model workers with HuggingFace Transformers, supporting multi-GPU inference, 8-bit quantization, and multiple accelerator backends (CUDA, XPU, NPU, MPS).
Description
This environment provides GPU-accelerated inference for FastChat's model serving layer. It supports five device types: CUDA (NVIDIA), XPU (Intel), NPU (Huawei Ascend), MPS (Apple Silicon), and CPU. The primary path uses CUDA with float16 precision, automatic multi-GPU memory distribution at 85% utilization, and optional 8-bit quantization via BitsAndBytes. Specialized inference backends (vLLM, SGLang, LightLLM, ExLlamaV2, AWQ, GPTQ, xFasterTransformer, MLX, DashInfer) each require their own additional packages.
Usage
Use this environment for any Model Serving workflow that loads and runs models through FastChat workers. It is the mandatory prerequisite for running `model_worker.py`, `vllm_worker.py`, `sglang_worker.py`, and all other inference worker backends.
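A typical launch sequence, as a sketch: the module paths are FastChat's real entry points, but the model path and flag combination are illustrative.

```shell
# Start the controller, then attach a CUDA model worker to it.
python3 -m fastchat.serve.controller &

# Illustrative: a 7B model on two GPUs with float16 (the CUDA default)
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --device cuda \
    --num-gpus 2

# Single-GPU 8-bit variant (8-bit is not supported across multiple GPUs):
# python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5 --load-8bit
```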
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04+ (Linux) | Docker image uses `nvidia/cuda:12.2.0-runtime-ubuntu20.04` |
| Hardware | NVIDIA GPU with 16GB+ VRAM | A100 preferred; consumer GPUs (RTX 3090/4090) work for smaller models |
| CUDA | 12.2 (Docker default) | Other CUDA versions may work depending on PyTorch build |
| Disk | 50GB+ SSD | For model weight caching at `~/.cache/huggingface` |
Dependencies
System Packages
- `nvidia-driver` — NVIDIA GPU driver
- `cuda-toolkit` — CUDA runtime (12.2 in Docker)
Python Packages (model_worker extra)
- `torch` — PyTorch with CUDA support
- `transformers` >= 4.31.0 — Model loading
- `accelerate` >= 0.21 — Multi-GPU distribution and device mapping
- `peft` — LoRA adapter loading (lazy import)
- `sentencepiece` — LLaMA tokenizer support
- `protobuf` — Protocol buffers
Optional Backend Packages
- `vllm` — vLLM high-throughput inference (unguarded top-level import in `vllm_worker.py`)
- `sglang` — SGLang inference (unguarded top-level import in `sglang_worker.py`)
- `lightllm` — LightLLM inference (unguarded top-level import in `lightllm_worker.py`)
- `exllamav2` — ExLlamaV2 quantized inference (sys.exit on import failure)
- `tinychat` (from AWQ) — AWQ quantized inference (sys.exit on import failure)
- `GPTQ-for-LLaMa` — GPTQ quantization (cloned repo at `../repositories/GPTQ-for-LLaMa`)
- `xfastertransformer` — Intel xFasterTransformer CPU inference
- `mlx`, `mlx_lm` — Apple MLX inference (Apple Silicon only)
- `dashinfer` — DashInfer CPU inference
- `intel_extension_for_pytorch` — Required for XPU, optional for CPU with AVX-512/AMX
- `torch_npu` — Required for Huawei Ascend NPU
Credentials
The following environment variables must be set for specific features:
- `CUDA_VISIBLE_DEVICES`: Set via `--gpus` argument to restrict GPU visibility
- `XPU_VISIBLE_DEVICES`: Set via `--gpus` argument for Intel XPU
- `FASTCHAT_USE_MODELSCOPE`: Set to `true` to download from ModelScope instead of HuggingFace
- `PEFT_SHARE_BASE_WEIGHTS`: Set to `true` to share base weights across PEFT adapters
- `CPU_ISA`: Set to `avx512_bf16` or `amx` for Intel CPU optimization
- `SSL_KEYFILE` / `SSL_CERTFILE`: Required when `--ssl` flag is used
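As a sketch, these variables can be read the same way FastChat does, via `os.environ`; the helper names below are illustrative, not FastChat's own.

```python
import os
from typing import Optional


def modelscope_enabled() -> bool:
    # FASTCHAT_USE_MODELSCOPE switches the model download source from
    # HuggingFace Hub to ModelScope when set to "true".
    return os.environ.get("FASTCHAT_USE_MODELSCOPE", "false").lower() == "true"


def cpu_isa() -> Optional[str]:
    # CPU_ISA selects the Intel CPU code path ("avx512_bf16" or "amx");
    # None means the plain float32 CPU path.
    return os.environ.get("CPU_ISA")


os.environ["FASTCHAT_USE_MODELSCOPE"] = "true"
print(modelscope_enabled())  # True
```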
Quick Install
```shell
# Standard CUDA inference
pip install "fschat[model_worker]"

# With vLLM backend
pip install "fschat[model_worker]" vllm

# With ExLlamaV2 backend
pip install "fschat[model_worker]" exllamav2

# Docker (recommended for production)
docker compose -f docker/docker-compose.yml up
```
Code Evidence
Device selection and dtype mapping from `fastchat/model/model_adapter.py:226-281`:
```python
if device == "cpu":
    kwargs = {"torch_dtype": torch.float32}
    if CPU_ISA in ["avx512_bf16", "amx"]:
        import intel_extension_for_pytorch as ipex

        kwargs = {"torch_dtype": torch.bfloat16}
elif device == "cuda":
    kwargs = {"torch_dtype": torch.float16}
elif device == "mps":
    kwargs = {"torch_dtype": torch.float16}
elif device == "xpu":
    kwargs = {"torch_dtype": torch.bfloat16}
elif device == "npu":
    kwargs = {"torch_dtype": torch.float16}
```
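The selection logic above reduces to a device→dtype table plus one CPU special case. A torch-free sketch (names are illustrative):

```python
# Default dtype per device, mirroring the selection logic above.
# "cpu" shows its float32 default; bfloat16 applies only when
# CPU_ISA is "avx512_bf16" or "amx".
DEFAULT_DTYPE = {
    "cpu": "float32",
    "cuda": "float16",
    "mps": "float16",
    "xpu": "bfloat16",
    "npu": "float16",
}


def dtype_for(device, cpu_isa=None):
    if device == "cpu" and cpu_isa in ("avx512_bf16", "amx"):
        return "bfloat16"
    return DEFAULT_DTYPE[device]


print(dtype_for("cuda"))        # float16
print(dtype_for("cpu", "amx"))  # bfloat16
```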
Multi-GPU memory allocation at 85% from `fastchat/model/model_adapter.py:241-249`:
```python
if num_gpus != 1:
    kwargs["device_map"] = "auto"
    if max_gpu_memory is None:
        kwargs["device_map"] = "sequential"
        available_gpu_memory = get_gpu_memory(num_gpus)
        kwargs["max_memory"] = {
            i: str(int(available_gpu_memory[i] * 0.85)) + "GiB"
            for i in range(num_gpus)
        }
```
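The 85% headroom computation is a pure function of the per-GPU free memory (FastChat's `get_gpu_memory` returns it in GiB per device). A standalone sketch, with an illustrative function name:

```python
def max_memory_map(available_gpu_memory, utilization=0.85):
    # Build the accelerate-style max_memory dict, device index -> "NGiB",
    # keeping ~15% headroom for activations and CUDA overhead.
    return {
        i: f"{int(mem * utilization)}GiB"
        for i, mem in enumerate(available_gpu_memory)
    }


print(max_memory_map([24.0, 24.0]))  # {0: '20GiB', 1: '20GiB'}
```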
CUDA OOM error handling from `fastchat/serve/model_worker.py:133-138`:
```python
except torch.cuda.OutOfMemoryError as e:
    ret = {
        "text": f"{SERVER_ERROR_MSG}\n\n({e})",
        "error_code": ErrorCode.CUDA_OUT_OF_MEMORY,
    }
```
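The pattern is to turn an OOM into a structured error payload rather than crash the worker. A self-contained sketch using a stand-in exception class (the real code catches `torch.cuda.OutOfMemoryError`; the error-code value here is illustrative):

```python
SERVER_ERROR_MSG = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
CUDA_OUT_OF_MEMORY = 50002  # illustrative value


class FakeCudaOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""


def generate_safely(generate_fn):
    try:
        return {"text": generate_fn(), "error_code": 0}
    except FakeCudaOOM as e:
        # Return a structured error so the controller can surface it
        # to the client instead of the worker process dying.
        return {
            "text": f"{SERVER_ERROR_MSG}\n\n({e})",
            "error_code": CUDA_OUT_OF_MEMORY,
        }


def boom():
    raise FakeCudaOOM("CUDA out of memory")


print(generate_safely(boom)["error_code"])  # 50002
```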
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `torch.cuda.OutOfMemoryError` | Model too large for GPU VRAM | Use `--max-gpu-memory` to limit allocation, or use quantization (`--load-8bit`) |
| `8-bit quantization is not supported for multi-gpu inference` | BitsAndBytes 8-bit with multiple GPUs | Use single GPU for 8-bit, or use AWQ/GPTQ for multi-GPU quantized inference |
| `Intel Extension for PyTorch is not installed, but is required for xpu inference` | IPEX missing on XPU device | `pip install intel_extension_for_pytorch` |
| `Error: Failed to load Exllamav2` | ExLlamaV2 package not installed | `pip install exllamav2` |
| `Error: Failed to load GPTQ-for-LLaMa` | GPTQ repo not cloned | Clone `GPTQ-for-LLaMa` into `../repositories/` |
Compatibility Notes
- Multi-GPU: Uses `device_map="sequential"` (not `"auto"`) when GPU memory sizes differ, to correctly handle heterogeneous setups.
- MPS (Apple Silicon): Requires transformers >= 4.35.0 to avoid in-place operation bugs; older versions get a monkey patch applied automatically.
- XPU (Intel GPU): Uses bfloat16 dtype and requires `intel_extension_for_pytorch`. Model is optimized via `torch.xpu.optimize()`.
- NPU (Huawei Ascend): Requires `torch_npu` package. Sets device via `torch_npu.npu.set_device("npu:0")`.
- CPU offloading: Linux + CUDA only. Uses half of available system RAM (`psutil.virtual_memory().available / 2`).
- vLLM/SGLang/LightLLM: These perform unguarded top-level imports, so they must be installed before the worker script can even be imported.
- Docker: The `docker-compose.yml` reserves 1 NVIDIA GPU per model worker and mounts a named volume for the HuggingFace cache.
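The CPU-offload budget in the notes above is simply half of currently-available RAM. A sketch of that arithmetic as a pure function (the function name is illustrative; FastChat feeds `psutil.virtual_memory().available` in here):

```python
def cpu_offload_budget(available_bytes):
    # Half of available system RAM, formatted as an accelerate-style
    # "<N>GiB" budget string.
    gib = available_bytes / 2 / (1024 ** 3)
    return f"{int(gib)}GiB"


# With 64 GiB of free RAM, the offload budget is 32 GiB:
print(cpu_offload_budget(64 * 1024 ** 3))  # 32GiB
```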