Environment: BigScience Workshop Petals CUDA Server
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
NVIDIA CUDA GPU server environment with PyTorch, hivemind, bitsandbytes, and speedtest-cli for running Petals server nodes that host transformer blocks.
Description
This environment provides the full stack required for operating a Petals server node. It extends the Python_Hivemind environment with GPU-specific requirements: NVIDIA CUDA drivers and toolkit for GPU computation, bitsandbytes for INT8/NF4 quantization, speedtest-cli for network throughput measurement, and psutil for system resource monitoring. The server auto-detects available hardware (CUDA > MPS > CPU) and selects the appropriate dtype and quantization strategy. Tensor parallelism across multiple GPUs is supported for CUDA devices.
Usage
Required for all server-side operations: launching a server via CLI, server initialization and resource estimation, block selection, block loading/quantization/serving, health monitoring, and rebalancing. Without a CUDA GPU, the server can still run on MPS (macOS) or CPU but with significant performance limitations.
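For example, a server node can be launched from the CLI. The model repo and flag values below are illustrative; `--num_blocks` and `--torch_dtype` are the flags referenced elsewhere in this document:

```shell
# Launch a Petals server hosting blocks of a public model.
# Requires a CUDA GPU unless --num_blocks is given explicitly.
python -m petals.cli.run_server bigscience/bloom-560m \
    --num_blocks 16 \
    --torch_dtype float16
```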
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS | Linux for CUDA; macOS for MPS fallback |
| Hardware | NVIDIA GPU (CUDA) | Minimum VRAM to serve at least 1 transformer block |
| VRAM | Varies by model | 16GB+ recommended for 7B models with NF4 quantization |
| Network | Public internet access | For P2P swarm, DHT, and reachability checks |
| Disk | SSD recommended | For model weight caching with LRU eviction |
| Python | >= 3.8 | Supports 3.8, 3.9, 3.10, 3.11 |
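As a rough sketch of how VRAM bounds the number of servable blocks: subtract a fixed overhead, then divide by the per-block footprint. The helper name and all sizes below are hypothetical placeholders, not Petals' actual estimator, which measures block size after quantization and reserves space for attention caches:

```python
def estimate_max_blocks(vram_bytes: int, bytes_per_block: int, overhead_bytes: int) -> int:
    """Upper bound on how many transformer blocks fit in VRAM (illustrative)."""
    usable = vram_bytes - overhead_bytes
    return max(0, usable // bytes_per_block)

GiB = 1024 ** 3
# Hypothetical: 16 GiB GPU, ~0.5 GiB per NF4-quantized block, 2 GiB overhead.
print(estimate_max_blocks(16 * GiB, GiB // 2, 2 * GiB))  # 28
```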
Dependencies
System Packages
- NVIDIA CUDA drivers (for GPU support)
- `fcntl` (Python standard library, available on Linux only; used for file locking in the throughput cache)
Python Packages
All Python_Hivemind packages, plus:
- `speedtest-cli` == 2.1.3 (network throughput measurement)
- `bitsandbytes` == 0.41.1 (INT8/NF4 quantization)
- `tensor_parallel` == 1.0.23 (multi-GPU tensor parallelism)
- `psutil` (system memory monitoring)
- `cpufeature` >= 0.2.0 (x86_64 only)
- `peft` == 0.8.2 (LoRA adapter support)
- `accelerate` >= 0.27.2 (model loading utilities)
Credentials
- `HF_TOKEN` (optional): Required for serving gated models (e.g., Llama-2).
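To serve a gated model, export the token before launching the server (the value shown is a placeholder):

```shell
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx  # your Hugging Face access token
```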
Quick Install
pip install petals
# speedtest-cli is included as a dependency
# CUDA drivers must be installed separately via system package manager
Code Evidence
Device auto-detection from `src/petals/server/server.py:160-170`:
if device is None:
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
device = torch.device(device)
if device.type == "cuda" and device.index is None:
    device = torch.device(device.type, index=0)
GPU requirement for automatic block count from `src/petals/server/server.py:276-279`:
assert self.device.type in ("cuda", "mps"), (
    "GPU is not available. If you want to run a CPU-only server, please specify --num_blocks. "
    "CPU-only servers in the public swarm are discouraged since they are much slower"
)
float16 not supported on CPU from `src/petals/server/server.py:173-176`:
if device.type == "cpu" and torch_dtype == torch.float16:
    raise ValueError(
        f"Type float16 is not supported on CPU. Please use --torch_dtype float32 or --torch_dtype bfloat16"
    )
bfloat16 not supported on MPS from `src/petals/server/server.py:177-179`:
if device.type == "mps" and torch_dtype == torch.bfloat16:
    logger.warning(f"Type bfloat16 is not supported on MPS, using float16 instead")
    torch_dtype = torch.float16
speedtest-cli requirement from `src/petals/server/throughput.py:23-34`:
try:
    import speedtest
except ImportError:
    raise ImportError("Please `pip install speedtest-cli==2.1.3`")

if not hasattr(speedtest, "Speedtest"):
    raise ImportError(
        "You are using the wrong speedtest module. Please replace speedtest with speedtest-cli.\n"
        "To do that, run `pip uninstall -y speedtest`. ..."
    )
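The guard above follows a common pattern: import the module, then verify it exposes the expected API (catching the name collision between `speedtest` and `speedtest-cli`). A generic, standalone sketch of the same pattern; the helper name `require_module` is ours, not from Petals:

```python
import importlib

def require_module(name: str, attr: str, hint: str):
    """Import `name` and verify it exposes `attr`, mirroring the speedtest guard."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        raise ImportError(hint)
    if not hasattr(module, attr):
        raise ImportError(f"Module {name!r} lacks {attr!r}. {hint}")
    return module

# The stdlib `json` module stands in for speedtest here.
json_mod = require_module("json", "loads", "Please `pip install ...`")
```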
Tensor parallelism GPU balance check from `src/petals/server/server.py:288-293`:
if max(memory_per_device) / min(memory_per_device) > 1.5:
    raise ValueError(
        "GPU devices have highly uneven memory, which makes tensor parallelism inefficient. "
        "Please launch individual servers on each GPU or set --num_blocks manually"
    )
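The same ratio test can be reproduced in isolation. A standalone sketch; `check_gpu_balance` is our name, and memory values are in arbitrary consistent units (e.g. GiB):

```python
def check_gpu_balance(memory_per_device, max_ratio=1.5):
    """Return the max/min VRAM ratio, raising if tensor parallelism would be inefficient."""
    ratio = max(memory_per_device) / min(memory_per_device)
    if ratio > max_ratio:
        raise ValueError(
            "GPU devices have highly uneven memory; "
            "launch individual servers per GPU or set --num_blocks manually"
        )
    return ratio

print(check_gpu_balance([16, 16]))  # 1.0 -- a balanced pair is accepted
```

Note that a pair at exactly the 1.5x boundary (e.g. 24 GiB and 16 GiB) still passes, since the check uses a strict `>` comparison.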
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please pip install speedtest-cli==2.1.3` | speedtest-cli not installed | `pip install speedtest-cli==2.1.3` |
| `You are using the wrong speedtest module` | Wrong `speedtest` package installed | `pip uninstall -y speedtest && pip install speedtest-cli==2.1.3` |
| `ValueError: Type float16 is not supported on CPU` | Running on CPU with float16 dtype | Use `--torch_dtype float32` or `--torch_dtype bfloat16` |
| `AssertionError: GPU is not available` | No GPU detected, --num_blocks not specified | Specify `--num_blocks` manually for CPU servers |
| `AssertionError: Your GPU does not have enough memory to serve at least one block` | GPU VRAM too small for model | Use a larger GPU or try a smaller model |
| `GPU devices have highly uneven memory` | Tensor parallelism with mismatched GPUs (>1.5x ratio) | Run separate servers per GPU or set `--num_blocks` manually |
Compatibility Notes
- CUDA (NVIDIA): Full support including NF4/INT8 quantization, tensor parallelism, and CUDA graphs. NF4 quantization is the default on CUDA devices.
- MPS (Apple Silicon): Supported as fallback. bfloat16 automatically downgraded to float16. No quantization support. No tensor parallelism.
- CPU: Functional but discouraged for public swarm. float16 not supported; use float32 or bfloat16. Must specify `--num_blocks` manually.
- Tensor Parallelism: Only tested with the BLOOM model. GPUs must have similar memory (within a 1.5x ratio). A warning is issued for GPUs with differing compute capabilities.
- Multi-GPU: Tensor parallelism is available only on CUDA; exceeding the 5% memory-waste threshold triggers a warning.
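The device/dtype rules above can be summarized in a single helper. This is a sketch using plain strings for readability; Petals itself works with `torch.device` and torch dtypes, and the function name is ours:

```python
def resolve_dtype(device_type: str, torch_dtype: str) -> str:
    """Apply the device/dtype compatibility rules described above."""
    if device_type == "cpu" and torch_dtype == "float16":
        raise ValueError("float16 is not supported on CPU; use float32 or bfloat16")
    if device_type == "mps" and torch_dtype == "bfloat16":
        return "float16"  # MPS lacks bfloat16; downgraded with a warning
    return torch_dtype

print(resolve_dtype("mps", "bfloat16"))  # float16
```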