Environment: BigScience Workshop Petals CUDA Server
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
NVIDIA CUDA GPU server environment with PyTorch, hivemind, bitsandbytes, and speedtest-cli for running Petals server nodes that host transformer blocks.
Description
This environment provides the full stack required for operating a Petals server node. It extends the Python_Hivemind environment with GPU-specific requirements: NVIDIA CUDA drivers and toolkit for GPU computation, bitsandbytes for INT8/NF4 quantization, speedtest-cli for network throughput measurement, and psutil for system resource monitoring. The server auto-detects available hardware (CUDA > MPS > CPU) and selects the appropriate dtype and quantization strategy. Tensor parallelism across multiple GPUs is supported for CUDA devices.
Usage
Required for all server-side operations: launching a server via CLI, server initialization and resource estimation, block selection, block loading/quantization/serving, health monitoring, and rebalancing. Without a CUDA GPU, the server can still run on MPS (macOS) or CPU but with significant performance limitations.
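For example, a server node can be launched from the CLI. The model repo and flag values below are illustrative; `--num_blocks` and `--torch_dtype` are the flags referenced elsewhere in this document:

```shell
# Launch a Petals server hosting blocks of a public model.
# Requires a CUDA GPU unless --num_blocks is given explicitly.
python -m petals.cli.run_server bigscience/bloom-560m \
    --num_blocks 16 \
    --torch_dtype float16
```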
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), macOS | Linux for CUDA; macOS for MPS fallback |
| Hardware | NVIDIA GPU (CUDA) | Minimum VRAM to serve at least 1 transformer block |
| VRAM | Varies by model | 16GB+ recommended for 7B models with NF4 quantization |
| Network | Public internet access | For P2P swarm, DHT, and reachability checks |
| Disk | SSD recommended | For model weight caching with LRU eviction |
| Python | >= 3.8 | Supports 3.8, 3.9, 3.10, 3.11 |
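As a rough sketch of how VRAM bounds the number of servable blocks: subtract a fixed overhead, then divide by the per-block footprint. The helper name and all sizes below are hypothetical placeholders, not Petals' actual estimator, which measures block size after quantization and reserves space for attention caches:

```python
def estimate_max_blocks(vram_bytes: int, bytes_per_block: int, overhead_bytes: int) -> int:
    """Upper bound on how many transformer blocks fit in VRAM (illustrative)."""
    usable = vram_bytes - overhead_bytes
    return max(0, usable // bytes_per_block)

GiB = 1024 ** 3
# Hypothetical: 16 GiB GPU, ~0.5 GiB per NF4-quantized block, 2 GiB overhead.
print(estimate_max_blocks(16 * GiB, GiB // 2, 2 * GiB))  # 28
```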
Dependencies
System Packages
- NVIDIA CUDA drivers (for GPU support)
- `fcntl` (Python standard library, available on Linux only; used for file locking in the throughput cache)
Python Packages
All Python_Hivemind packages, plus:
- `speedtest-cli` == 2.1.3 (network throughput measurement)
- `bitsandbytes` == 0.41.1 (INT8/NF4 quantization)
- `tensor_parallel` == 1.0.23 (multi-GPU tensor parallelism)
- `psutil` (system memory monitoring)
- `cpufeature` >= 0.2.0 (x86_64 only)
- `peft` == 0.8.2 (LoRA adapter support)
- `accelerate` >= 0.27.2 (model loading utilities)
Credentials
- `HF_TOKEN` (optional): Required for serving gated models (e.g., Llama-2).
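To serve a gated model, export the token before launching the server (the value shown is a placeholder):

```shell
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx  # your Hugging Face access token
```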
Quick Install
pip install petals
# speedtest-cli is included as a dependency
# CUDA drivers must be installed separately via system package manager
Code Evidence
Device auto-detection from `src/petals/server/server.py:160-170`:
if device is None:
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
device = torch.device(device)
if device.type == "cuda" and device.index is None:
    device = torch.device(device.type, index=0)
GPU requirement for automatic block count from `src/petals/server/server.py:276-279`:
assert self.device.type in ("cuda", "mps"), (
    "GPU is not available. If you want to run a CPU-only server, please specify --num_blocks. "
    "CPU-only servers in the public swarm are discouraged since they are much slower"
)
float16 not supported on CPU from `src/petals/server/server.py:173-176`:
if device.type == "cpu" and torch_dtype == torch.float16:
    raise ValueError(
        f"Type float16 is not supported on CPU. Please use --torch_dtype float32 or --torch_dtype bfloat16"
    )
bfloat16 not supported on MPS from `src/petals/server/server.py:177-179`:
if device.type == "mps" and torch_dtype == torch.bfloat16:
    logger.warning(f"Type bfloat16 is not supported on MPS, using float16 instead")
    torch_dtype = torch.float16
speedtest-cli requirement from `src/petals/server/throughput.py:23-34`:
try:
    import speedtest
except ImportError:
    raise ImportError("Please `pip install speedtest-cli==2.1.3`")

if not hasattr(speedtest, "Speedtest"):
    raise ImportError(
        "You are using the wrong speedtest module. Please replace speedtest with speedtest-cli.\n"
        "To do that, run `pip uninstall -y speedtest`. ..."
    )
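The guard above follows a common pattern: import the module, then verify it exposes the expected API (catching the name collision between `speedtest` and `speedtest-cli`). A generic, standalone sketch of the same pattern; the helper name `require_module` is ours, not from Petals:

```python
import importlib

def require_module(name: str, attr: str, hint: str):
    """Import `name` and verify it exposes `attr`, mirroring the speedtest guard."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        raise ImportError(hint)
    if not hasattr(module, attr):
        raise ImportError(f"Module {name!r} lacks {attr!r}. {hint}")
    return module

# The stdlib `json` module stands in for speedtest here.
json_mod = require_module("json", "loads", "Please `pip install ...`")
```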
Tensor parallelism GPU balance check from `src/petals/server/server.py:288-293`:
if max(memory_per_device) / min(memory_per_device) > 1.5:
    raise ValueError(
        "GPU devices have highly uneven memory, which makes tensor parallelism inefficient. "
        "Please launch individual servers on each GPU or set --num_blocks manually"
    )
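The same ratio test can be reproduced in isolation. A standalone sketch; `check_gpu_balance` is our name, and memory values are in arbitrary consistent units (e.g. GiB):

```python
def check_gpu_balance(memory_per_device, max_ratio=1.5):
    """Return the max/min VRAM ratio, raising if tensor parallelism would be inefficient."""
    ratio = max(memory_per_device) / min(memory_per_device)
    if ratio > max_ratio:
        raise ValueError(
            "GPU devices have highly uneven memory; "
            "launch individual servers per GPU or set --num_blocks manually"
        )
    return ratio

print(check_gpu_balance([16, 16]))  # 1.0 -- a balanced pair is accepted
```

Note that a pair at exactly the 1.5x boundary (e.g. 24 GiB and 16 GiB) still passes, since the check uses a strict `>` comparison.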
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please pip install speedtest-cli==2.1.3` | speedtest-cli not installed | `pip install speedtest-cli==2.1.3` |
| `You are using the wrong speedtest module` | Wrong `speedtest` package installed | `pip uninstall -y speedtest && pip install speedtest-cli==2.1.3` |
| `ValueError: Type float16 is not supported on CPU` | Running on CPU with float16 dtype | Use `--torch_dtype float32` or `--torch_dtype bfloat16` |
| `AssertionError: GPU is not available` | No GPU detected, --num_blocks not specified | Specify `--num_blocks` manually for CPU servers |
| `AssertionError: Your GPU does not have enough memory to serve at least one block` | GPU VRAM too small for model | Use a larger GPU or try a smaller model |
| `GPU devices have highly uneven memory` | Tensor parallelism with mismatched GPUs (>1.5x ratio) | Run separate servers per GPU or set `--num_blocks` manually |
Compatibility Notes
- CUDA (NVIDIA): Full support including NF4/INT8 quantization, tensor parallelism, and CUDA graphs. NF4 quantization is the default on CUDA devices.
- MPS (Apple Silicon): Supported as fallback. bfloat16 automatically downgraded to float16. No quantization support. No tensor parallelism.
- CPU: Functional but discouraged for public swarm. float16 not supported; use float32 or bfloat16. Must specify `--num_blocks` manually.
- Tensor Parallelism: Only tested with the BLOOM model. GPUs must have similar memory (within a 1.5x ratio). A warning is issued for GPUs with differing compute capabilities.
- Multi-GPU: Tensor parallelism is available only on CUDA; exceeding the 5% memory-waste threshold triggers a warning.
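The device/dtype rules above can be summarized in a single helper. This is a sketch using plain strings for readability; Petals itself works with `torch.device` and torch dtypes, and the function name is ours:

```python
def resolve_dtype(device_type: str, torch_dtype: str) -> str:
    """Apply the device/dtype compatibility rules described above."""
    if device_type == "cpu" and torch_dtype == "float16":
        raise ValueError("float16 is not supported on CPU; use float32 or bfloat16")
    if device_type == "mps" and torch_dtype == "bfloat16":
        return "float16"  # MPS lacks bfloat16; downgraded with a warning
    return torch_dtype

print(resolve_dtype("mps", "bfloat16"))  # float16
```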