Principle: BigScience Workshop Petals Server Configuration
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure, Resource_Management |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
The process of configuring a Petals server by determining GPU resources, estimating throughput, resolving data types, and establishing DHT connectivity before serving transformer blocks.
Description
Server Configuration encompasses the initialization phase where a Petals server determines its operational parameters:
Resource estimation:
- Block count: If not specified, automatically calculated from available GPU memory by estimating per-block memory requirements (weights + KV cache)
- Throughput: Benchmarked by running sample inference/forward passes and measuring network speed (via speedtest-cli)
- Data type: Resolved from model config (float16/bfloat16) based on hardware capabilities
- Quantization: NF4 is default on CUDA for memory efficiency; INT8 available for better quality
Network setup:
- DHT connection: Connects to the hivemind Kademlia DHT using initial peers
- Reachability check: Validates that the server is reachable from the network
- Port configuration: Sets up P2P listening endpoints
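The listening endpoints above are libp2p-style multiaddrs, as used by hivemind. A minimal sketch of building them (the helper name, QUIC variant, and defaults are assumptions for illustration):

```python
# Hedged sketch: constructing P2P listening endpoints as multiaddrs.
# The address format mirrors libp2p/hivemind conventions; the function
# itself is hypothetical.

def make_listen_maddrs(port: int, host: str = "0.0.0.0") -> list[str]:
    """Return TCP and QUIC listening multiaddrs for the given host and port."""
    if not (0 < port < 65536):
        raise ValueError(f"invalid port: {port}")
    return [
        f"/ip4/{host}/tcp/{port}",
        f"/ip4/{host}/udp/{port}/quic",
    ]
```

The reachability check then verifies that at least one announced address is dialable from outside the operator's network.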
KV cache sizing:
- Calculates attn_cache_bytes based on model architecture (MQA vs MHA)
- Default: 16384 tokens for Multi-Query Attention, 4096 for standard MHA
Usage
This principle is applied automatically during Server.__init__. Server operators can override auto-detected values via CLI flags for fine-grained control. Understanding this process helps operators optimize their server contribution.
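As an example of such overrides, an operator might pin the block count, quantization, and cache size explicitly. The invocation below is a sketch: the model name and values are illustrative, and flag names should be checked against the installed Petals version.

```shell
# Illustrative server launch; exact flags may vary across Petals versions.
python -m petals.cli.run_server bigscience/bloom-560m \
    --num_blocks 8 \
    --quant_type nf4 \
    --torch_dtype bfloat16 \
    --attn_cache_tokens 16384
```

Any flag left unset falls back to the auto-detection described above.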
Theoretical Basis
Memory budget calculation:
```python
# Abstract resource estimation
block_memory = get_block_size(config, dtype, quant_type)       # per-block weight memory
cache_per_token = 2 * num_layers * hidden_size * dtype_size    # KV cache per token
total_cache = cache_per_token * attn_cache_tokens

available_memory = torch.cuda.get_device_properties(0).total_memory
usable_memory = available_memory * 0.9  # safety margin

if num_blocks is None:
    num_blocks = int((usable_memory - total_cache) // block_memory)
```
Throughput estimation:
- Inference RPS: Single-token forward pass throughput
- Forward RPS: Multi-token forward pass throughput
- Network RPS: Hidden state transfer rate (bandwidth / hidden_state_size)
- Effective throughput: min(compute_rps, network_rps)
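The bullets above can be condensed into one function. This is a hedged sketch, assuming each request transfers a single hidden state of hidden_size values at dtype_size bytes each; the parameter names are illustrative, not Petals' internals:

```python
# Hedged sketch of effective-throughput estimation: the server is
# bottlenecked by the slower of compute and network.

def effective_rps(compute_rps: float, bandwidth_bytes_per_s: float,
                  hidden_size: int, dtype_size: int = 2) -> float:
    """Effective requests/second as min(compute RPS, network RPS)."""
    hidden_state_bytes = hidden_size * dtype_size
    network_rps = bandwidth_bytes_per_s / hidden_state_bytes
    return min(compute_rps, network_rps)
```

For instance, with a 100 MB/s link and hidden size 4096 in float16, network RPS is about 12,207, so a server computing 50,000 single-token passes per second is network-bound at roughly 12,207 effective RPS.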