Implementation: BigScience Workshop Petals Server Init
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure, Resource_Management |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for initializing a Petals server with resource estimation, throughput benchmarking, and DHT connectivity, provided by the Petals server module.
Description
Server.__init__ performs comprehensive initialization:
- DHT setup: Creates or connects to hivemind DHT with initial peers
- Model config: Downloads model configuration from HuggingFace
- Resource detection: Resolves dtype, quantization type, device placement
- Block count: Auto-calculates from GPU memory if not specified via _choose_num_blocks()
- Throughput: Benchmarks via get_server_throughput() (inference RPS, forward RPS, network RPS)
- KV cache: Calculates attn_cache_bytes based on model architecture
- Server info: Prepares ServerInfo and ModelInfo for DHT announcements
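The block-count step above can be illustrated with a simplified, hypothetical version of the memory heuristic. The real `_choose_num_blocks()` also accounts for autograd buffers and tensor-parallel devices; the function name, headroom constant, and per-block numbers below are illustrative assumptions, not the actual implementation:

```python
def choose_num_blocks(gpu_memory_bytes: int, block_size_bytes: int,
                      cache_bytes_per_block: int, total_blocks: int) -> int:
    """Estimate how many transformer blocks fit in GPU memory.

    Simplified sketch: reserve a safety margin, then divide the remaining
    memory by the per-block footprint (weights + KV cache).
    """
    headroom = 0.05  # illustrative safety margin, not the real constant
    usable = int(gpu_memory_bytes * (1 - headroom))
    per_block = block_size_bytes + cache_bytes_per_block
    return max(1, min(total_blocks, usable // per_block))

# Example: 24 GiB GPU, 0.5 GiB of weights per block, 0.25 GiB cache per
# block, 80-block model -> 30 blocks fit
n = choose_num_blocks(24 * 1024**3, 512 * 1024**2, 256 * 1024**2, 80)
```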
Usage
Server.__init__ is called automatically by main() in the CLI; the class can also be instantiated programmatically for embedded server use. All parameters have sensible defaults except initial_peers, dht_prefix, converted_model_name_or_path, and throughput, which must be supplied by the caller.
Code Reference
Source Location
- Repository: petals
- File: src/petals/server/server.py (L52-273)
- File: src/petals/server/throughput.py (L37-108, get_server_throughput)
- File: src/petals/server/block_utils.py (L12-53, resolve_block_dtype, get_block_size)
Signature
class Server:
    def __init__(
        self,
        *,
        initial_peers: List[str],
        dht_prefix: Optional[str],
        converted_model_name_or_path: str,
        public_name: Optional[str] = None,
        throughput: Union[float, str],
        num_blocks: Optional[int] = None,
        block_indices: Optional[str] = None,
        num_handlers: int = 8,
        inference_max_length: Optional[int] = None,
        min_batch_size: int = 1,
        max_batch_size: Optional[int] = None,
        max_chunk_size_bytes: int = 256 * 1024 * 1024,
        max_alloc_timeout: float = 600,
        attn_cache_tokens: Optional[int] = None,
        torch_dtype: str = "auto",
        revision: Optional[str] = None,
        cache_dir: Optional[str] = None,
        max_disk_space: Optional[int] = None,
        device: Optional[Union[str, torch.device]] = None,
        compression=CompressionType.NONE,
        stats_report_interval: Optional[int] = None,
        update_period: float = 60,
        expiration: Optional[float] = None,
        request_timeout: float = 3 * 60,
        session_timeout: float = 30 * 60,
        step_timeout: float = 5 * 60,
        balance_quality: float = 0.75,
        mean_balance_check_period: float = 120,
        mean_block_selection_delay: float = 5,
        token: Optional[Union[str, bool]] = None,
        quant_type: Optional[QuantType] = None,
        tensor_parallel_devices: Optional[Sequence[torch.device]] = None,
        skip_reachability_check: bool = False,
        adapters: Sequence[str] = (),
        **kwargs,
    ):
        """
        Initialize Petals server with all configuration.

        Args:
            initial_peers: DHT bootstrap peer addresses
            converted_model_name_or_path: HuggingFace model name
            throughput: "auto" for benchmarking, or float RPS
            num_blocks: Blocks to serve (auto if None)
            torch_dtype: "auto"/"float16"/"bfloat16"
            quant_type: Quantization (None=auto, "nf4", "int8", "none")
            balance_quality: Rebalancing threshold (0-1)
        """
Import
from petals.server.server import Server
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| initial_peers | List[str] | Yes | DHT bootstrap peer multiaddresses |
| converted_model_name_or_path | str | Yes | HuggingFace model name |
| throughput | Union[float, str] | Yes | "auto" for benchmarking or float RPS value |
| num_blocks | Optional[int] | No | Blocks to serve; auto-detected if None |
| torch_dtype | str | No | Weight dtype (default "auto") |
| quant_type | Optional[QuantType] | No | Quantization type (NF4 default on CUDA) |
Outputs
| Name | Type | Description |
|---|---|---|
| server | Server | Fully configured server ready to call server.run() |
| server.dht | DHT | Connected DHT instance |
| server.throughput_info | Dict[str, float] | Benchmarked throughput: inference_rps, forward_rps, network_rps |
| server.num_blocks | int | Determined number of blocks to serve |
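A server's effective serving rate is bounded by whichever path is slower, compute or network. As an illustration only, a hedged sketch of how the benchmarked numbers from throughput_info might be combined (the actual aggregation in throughput.py may differ):

```python
def effective_throughput(info: dict[str, float]) -> float:
    """The server can serve no faster than its slowest bottleneck:
    either the compute path (forward passes) or the network path."""
    return min(info["forward_rps"], info["network_rps"])

# Illustrative numbers only: this server would be network-bound
info = {"inference_rps": 850.0, "forward_rps": 62000.0, "network_rps": 1200.0}
rate = effective_throughput(info)
```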
Usage Examples
Programmatic Server Creation
from petals.server.server import Server
from petals.constants import PUBLIC_INITIAL_PEERS
server = Server(
    initial_peers=PUBLIC_INITIAL_PEERS,
    dht_prefix=None,
    converted_model_name_or_path="petals-team/StableBeluga2",
    throughput="auto",
    num_blocks=None,  # Auto-detect from GPU memory
    torch_dtype="auto",
)
# Server is now configured with:
# - DHT connection established
# - Throughput benchmarked
# - Block count determined
# - KV cache sized
server.run() # Start serving (blocks until shutdown)
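The KV-cache sizing mentioned in the Description can be approximated with standard transformer arithmetic. This is a sketch under simplifying assumptions (one key plus one value vector of hidden_size per cached token per block); the real attn_cache_bytes calculation accounts for architecture details such as multi-query attention, so treat the function below as illustrative:

```python
def approx_attn_cache_bytes(attn_cache_tokens: int, hidden_size: int,
                            dtype_bytes: int = 2) -> int:
    """Approximate per-block KV cache size: 2 tensors (keys and values),
    each holding attn_cache_tokens x hidden_size values of dtype_bytes."""
    return 2 * attn_cache_tokens * hidden_size * dtype_bytes

# Example: 8192 cached tokens, hidden size 8192, float16 (2 bytes)
# -> 268435456 bytes (256 MiB) per block
cache_bytes = approx_attn_cache_bytes(8192, 8192, 2)
```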
Related Pages
- Implements Principle
- Requires Environment
- Uses Heuristic