
Implementation:Bigscience workshop Petals Server Init

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Infrastructure, Resource_Management
Last Updated 2026-02-09 14:00 GMT

Overview

A concrete tool for initializing a Petals server, covering resource estimation, throughput benchmarking, and DHT connectivity, provided by the Petals server module.

Description

Server.__init__ performs comprehensive initialization:

  1. DHT setup: Creates or connects to hivemind DHT with initial peers
  2. Model config: Downloads model configuration from HuggingFace
  3. Resource detection: Resolves dtype, quantization type, device placement
  4. Block count: Auto-calculates from GPU memory if not specified via _choose_num_blocks()
  5. Throughput: Benchmarks via get_server_throughput() (inference RPS, forward RPS, network RPS)
  6. KV cache: Calculates attn_cache_bytes based on model architecture
  7. Server info: Prepares ServerInfo and ModelInfo for DHT announcements
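Step 4 above (auto-calculating the block count from GPU memory) can be sketched roughly as follows. This is a hedged illustration, not the actual `_choose_num_blocks()` implementation: the helper name, the 1 GiB safety reserve, and the example sizes are all assumptions for illustration.

```python
# Illustrative sketch of block-count auto-detection (NOT the real
# petals._choose_num_blocks): fit as many transformer blocks as GPU
# memory allows, keeping room for each block's KV cache plus a reserve.

def choose_num_blocks(gpu_memory_bytes: int, block_size_bytes: int,
                      attn_cache_bytes_per_block: int,
                      reserve_bytes: int = 1 << 30) -> int:
    """Return how many blocks fit in GPU memory (at least 1)."""
    usable = gpu_memory_bytes - reserve_bytes
    per_block = block_size_bytes + attn_cache_bytes_per_block
    return max(1, usable // per_block)

# Example: 24 GiB GPU, ~1.5 GiB per block of fp16 weights,
# ~0.5 GiB of KV cache per block -> 11 blocks fit.
n = choose_num_blocks(24 * 2**30, int(1.5 * 2**30), 2**29)  # -> 11
```

The real implementation additionally accounts for dtype, quantization, and tensor-parallel sharding when sizing each block.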

Usage

This constructor is called automatically by main() in the CLI, but the class can also be instantiated programmatically for embedded server use. All parameters have sensible defaults except initial_peers, dht_prefix, converted_model_name_or_path, and throughput.

Code Reference

Source Location

  • Repository: petals
  • File: src/petals/server/server.py (L52-273)
  • File: src/petals/server/throughput.py (L37-108, get_server_throughput)
  • File: src/petals/server/block_utils.py (L12-53, resolve_block_dtype, get_block_size)

Signature

class Server:
    def __init__(
        self,
        *,
        initial_peers: List[str],
        dht_prefix: Optional[str],
        converted_model_name_or_path: str,
        public_name: Optional[str] = None,
        throughput: Union[float, str],
        num_blocks: Optional[int] = None,
        block_indices: Optional[str] = None,
        num_handlers: int = 8,
        inference_max_length: Optional[int] = None,
        min_batch_size: int = 1,
        max_batch_size: Optional[int] = None,
        max_chunk_size_bytes: int = 256 * 1024 * 1024,
        max_alloc_timeout: float = 600,
        attn_cache_tokens: Optional[int] = None,
        torch_dtype: str = "auto",
        revision: Optional[str] = None,
        cache_dir: Optional[str] = None,
        max_disk_space: Optional[int] = None,
        device: Optional[Union[str, torch.device]] = None,
        compression=CompressionType.NONE,
        stats_report_interval: Optional[int] = None,
        update_period: float = 60,
        expiration: Optional[float] = None,
        request_timeout: float = 3 * 60,
        session_timeout: float = 30 * 60,
        step_timeout: float = 5 * 60,
        balance_quality: float = 0.75,
        mean_balance_check_period: float = 120,
        mean_block_selection_delay: float = 5,
        token: Optional[Union[str, bool]] = None,
        quant_type: Optional[QuantType] = None,
        tensor_parallel_devices: Optional[Sequence[torch.device]] = None,
        skip_reachability_check: bool = False,
        adapters: Sequence[str] = (),
        **kwargs,
    ):
        """
        Initialize Petals server with all configuration.

        Args:
            initial_peers: DHT bootstrap peer addresses
            converted_model_name_or_path: HuggingFace model name
            throughput: "auto" for benchmarking, or float RPS
            num_blocks: Blocks to serve (auto if None)
            torch_dtype: "auto"/"float16"/"bfloat16"
            quant_type: Quantization (None=auto, "nf4", "int8", "none")
            balance_quality: Rebalancing threshold (0-1)
        """

Import

from petals.server.server import Server

I/O Contract

Inputs

Name | Type | Required | Description
initial_peers | List[str] | Yes | DHT bootstrap peer multiaddresses
converted_model_name_or_path | str | Yes | HuggingFace model name
throughput | Union[float, str] | Yes | "auto" for benchmarking, or a float RPS value
num_blocks | Optional[int] | No | Blocks to serve; auto-detected if None
torch_dtype | str | No | Weight dtype (default "auto")
quant_type | Optional[QuantType] | No | Quantization type (NF4 default on CUDA)
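The quant_type default described above (NF4 on CUDA, no quantization otherwise) can be sketched as follows. The enum values and helper name here are illustrative stand-ins, not the actual Petals internals:

```python
# Hedged sketch of the quant_type default: NF4 when running on CUDA,
# no quantization otherwise. The enum below mimics petals' QuantType
# but is a local stand-in for illustration.
from enum import Enum
from typing import Optional

class QuantType(Enum):
    NONE = 0
    INT8 = 1
    NF4 = 2

def resolve_quant_type(quant_type: Optional[QuantType], device_type: str) -> QuantType:
    """Apply the documented default when the user passes quant_type=None."""
    if quant_type is None:
        return QuantType.NF4 if device_type == "cuda" else QuantType.NONE
    return quant_type

resolve_quant_type(None, "cuda")  # QuantType.NF4
resolve_quant_type(None, "cpu")   # QuantType.NONE
```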

Outputs

Name | Type | Description
server | Server | Fully configured server, ready to call server.run()
server.dht | DHT | Connected DHT instance
server.throughput_info | Dict[str, float] | Benchmarked throughput: inference_rps, forward_rps, network_rps
server.num_blocks | int | Determined number of blocks to serve
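A server advertising the throughput metrics above is ultimately bottlenecked by the slower of compute and network. Whether Petals combines the metrics with exactly min() is an assumption here; this sketch only illustrates the idea:

```python
# Hedged sketch (assumption, not the verified Petals formula): the
# effective advertised throughput is capped by whichever of compute
# (forward_rps) and network (network_rps) is slower.
def effective_throughput(info: dict) -> float:
    return min(info["forward_rps"], info["network_rps"])

effective_throughput(
    {"inference_rps": 25.0, "forward_rps": 800.0, "network_rps": 300.0}
)  # -> 300.0, network-bound
```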

Usage Examples

Programmatic Server Creation

from petals.server.server import Server
from petals.constants import PUBLIC_INITIAL_PEERS

server = Server(
    initial_peers=PUBLIC_INITIAL_PEERS,
    dht_prefix=None,
    converted_model_name_or_path="petals-team/StableBeluga2",
    throughput="auto",
    num_blocks=None,  # Auto-detect from GPU memory
    torch_dtype="auto",
)

# Server is now configured with:
# - DHT connection established
# - Throughput benchmarked
# - Block count determined
# - KV cache sized

server.run()  # Start serving (blocks until shutdown)
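Because run() blocks until shutdown, embedding the server in a larger process typically means hosting it in a background thread. The sketch below uses a DummyServer stand-in so it is self-contained; the shutdown() method name follows the hivemind/Petals convention but should be treated as an assumption:

```python
# Hedged sketch of an embedded-server lifecycle. DummyServer stands in
# for petals.server.server.Server so the example runs standalone.
import threading

class DummyServer:
    def __init__(self):
        self._stop = threading.Event()

    def run(self):
        # Blocks until shutdown() is called, like Server.run().
        self._stop.wait()

    def shutdown(self):
        self._stop.set()

server = DummyServer()
t = threading.Thread(target=server.run, daemon=True)
t.start()
# ... serve requests while the server runs in the background ...
server.shutdown()
t.join(timeout=5)
```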

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
