
Implementation:Bigscience workshop Petals Server Init

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Infrastructure, Resource_Management
Last Updated 2026-02-09 14:00 GMT

Overview

A concrete tool for initializing a Petals server, covering resource estimation, throughput benchmarking, and DHT connectivity, provided by the Petals server module.

Description

Server.__init__ performs comprehensive initialization:

  1. DHT setup: Creates or connects to hivemind DHT with initial peers
  2. Model config: Downloads model configuration from HuggingFace
  3. Resource detection: Resolves dtype, quantization type, device placement
  4. Block count: Auto-calculates from GPU memory if not specified via _choose_num_blocks()
  5. Throughput: Benchmarks via get_server_throughput() (inference RPS, forward RPS, network RPS)
  6. KV cache: Calculates attn_cache_bytes based on model architecture
  7. Server info: Prepares ServerInfo and ModelInfo for DHT announcements
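Step 4 above (auto-calculating the block count from GPU memory) can be sketched roughly as follows. This is a hedged illustration, not the actual `_choose_num_blocks()` implementation: the helper name, the 1 GiB safety reserve, and the example sizes are all assumptions for illustration.

```python
# Illustrative sketch of block-count auto-detection (NOT the real
# petals._choose_num_blocks): fit as many transformer blocks as GPU
# memory allows, keeping room for each block's KV cache plus a reserve.

def choose_num_blocks(gpu_memory_bytes: int, block_size_bytes: int,
                      attn_cache_bytes_per_block: int,
                      reserve_bytes: int = 1 << 30) -> int:
    """Return how many blocks fit in GPU memory (at least 1)."""
    usable = gpu_memory_bytes - reserve_bytes
    per_block = block_size_bytes + attn_cache_bytes_per_block
    return max(1, usable // per_block)

# Example: 24 GiB GPU, ~1.5 GiB per block of fp16 weights,
# ~0.5 GiB of KV cache per block -> 11 blocks fit.
n = choose_num_blocks(24 * 2**30, int(1.5 * 2**30), 2**29)  # -> 11
```

The real implementation additionally accounts for dtype, quantization, and tensor-parallel sharding when sizing each block.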

Usage

This constructor is called automatically by main() in the CLI, but the class can also be instantiated programmatically for embedded server use. All parameters have sensible defaults except initial_peers, dht_prefix, converted_model_name_or_path, and throughput.

Code Reference

Source Location

  • Repository: petals
  • File: src/petals/server/server.py (L52-273)
  • File: src/petals/server/throughput.py (L37-108, get_server_throughput)
  • File: src/petals/server/block_utils.py (L12-53, resolve_block_dtype, get_block_size)

Signature

class Server:
    def __init__(
        self,
        *,
        initial_peers: List[str],
        dht_prefix: Optional[str],
        converted_model_name_or_path: str,
        public_name: Optional[str] = None,
        throughput: Union[float, str],
        num_blocks: Optional[int] = None,
        block_indices: Optional[str] = None,
        num_handlers: int = 8,
        inference_max_length: Optional[int] = None,
        min_batch_size: int = 1,
        max_batch_size: Optional[int] = None,
        max_chunk_size_bytes: int = 256 * 1024 * 1024,
        max_alloc_timeout: float = 600,
        attn_cache_tokens: Optional[int] = None,
        torch_dtype: str = "auto",
        revision: Optional[str] = None,
        cache_dir: Optional[str] = None,
        max_disk_space: Optional[int] = None,
        device: Optional[Union[str, torch.device]] = None,
        compression=CompressionType.NONE,
        stats_report_interval: Optional[int] = None,
        update_period: float = 60,
        expiration: Optional[float] = None,
        request_timeout: float = 3 * 60,
        session_timeout: float = 30 * 60,
        step_timeout: float = 5 * 60,
        balance_quality: float = 0.75,
        mean_balance_check_period: float = 120,
        mean_block_selection_delay: float = 5,
        token: Optional[Union[str, bool]] = None,
        quant_type: Optional[QuantType] = None,
        tensor_parallel_devices: Optional[Sequence[torch.device]] = None,
        skip_reachability_check: bool = False,
        adapters: Sequence[str] = (),
        **kwargs,
    ):
        """
        Initialize Petals server with all configuration.

        Args:
            initial_peers: DHT bootstrap peer addresses
            converted_model_name_or_path: HuggingFace model name
            throughput: "auto" for benchmarking, or float RPS
            num_blocks: Blocks to serve (auto if None)
            torch_dtype: "auto"/"float16"/"bfloat16"
            quant_type: Quantization (None=auto, "nf4", "int8", "none")
            balance_quality: Rebalancing threshold (0-1)
        """

Import

from petals.server.server import Server

I/O Contract

Inputs

Name | Type | Required | Description
initial_peers | List[str] | Yes | DHT bootstrap peer multiaddresses
converted_model_name_or_path | str | Yes | HuggingFace model name
throughput | Union[float, str] | Yes | "auto" for benchmarking, or a float RPS value
num_blocks | Optional[int] | No | Blocks to serve; auto-detected if None
torch_dtype | str | No | Weight dtype (default "auto")
quant_type | Optional[QuantType] | No | Quantization type (NF4 default on CUDA)
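The quant_type default described above (NF4 on CUDA, no quantization otherwise) can be sketched as follows. The enum values and helper name here are illustrative stand-ins, not the actual Petals internals:

```python
# Hedged sketch of the quant_type default: NF4 when running on CUDA,
# no quantization otherwise. The enum below mimics petals' QuantType
# but is a local stand-in for illustration.
from enum import Enum
from typing import Optional

class QuantType(Enum):
    NONE = 0
    INT8 = 1
    NF4 = 2

def resolve_quant_type(quant_type: Optional[QuantType], device_type: str) -> QuantType:
    """Apply the documented default when the user passes quant_type=None."""
    if quant_type is None:
        return QuantType.NF4 if device_type == "cuda" else QuantType.NONE
    return quant_type

resolve_quant_type(None, "cuda")  # QuantType.NF4
resolve_quant_type(None, "cpu")   # QuantType.NONE
```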

Outputs

Name | Type | Description
server | Server | Fully configured server, ready to call server.run()
server.dht | DHT | Connected DHT instance
server.throughput_info | Dict[str, float] | Benchmarked throughput: inference_rps, forward_rps, network_rps
server.num_blocks | int | Determined number of blocks to serve
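A server advertising the throughput metrics above is ultimately bottlenecked by the slower of compute and network. Whether Petals combines the metrics with exactly min() is an assumption here; this sketch only illustrates the idea:

```python
# Hedged sketch (assumption, not the verified Petals formula): the
# effective advertised throughput is capped by whichever of compute
# (forward_rps) and network (network_rps) is slower.
def effective_throughput(info: dict) -> float:
    return min(info["forward_rps"], info["network_rps"])

effective_throughput(
    {"inference_rps": 25.0, "forward_rps": 800.0, "network_rps": 300.0}
)  # -> 300.0, network-bound
```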

Usage Examples

Programmatic Server Creation

from petals.server.server import Server
from petals.constants import PUBLIC_INITIAL_PEERS

server = Server(
    initial_peers=PUBLIC_INITIAL_PEERS,
    dht_prefix=None,
    converted_model_name_or_path="petals-team/StableBeluga2",
    throughput="auto",
    num_blocks=None,  # Auto-detect from GPU memory
    torch_dtype="auto",
)

# Server is now configured with:
# - DHT connection established
# - Throughput benchmarked
# - Block count determined
# - KV cache sized

server.run()  # Start serving (blocks until shutdown)
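Because run() blocks until shutdown, embedding the server in a larger process typically means hosting it in a background thread. The sketch below uses a DummyServer stand-in so it is self-contained; the shutdown() method name follows the hivemind/Petals convention but should be treated as an assumption:

```python
# Hedged sketch of an embedded-server lifecycle. DummyServer stands in
# for petals.server.server.Server so the example runs standalone.
import threading

class DummyServer:
    def __init__(self):
        self._stop = threading.Event()

    def run(self):
        # Blocks until shutdown() is called, like Server.run().
        self._stop.wait()

    def shutdown(self):
        self._stop.set()

server = DummyServer()
t = threading.Thread(target=server.run, daemon=True)
t.start()
# ... serve requests while the server runs in the background ...
server.shutdown()
t.join(timeout=5)
```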

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
