Implementation:Bigscience workshop Petals Run Server Main
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure, CLI |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for launching a Petals server from the command line, provided by the Petals CLI module.
Description
petals.cli.run_server.main() is the CLI entry point that parses arguments and creates a Server instance. It uses configargparse for argument parsing (supporting both CLI flags and config files).
Key argument groups:
- Model: converted_model_name_or_path (positional), --token
- Resources: --num_blocks, --block_indices, --torch_dtype, --quant_type, --tensor_parallel_devices
- Network: --port, --public_ip, --initial_peers, --new_swarm
- Performance: --throughput (auto/eval/dry_run/float), --num_handlers, --max_batch_size
- Timeouts: --request_timeout, --session_timeout, --step_timeout
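The argument groups above can be sketched with stdlib argparse (the real main() uses configargparse, which layers config-file support on top of the same API; defaults and help strings here are illustrative, and the actual parser defines many more flags):

```python
import argparse

# Illustrative re-creation of the argument groups listed above;
# not the library's actual parser definition.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="petals.cli.run_server")
    parser.add_argument("converted_model_name_or_path",
                        help="HF model repo name (positional)")
    parser.add_argument("--token", help="Hugging Face access token")
    parser.add_argument("--num_blocks", type=int,
                        help="number of transformer blocks to serve")
    parser.add_argument("--block_indices",
                        help='specific block range, e.g. "0:18"')
    parser.add_argument("--port", type=int, help="listening port")
    parser.add_argument("--initial_peers", nargs="*", default=[],
                        help="DHT bootstrap peers")
    parser.add_argument("--new_swarm", action="store_true",
                        help="start a private swarm")
    parser.add_argument("--throughput", default="auto",
                        help='"auto"/"eval"/"dry_run" or a float')
    parser.add_argument("--quant_type", choices=["none", "int8", "nf4"],
                        help="weight quantization")
    return parser

args = build_parser().parse_args(
    ["petals-team/StableBeluga2", "--block_indices", "0:18", "--quant_type", "nf4"]
)
assert args.block_indices == "0:18" and args.quant_type == "nf4"
```

configargparse keeps this argparse-compatible surface while also reading the same options from a config file, which is why the CLI supports both styles.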
Usage
Invoke via python -m petals.cli.run_server MODEL_NAME, or via Docker with docker run learningathome/petals:main .... main() blocks until the server shuts down (on KeyboardInterrupt or a termination signal).
Code Reference
Source Location
- Repository: petals
- File: src/petals/cli/run_server.py (L19-235)
Signature
def main():
"""
CLI entry point for launching a Petals server.
Key arguments (via configargparse):
converted_model_name_or_path (str): HF model repo name (positional)
--num_blocks (int): Number of transformer blocks to serve
--block_indices (str): Specific block range, e.g. "0:18"
--port (int): Listening port
--public_ip (str): Public IPv4 address
--initial_peers (List[str]): DHT bootstrap peers
--new_swarm (bool): Start a private swarm
--throughput (str|float): "auto"/"eval"/"dry_run" or RPS float
--torch_dtype (str): "auto"/"float16"/"float32"/"bfloat16"
--quant_type (str): "none"/"int8"/"nf4"
--tensor_parallel_devices (List[str]): Multi-GPU device list
--num_handlers (int): P2P handler processes (default 8)
--adapters (List[str]): LoRA adapters to pre-load
"""
Import
# CLI invocation:
# python -m petals.cli.run_server petals-team/StableBeluga2
# Or programmatically:
from petals.cli.run_server import main
main()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | HuggingFace model repository name (positional argument) |
| --num_blocks | int | No | Number of blocks to serve (auto-detected if not specified) |
| --initial_peers | List[str] | No | DHT bootstrap peers (defaults to public swarm) |
| --throughput | str or float | No | Throughput mode: "auto" benchmarks (reusing a cached result), "eval" re-benchmarks, "dry_run" benchmarks and exits; a float sets the value directly |
| --torch_dtype | str | No | Weight data type (default "auto") |
| --quant_type | str | No | Quantization type: "none", "int8", or "nf4" |
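Because --throughput mixes keywords and numbers, the value needs normalizing after parsing. A minimal sketch (helper name is hypothetical; the actual handling in run_server.py may differ):

```python
def parse_throughput(value: str):
    # Hypothetical normalizer for a mixed keyword/float flag:
    # keywords select a benchmarking mode, anything else must be a number
    if value in {"auto", "eval", "dry_run"}:
        return value           # special benchmarking modes
    return float(value)        # explicit throughput figure

assert parse_throughput("dry_run") == "dry_run"
assert parse_throughput("12.5") == 12.5
```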
Outputs
| Name | Type | Description |
|---|---|---|
| server | Server | Running Server instance; main() does not return a value, it blocks until shutdown |
| DHT announcements | dict | Server blocks announced as ONLINE in the hivemind DHT |
Usage Examples
Basic Server Launch
# Serve blocks from StableBeluga2 (auto-detects GPU memory and block count)
python -m petals.cli.run_server petals-team/StableBeluga2
# Serve specific blocks with NF4 quantization
python -m petals.cli.run_server petals-team/StableBeluga2 \
--block_indices 0:18 \
--quant_type nf4
# Multi-GPU tensor parallelism
python -m petals.cli.run_server petals-team/StableBeluga2 \
--tensor_parallel_devices cuda:0 cuda:1
Docker Launch
docker run -p 31330:31330 --ipc host --gpus all \
--volume petals-cache:/cache --rm \
learningathome/petals:main \
python -m petals.cli.run_server petals-team/StableBeluga2
Related Pages
Implements Principle
Requires Environment