
Implementation:Bigscience workshop Petals Run Server Main

From Leeroopedia


Knowledge Sources
Domains: Distributed_Computing, Infrastructure, CLI
Last Updated: 2026-02-09 14:00 GMT

Overview

A concrete tool for launching a Petals server from the command line, provided by the petals.cli module.

Description

petals.cli.run_server.main() is the CLI entry point that parses arguments and creates a Server instance. It uses configargparse for argument parsing (supporting both CLI flags and config files).

Key argument groups:

  • Model: converted_model_name_or_path (positional), --token
  • Resources: --num_blocks, --block_indices, --torch_dtype, --quant_type, --tensor_parallel_devices
  • Network: --port, --public_ip, --initial_peers, --new_swarm
  • Performance: --throughput (auto/eval/dry_run/float), --num_handlers, --max_batch_size
  • Timeouts: --request_timeout, --session_timeout, --step_timeout
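
The configargparse pattern described above can be illustrated with a minimal sketch. The flags below are a small subset chosen for illustration; the full argument list lives in src/petals/cli/run_server.py.

# Minimal sketch of the configargparse pattern (illustrative subset of flags):
import configargparse

parser = configargparse.ArgParser()
parser.add("-c", "--config", is_config_file=True,
           help="optional config file; any flag below may also be set there")
parser.add("converted_model_name_or_path",
           help="HF model repo name (positional)")
parser.add("--num_blocks", type=int, default=None,
           help="number of transformer blocks to serve")
parser.add("--port", type=int, default=None,
           help="listening port")

args = parser.parse_args()
# Precedence: CLI flags override config-file values, which override the defaults above.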

Usage

Invoke it via python -m petals.cli.run_server MODEL_NAME or docker run learningathome/petals:main .... The function blocks until the server is shut down (by a KeyboardInterrupt or a termination signal).
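
The blocking behavior follows a common run-until-interrupted shape. The sketch below is a simplification under assumed wiring (a server object exposing run() and shutdown()), not the exact source:

import logging

logger = logging.getLogger(__name__)

def run_until_shutdown(server):
    # server is assumed to expose run()/shutdown(), like a Petals Server
    try:
        server.run()  # blocks until KeyboardInterrupt or a termination signal
    except KeyboardInterrupt:
        logger.info("Caught KeyboardInterrupt, shutting down")
    finally:
        server.shutdown()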

Code Reference

Source Location

  • Repository: petals
  • File: src/petals/cli/run_server.py (L19-235)

Signature

def main():
    """
    CLI entry point for launching a Petals server.

    Key arguments (via configargparse):
        converted_model_name_or_path (str): HF model repo name (positional)
        --num_blocks (int): Number of transformer blocks to serve
        --block_indices (str): Specific block range, e.g. "0:18"
        --port (int): Listening port
        --public_ip (str): Public IPv4 address
        --initial_peers (List[str]): DHT bootstrap peers
        --new_swarm (bool): Start a private swarm
        --throughput (str|float): "auto"/"eval"/"dry_run" or RPS float
        --torch_dtype (str): "auto"/"float16"/"float32"/"bfloat16"
        --quant_type (str): "none"/"int8"/"nf4"
        --tensor_parallel_devices (List[str]): Multi-GPU device list
        --num_handlers (int): P2P handler processes (default 8)
        --adapters (List[str]): LoRA adapters to pre-load
    """

Import

# CLI invocation:
# python -m petals.cli.run_server petals-team/StableBeluga2
# Or programmatically:
from petals.cli.run_server import main
main()
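
Because main() blocks, a caller that needs to keep running can launch the server out-of-process instead. A minimal subprocess sketch (the model name is reused from the examples below):

import subprocess

proc = subprocess.Popen(
    ["python", "-m", "petals.cli.run_server", "petals-team/StableBeluga2"]
)
# ... do other work while the server runs ...
proc.terminate()  # SIGTERM triggers the server's shutdown path
proc.wait()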

I/O Contract

Inputs

Name | Type | Required | Description
converted_model_name_or_path | str | Yes | HuggingFace model repository name (positional argument)
--num_blocks | int | No | Number of blocks to serve (auto-detected if not specified)
--initial_peers | List[str] | No | DHT bootstrap peers (defaults to the public swarm)
--throughput | str or float | No | "auto" benchmarks, "eval" evaluates, a float sets requests per second directly
--torch_dtype | str | No | Weight data type (default "auto")
--quant_type | str | No | Quantization type: "none", "int8", or "nf4"
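
The --throughput values map to modes as described above. A hedged sketch of the coercion logic follows; it is assumed from the accepted values, not copied from the source:

def parse_throughput(value: str):
    # keyword modes: benchmark ("auto"), measure ("eval"), or skip ("dry_run")
    if value in ("auto", "eval", "dry_run"):
        return value
    return float(value)  # otherwise treat the value as requests per second

# parse_throughput("auto") -> "auto"; parse_throughput("3.5") -> 3.5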

Outputs

Name | Type | Description
server | Server | Running server instance (blocks until shutdown)
DHT announcements | dict | Server blocks announced as ONLINE in the hivemind DHT

Usage Examples

Basic Server Launch

# Serve blocks from StableBeluga2 (auto-detects GPU memory and block count)
python -m petals.cli.run_server petals-team/StableBeluga2

# Serve specific blocks with NF4 quantization
python -m petals.cli.run_server petals-team/StableBeluga2 \
    --block_indices 0:18 \
    --quant_type nf4

# Multi-GPU tensor parallelism
python -m petals.cli.run_server petals-team/StableBeluga2 \
    --tensor_parallel_devices cuda:0 cuda:1

Docker Launch

docker run -p 31330:31330 --ipc host --gpus all \
    --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server petals-team/StableBeluga2
