Implementation:Bigscience workshop Petals Run Server Main
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure, CLI |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Concrete tool for launching a Petals server from the command line, provided by the Petals CLI module.
Description
petals.cli.run_server.main() is the CLI entry point that parses arguments and creates a Server instance. It uses configargparse for argument parsing (supporting both CLI flags and config files).
Key argument groups:
- Model: converted_model_name_or_path (positional), --token
- Resources: --num_blocks, --block_indices, --torch_dtype, --quant_type, --tensor_parallel_devices
- Network: --port, --public_ip, --initial_peers, --new_swarm
- Performance: --throughput (auto/eval/dry_run/float), --num_handlers, --max_batch_size
- Timeouts: --request_timeout, --session_timeout, --step_timeout
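The argument groups above can be sketched with stdlib argparse (the real main() uses configargparse, which layers config-file support on top of the same API; defaults and help strings here are illustrative, and the actual parser defines many more flags):

```python
import argparse

# Illustrative re-creation of the argument groups listed above;
# not the library's actual parser definition.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="petals.cli.run_server")
    parser.add_argument("converted_model_name_or_path",
                        help="HF model repo name (positional)")
    parser.add_argument("--token", help="Hugging Face access token")
    parser.add_argument("--num_blocks", type=int,
                        help="number of transformer blocks to serve")
    parser.add_argument("--block_indices",
                        help='specific block range, e.g. "0:18"')
    parser.add_argument("--port", type=int, help="listening port")
    parser.add_argument("--initial_peers", nargs="*", default=[],
                        help="DHT bootstrap peers")
    parser.add_argument("--new_swarm", action="store_true",
                        help="start a private swarm")
    parser.add_argument("--throughput", default="auto",
                        help='"auto"/"eval"/"dry_run" or a float')
    parser.add_argument("--quant_type", choices=["none", "int8", "nf4"],
                        help="weight quantization")
    return parser

args = build_parser().parse_args(
    ["petals-team/StableBeluga2", "--block_indices", "0:18", "--quant_type", "nf4"]
)
assert args.block_indices == "0:18" and args.quant_type == "nf4"
```

configargparse keeps this argparse-compatible surface while also reading the same options from a config file, which is why the CLI supports both styles.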
Usage
Invoke via python -m petals.cli.run_server MODEL_NAME, or via Docker with docker run learningathome/petals:main .... main() blocks until the server shuts down (on KeyboardInterrupt or a termination signal).
Code Reference
Source Location
- Repository: petals
- File: src/petals/cli/run_server.py (L19-235)
Signature
def main():
"""
CLI entry point for launching a Petals server.
Key arguments (via configargparse):
converted_model_name_or_path (str): HF model repo name (positional)
--num_blocks (int): Number of transformer blocks to serve
--block_indices (str): Specific block range, e.g. "0:18"
--port (int): Listening port
--public_ip (str): Public IPv4 address
--initial_peers (List[str]): DHT bootstrap peers
--new_swarm (bool): Start a private swarm
--throughput (str|float): "auto"/"eval"/"dry_run" or RPS float
--torch_dtype (str): "auto"/"float16"/"float32"/"bfloat16"
--quant_type (str): "none"/"int8"/"nf4"
--tensor_parallel_devices (List[str]): Multi-GPU device list
--num_handlers (int): P2P handler processes (default 8)
--adapters (List[str]): LoRA adapters to pre-load
"""
Import
# CLI invocation:
# python -m petals.cli.run_server petals-team/StableBeluga2
# Or programmatically:
from petals.cli.run_server import main
main()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | HuggingFace model repository name (positional argument) |
| --num_blocks | int | No | Number of blocks to serve (auto-detected if not specified) |
| --initial_peers | List[str] | No | DHT bootstrap peers (defaults to public swarm) |
| --throughput | str or float | No | Throughput mode: "auto" benchmarks (reusing a cached result), "eval" re-benchmarks, "dry_run" benchmarks and exits; a float sets the value directly |
| --torch_dtype | str | No | Weight data type (default "auto") |
| --quant_type | str | No | Quantization type: "none", "int8", or "nf4" |
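Because --throughput mixes keywords and numbers, the value needs normalizing after parsing. A minimal sketch (helper name is hypothetical; the actual handling in run_server.py may differ):

```python
def parse_throughput(value: str):
    # Hypothetical normalizer for a mixed keyword/float flag:
    # keywords select a benchmarking mode, anything else must be a number
    if value in {"auto", "eval", "dry_run"}:
        return value           # special benchmarking modes
    return float(value)        # explicit throughput figure

assert parse_throughput("dry_run") == "dry_run"
assert parse_throughput("12.5") == 12.5
```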
Outputs
| Name | Type | Description |
|---|---|---|
| server | Server | Running Server instance; main() does not return a value, it blocks until shutdown |
| DHT announcements | dict | Server blocks announced as ONLINE in the hivemind DHT |
Usage Examples
Basic Server Launch
# Serve blocks from StableBeluga2 (auto-detects GPU memory and block count)
python -m petals.cli.run_server petals-team/StableBeluga2
# Serve specific blocks with NF4 quantization
python -m petals.cli.run_server petals-team/StableBeluga2 \
--block_indices 0:18 \
--quant_type nf4
# Multi-GPU tensor parallelism
python -m petals.cli.run_server petals-team/StableBeluga2 \
--tensor_parallel_devices cuda:0 cuda:1
Docker Launch
docker run -p 31330:31330 --ipc host --gpus all \
--volume petals-cache:/cache --rm \
learningathome/petals:main \
python -m petals.cli.run_server petals-team/StableBeluga2
Related Pages
Implements Principle
Requires Environment