Workflow: BigScience Workshop Petals Server Contribution
| Knowledge Sources | Details |
|---|---|
| Domains | LLMs, Distributed_Systems, GPU_Serving |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
End-to-end process for contributing GPU resources to the Petals distributed network by running a server that hosts a contiguous range of transformer blocks for collaborative inference and training.
Description
This workflow covers how a volunteer sets up and runs a Petals server that hosts a subset of transformer blocks from a large language model. The server downloads and loads block weights, connects to the hivemind DHT to announce its availability, and handles incoming inference and training requests from clients via P2P gRPC streams. The server automatically selects which blocks to serve based on current swarm coverage, periodically rebalances to fill gaps, manages a shared KV cache for inference sessions, and supports optional features like quantization (INT8/NF4), tensor parallelism across multiple GPUs, and pre-loaded LoRA adapters.
Usage
Execute this workflow when you have one or more GPUs available and want to contribute compute capacity to the Petals swarm. This enables other users to run inference and training on large language models they could not host locally. You can serve models from the public swarm or set up a private swarm for controlled access.
Execution Steps
Step 1: Environment_Setup
Install PyTorch with CUDA support and the Petals library. For gated models (e.g., Llama), authenticate with HuggingFace. Optionally, use Docker for containerized deployment.
Key considerations:
- Match PyTorch CUDA version to your GPU driver
- Docker images are available at learningathome/petals:main
- On macOS with Apple Silicon, use CPU or MPS backend
- On Windows, use WSL2
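The setup above can be sketched with the following commands. The PyTorch wheel index, the gated-model login step, and the Docker flags are illustrative and may differ across Petals versions; the model name is a placeholder.

```shell
# Install PyTorch with a CUDA build matching your driver (cu121 is illustrative)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install Petals itself
pip install git+https://github.com/bigscience-workshop/petals

# Only needed for gated models such as Llama
huggingface-cli login

# Or run containerized, using the image mentioned above (flags are illustrative)
docker run --gpus all --ipc host -p 31330:31330 --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 <model_name>
```

The `--volume petals-cache:/cache` mount keeps downloaded block weights across container restarts, so the server does not re-fetch them on every launch.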
Step 2: Server_Launch
Start the server using the CLI entry point (python -m petals.cli.run_server), passing the model name as the positional argument. The CLI parses configuration options including network ports, DHT initial peers, and hardware settings.
Key considerations:
- The model name must match a HuggingFace model repository (e.g., meta-llama/Meta-Llama-3.1-405B-Instruct)
- Default initial_peers connect to the public swarm bootstrap nodes
- Use --new_swarm to start a private swarm instead
- Use --port to specify a fixed listening port, or let the system choose a random free port
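For example, the two launch modes above can be invoked as follows (the port number is illustrative; the flags are those discussed in this step):

```shell
# Join the public swarm via the default bootstrap peers, on a fixed port
python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct --port 31330

# Start a private swarm instead of joining the public one
python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct --new_swarm
```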
Step 3: Configuration_And_Resource_Estimation
The server loads the model configuration, detects available GPU memory, and estimates how many blocks can be served. Throughput is benchmarked automatically on the first run and cached for subsequent starts. Quantization type defaults to NF4 for GPU servers.
Key considerations:
- Use --num_blocks to override automatic block count selection
- Use --block_indices start:end to serve specific blocks
- Use --quant_type to choose between none, int8, or nf4 quantization
- Use --tensor_parallel_devices to split blocks across multiple GPUs
- Throughput measurement can be forced with --throughput eval
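As a back-of-envelope sketch of the block-count estimate described above: divide usable GPU memory by the size of one quantized block. The constants here (per-block parameter count, bytes per parameter under NF4, reserved memory) are illustrative assumptions, not Petals' actual heuristics.

```python
def estimate_num_blocks(free_gib: float, hidden_size: int,
                        bytes_per_param: float = 0.55,  # assumed NF4 footprint
                        reserve_gib: float = 2.0) -> int:
    """Roughly: usable memory divided by the size of one quantized block.

    Assumes a standard transformer block with ~12 * hidden_size^2 parameters
    (attention ~4h^2 + MLP ~8h^2), and reserves some memory for the KV cache
    and activations. Purely illustrative arithmetic.
    """
    params_per_block = 12 * hidden_size ** 2
    block_gib = params_per_block * bytes_per_param / 2 ** 30
    return max(0, int((free_gib - reserve_gib) // block_gib))

# e.g. a 24 GiB GPU serving blocks with hidden_size=8192
print(estimate_num_blocks(24.0, 8192))  # → 53
```

In practice the real estimate also accounts for attention-head layout, the chosen --attn_cache_tokens budget, and measured rather than assumed quantization overhead, which is why --num_blocks exists as a manual override.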
Step 4: DHT_Registration_And_Block_Selection
The server connects to the hivemind DHT, queries the network for existing block coverage, and uses a greedy algorithm to select the least-served contiguous block range. It announces its blocks and throughput to the DHT so clients can discover and route to it.
Key considerations:
- Block selection minimizes the bottleneck throughput across the model
- The server advertises its state (JOINING, ONLINE) and measured throughput
- DHT entries are refreshed periodically (default: every 120 seconds)
- A reachability check validates the server is accessible from the public internet
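The greedy "serve the least-covered span" idea can be sketched as below. This is an illustrative simplification: Petals' real block_selection logic is more involved, but the core intuition of filling the deepest coverage gap to raise the swarm's bottleneck throughput is the same.

```python
def choose_span(coverage: list[float], num_blocks: int) -> int:
    """Return the start index of the contiguous span with the worst coverage.

    coverage[i] = aggregate throughput already announced for block i.
    Prefer the window whose weakest block is weakest overall; break ties
    by the lowest total coverage, so the deepest gap is filled first.
    """
    best_start, best_key = 0, (float("inf"), float("inf"))
    for start in range(len(coverage) - num_blocks + 1):
        window = coverage[start:start + num_blocks]
        key = (min(window), sum(window))
        if key < best_key:
            best_start, best_key = start, key
    return best_start

# Blocks 3-4 have no servers at all, so a 3-block server lands there
coverage = [5.0, 5.0, 2.0, 0.0, 0.0, 1.0, 4.0, 4.0]
print(choose_span(coverage, 3))  # → 3
```

Because every client request must traverse all blocks, the model's end-to-end throughput is bounded by its weakest span, which is why new capacity goes to the minimum first.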
Step 5: Block_Loading_And_Serving
The server downloads and loads the transformer block weights for its assigned range. Each block is optionally quantized and wrapped for efficient execution. A TransformerConnectionHandler processes incoming P2P RPC requests for inference, forward, and backward operations. A MemoryCache manages shared KV cache allocation for concurrent inference sessions.
Key considerations:
- Blocks are cached on disk with LRU eviction to avoid re-downloading
- CUDA graphs are captured for single-token inference to minimize kernel launch overhead
- The handler supports concurrent requests via a PrioritizedTaskPool
- Pre-loaded LoRA adapters can be specified with --adapters for multi-adapter serving
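The on-disk LRU eviction mentioned above can be illustrated with a minimal sketch; the class, names, and sizes here are hypothetical and not Petals' actual cache layer.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Toy LRU cache: evict the least-recently-used blocks when full."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.entries: OrderedDict[str, int] = OrderedDict()  # block id -> size
        self.used = 0

    def touch(self, block_id: str) -> None:
        """Mark a cached block as recently used (moves it to the MRU end)."""
        self.entries.move_to_end(block_id)

    def add(self, block_id: str, size: int) -> list[str]:
        """Insert a block, evicting least-recently-used entries if needed."""
        evicted = []
        while self.used + size > self.capacity and self.entries:
            old_id, old_size = self.entries.popitem(last=False)  # LRU end
            self.used -= old_size
            evicted.append(old_id)
        self.entries[block_id] = size
        self.used += size
        return evicted

cache = LRUBlockCache(capacity_bytes=10)
cache.add("block_0", 4)
cache.add("block_1", 4)
cache.touch("block_0")           # block_1 is now least recently used
print(cache.add("block_2", 4))   # → ['block_1']
```

The payoff for a Petals server is on rebalancing: when it moves to an adjacent block range, weights it served recently are often still on disk and need no re-download.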
Step 6: Health_Monitoring_And_Rebalancing
The server continuously monitors its health and the swarm's balance. If the swarm becomes imbalanced (some block ranges are underserved), the server can automatically shut down its current blocks, select a better range, and restart. The server also handles graceful shutdown on keyboard interrupt.
Key considerations:
- Balance quality threshold is configurable (default: 0.75)
- Rebalancing checks occur at random intervals around the mean_balance_check_period
- The server restarts its module container if a subprocess crashes
- GPU memory is explicitly cleaned between block changes
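The rebalancing decision can be sketched as follows. The 0.75 threshold mirrors the documented default, but the quality metric and the jittered scheduling shown here are illustrative assumptions, not Petals' exact implementation.

```python
import random

def should_rebalance(current_bottleneck: float, achievable_bottleneck: float,
                     quality_threshold: float = 0.75) -> bool:
    """Rebalance when the swarm's current bottleneck throughput falls below
    quality_threshold times what a rebalanced layout could achieve."""
    return current_bottleneck < quality_threshold * achievable_bottleneck

def next_check_delay(mean_balance_check_period: float = 120.0) -> float:
    """Pick a random delay around the mean so servers don't all probe
    (and rebalance) in lockstep.  The 120 s default is illustrative."""
    return random.uniform(0, 2 * mean_balance_check_period)

# Current bottleneck 6.0 vs achievable 10.0: 6.0 < 0.75 * 10.0, so rebalance
print(should_rebalance(6.0, 10.0))  # → True
```

Keeping the threshold below 1.0 adds hysteresis: a server only tears down and reloads blocks when the expected improvement is substantial, avoiding churn where many servers chase marginal gains simultaneously.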