Workflow: BigScience Workshop Petals Server Contribution
| Knowledge Sources | Details |
|---|---|
| Domains | LLMs, Distributed_Systems, GPU_Serving |
| Last Updated | 2026-02-09 13:00 GMT |
Overview
End-to-end process for contributing GPU resources to the Petals distributed network by running a server that hosts a contiguous range of transformer blocks for collaborative inference and training.
Description
This workflow covers how a volunteer sets up and runs a Petals server that hosts a subset of transformer blocks from a large language model. The server downloads and loads block weights, connects to the hivemind DHT to announce its availability, and handles incoming inference and training requests from clients via P2P gRPC streams. The server automatically selects which blocks to serve based on current swarm coverage, periodically rebalances to fill gaps, manages a shared KV cache for inference sessions, and supports optional features like quantization (INT8/NF4), tensor parallelism across multiple GPUs, and pre-loaded LoRA adapters.
Usage
Execute this workflow when you have one or more GPUs available and want to contribute compute capacity to the Petals swarm. This enables other users to run inference and training on large language models they could not host locally. You can serve models from the public swarm or set up a private swarm for controlled access.
Execution Steps
Step 1: Environment_Setup
Install PyTorch with CUDA support and the Petals library. For gated models (e.g., Llama), authenticate with HuggingFace. Optionally, use Docker for containerized deployment.
Key considerations:
- Match PyTorch CUDA version to your GPU driver
- Docker images are available at learningathome/petals:main
- On macOS with Apple Silicon, use CPU or MPS backend
- On Windows, use WSL2
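The setup above can be sketched with the following commands. The PyTorch wheel index, the gated-model login step, and the Docker flags are illustrative and may differ across Petals versions; the model name is a placeholder.

```shell
# Install PyTorch with a CUDA build matching your driver (cu121 is illustrative)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install Petals itself
pip install git+https://github.com/bigscience-workshop/petals

# Only needed for gated models such as Llama
huggingface-cli login

# Or run containerized, using the image mentioned above (flags are illustrative)
docker run --gpus all --ipc host -p 31330:31330 --volume petals-cache:/cache --rm \
    learningathome/petals:main \
    python -m petals.cli.run_server --port 31330 <model_name>
```

The `--volume petals-cache:/cache` mount keeps downloaded block weights across container restarts, so the server does not re-fetch them on every launch.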
Step 2: Server_Launch
Start the server using the CLI entry point (python -m petals.cli.run_server), passing the model name as the positional argument. The CLI parses configuration options including network ports, DHT initial peers, and hardware settings.
Key considerations:
- The model name must match a HuggingFace model repository (e.g., meta-llama/Meta-Llama-3.1-405B-Instruct)
- Default initial_peers connect to the public swarm bootstrap nodes
- Use --new_swarm to start a private swarm instead
- Use --port to specify a fixed listening port, or let the system choose a random free port
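For example, the two launch modes above can be invoked as follows (the port number is illustrative; the flags are those discussed in this step):

```shell
# Join the public swarm via the default bootstrap peers, on a fixed port
python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct --port 31330

# Start a private swarm instead of joining the public one
python -m petals.cli.run_server meta-llama/Meta-Llama-3.1-405B-Instruct --new_swarm
```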
Step 3: Configuration_And_Resource_Estimation
The server loads the model configuration, detects available GPU memory, and estimates how many blocks can be served. Throughput is benchmarked automatically on the first run and cached for subsequent starts. Quantization type defaults to NF4 for GPU servers.
Key considerations:
- Use --num_blocks to override automatic block count selection
- Use --block_indices start:end to serve specific blocks
- Use --quant_type to choose between none, int8, or nf4 quantization
- Use --tensor_parallel_devices to split blocks across multiple GPUs
- Throughput measurement can be forced with --throughput eval
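As a back-of-envelope sketch of the block-count estimate described above: divide usable GPU memory by the size of one quantized block. The constants here (per-block parameter count, bytes per parameter under NF4, reserved memory) are illustrative assumptions, not Petals' actual heuristics.

```python
def estimate_num_blocks(free_gib: float, hidden_size: int,
                        bytes_per_param: float = 0.55,  # assumed NF4 footprint
                        reserve_gib: float = 2.0) -> int:
    """Roughly: usable memory divided by the size of one quantized block.

    Assumes a standard transformer block with ~12 * hidden_size^2 parameters
    (attention ~4h^2 + MLP ~8h^2), and reserves some memory for the KV cache
    and activations. Purely illustrative arithmetic.
    """
    params_per_block = 12 * hidden_size ** 2
    block_gib = params_per_block * bytes_per_param / 2 ** 30
    return max(0, int((free_gib - reserve_gib) // block_gib))

# e.g. a 24 GiB GPU serving blocks with hidden_size=8192
print(estimate_num_blocks(24.0, 8192))  # → 53
```

In practice the real estimate also accounts for attention-head layout, the chosen --attn_cache_tokens budget, and measured rather than assumed quantization overhead, which is why --num_blocks exists as a manual override.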
Step 4: DHT_Registration_And_Block_Selection
The server connects to the hivemind DHT, queries the network for existing block coverage, and uses a greedy algorithm to select the least-served contiguous block range. It announces its blocks and throughput to the DHT so clients can discover and route to it.
Key considerations:
- Block selection minimizes the bottleneck throughput across the model
- The server advertises its state (JOINING, ONLINE) and measured throughput
- DHT entries are refreshed periodically (default: every 120 seconds)
- A reachability check validates the server is accessible from the public internet
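The greedy "serve the least-covered span" idea can be sketched as below. This is an illustrative simplification: Petals' real block_selection logic is more involved, but the core intuition of filling the deepest coverage gap to raise the swarm's bottleneck throughput is the same.

```python
def choose_span(coverage: list[float], num_blocks: int) -> int:
    """Return the start index of the contiguous span with the worst coverage.

    coverage[i] = aggregate throughput already announced for block i.
    Prefer the window whose weakest block is weakest overall; break ties
    by the lowest total coverage, so the deepest gap is filled first.
    """
    best_start, best_key = 0, (float("inf"), float("inf"))
    for start in range(len(coverage) - num_blocks + 1):
        window = coverage[start:start + num_blocks]
        key = (min(window), sum(window))
        if key < best_key:
            best_start, best_key = start, key
    return best_start

# Blocks 3-4 have no servers at all, so a 3-block server lands there
coverage = [5.0, 5.0, 2.0, 0.0, 0.0, 1.0, 4.0, 4.0]
print(choose_span(coverage, 3))  # → 3
```

Because every client request must traverse all blocks, the model's end-to-end throughput is bounded by its weakest span, which is why new capacity goes to the minimum first.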
Step 5: Block_Loading_And_Serving
The server downloads and loads the transformer block weights for its assigned range. Each block is optionally quantized and wrapped for efficient execution. A TransformerConnectionHandler processes incoming P2P RPC requests for inference, forward, and backward operations. A MemoryCache manages shared KV cache allocation for concurrent inference sessions.
Key considerations:
- Blocks are cached on disk with LRU eviction to avoid re-downloading
- CUDA graphs are captured for single-token inference to minimize kernel launch overhead
- The handler supports concurrent requests via a PrioritizedTaskPool
- Pre-loaded LoRA adapters can be specified with --adapters for multi-adapter serving
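The on-disk LRU eviction mentioned above can be illustrated with a minimal sketch; the class, names, and sizes here are hypothetical and not Petals' actual cache layer.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Toy LRU cache: evict the least-recently-used blocks when full."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.entries: OrderedDict[str, int] = OrderedDict()  # block id -> size
        self.used = 0

    def touch(self, block_id: str) -> None:
        """Mark a cached block as recently used (moves it to the MRU end)."""
        self.entries.move_to_end(block_id)

    def add(self, block_id: str, size: int) -> list[str]:
        """Insert a block, evicting least-recently-used entries if needed."""
        evicted = []
        while self.used + size > self.capacity and self.entries:
            old_id, old_size = self.entries.popitem(last=False)  # LRU end
            self.used -= old_size
            evicted.append(old_id)
        self.entries[block_id] = size
        self.used += size
        return evicted

cache = LRUBlockCache(capacity_bytes=10)
cache.add("block_0", 4)
cache.add("block_1", 4)
cache.touch("block_0")           # block_1 is now least recently used
print(cache.add("block_2", 4))   # → ['block_1']
```

The payoff for a Petals server is on rebalancing: when it moves to an adjacent block range, weights it served recently are often still on disk and need no re-download.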
Step 6: Health_Monitoring_And_Rebalancing
The server continuously monitors its health and the swarm's balance. If the swarm becomes imbalanced (some block ranges are underserved), the server can automatically shut down its current blocks, select a better range, and restart. The server also handles graceful shutdown on keyboard interrupt.
Key considerations:
- Balance quality threshold is configurable (default: 0.75)
- Rebalancing checks occur at random intervals around the mean_balance_check_period
- The server restarts its module container if a subprocess crashes
- GPU memory is explicitly cleaned between block changes
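The rebalancing decision can be sketched as follows. The 0.75 threshold mirrors the documented default, but the quality metric and the jittered scheduling shown here are illustrative assumptions, not Petals' exact implementation.

```python
import random

def should_rebalance(current_bottleneck: float, achievable_bottleneck: float,
                     quality_threshold: float = 0.75) -> bool:
    """Rebalance when the swarm's current bottleneck throughput falls below
    quality_threshold times what a rebalanced layout could achieve."""
    return current_bottleneck < quality_threshold * achievable_bottleneck

def next_check_delay(mean_balance_check_period: float = 120.0) -> float:
    """Pick a random delay around the mean so servers don't all probe
    (and rebalance) in lockstep.  The 120 s default is illustrative."""
    return random.uniform(0, 2 * mean_balance_check_period)

# Current bottleneck 6.0 vs achievable 10.0: 6.0 < 0.75 * 10.0, so rebalance
print(should_rebalance(6.0, 10.0))  # → True
```

Keeping the threshold below 1.0 adds hysteresis: a server only tears down and reloads blocks when the expected improvement is substantial, avoiding churn where many servers chase marginal gains simultaneously.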