Principle:Bigscience workshop Petals Health Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Infrastructure, Monitoring |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
A continuous monitoring loop that checks server health, evaluates swarm balance, and triggers automatic rebalancing when the network's block coverage becomes suboptimal.
Description
Health Monitoring ensures the long-term stability of both individual servers and the overall Petals swarm. The monitoring loop runs in Server.run() and performs:
Container health checks:
- Verifies that the ModuleContainer thread is alive and responsive
- Detects crashes or OOM failures in block serving
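A liveness check of this kind can be sketched with plain Python threads. `ToyContainer` and its methods are hypothetical stand-ins for illustration, not the actual ModuleContainer API; the sketch assumes "healthy" means every worker thread is still running.

```python
import threading


class ToyContainer:
    """Hypothetical stand-in for a container that owns several worker threads."""

    def __init__(self, num_workers: int = 2):
        self._stop = threading.Event()
        self.workers = [
            threading.Thread(target=self._serve, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self.workers:
            worker.start()

    def _serve(self) -> None:
        # Pretend to serve requests until asked to stop.
        self._stop.wait()

    def is_healthy(self) -> bool:
        # Healthy only if every worker thread is still running.
        return all(worker.is_alive() for worker in self.workers)

    def shutdown(self) -> None:
        self._stop.set()
        for worker in self.workers:
            worker.join()


container = ToyContainer()
healthy_while_running = container.is_healthy()
container.shutdown()
healthy_after_shutdown = container.is_healthy()
```

A worker that dies from an unhandled exception also stops being alive, so the same check covers crashes as well as deliberate shutdowns.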
Swarm balance evaluation:
- Periodically queries DHT for current block coverage
- Calls should_choose_other_blocks() to check if the throughput distribution is imbalanced
- Uses a randomized check interval (mean set by mean_balance_check_period, default 120s) so that servers do not all check at once (the thundering-herd problem)
- Adds a random delay (mean_block_selection_delay, default 5s) before rebalancing to prevent race conditions
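One simple way to get a randomized wait with a given mean is to draw uniformly from zero to twice the mean. The sketch below assumes that distribution for illustration; the `jittered` helper is hypothetical, and only the two default values come from the description above.

```python
import random

MEAN_BALANCE_CHECK_PERIOD = 120.0  # seconds, default quoted above
MEAN_BLOCK_SELECTION_DELAY = 5.0   # seconds, default quoted above


def jittered(mean: float, rng: random.Random) -> float:
    """Draw a wait uniformly from [0, 2 * mean), so the average equals mean."""
    return rng.random() * 2 * mean


rng = random.Random(0)  # seeded only to make the sketch reproducible
check_waits = [jittered(MEAN_BALANCE_CHECK_PERIOD, rng) for _ in range(10_000)]
average_wait = sum(check_waits) / len(check_waits)

# The same helper would cover the short delay before selecting new blocks.
selection_delay = jittered(MEAN_BLOCK_SELECTION_DELAY, rng)
```

Because each server draws its own wait independently, two servers started at the same moment will still tend to check the swarm at different times.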
Rebalancing:
- If imbalanced: shuts down current blocks, selects new optimal blocks, restarts serving
- If container unhealthy: restarts the entire container with the same or new blocks
Graceful shutdown:
- Catches KeyboardInterrupt for clean shutdown
- Calls Server.shutdown() which stops the container and de-registers from DHT
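The shutdown path can be sketched with a `try`/`except`/`finally` block. `ToyServer` is a hypothetical class that records events instead of touching a real container or DHT; the cleanup order simply follows the order listed above.

```python
class ToyServer:
    """Hypothetical server; only the shutdown control flow is of interest."""

    def __init__(self):
        self.events = []

    def serve_forever(self) -> None:
        self.events.append("serving")
        # Simulate the operator pressing Ctrl+C while blocks are being served.
        raise KeyboardInterrupt

    def shutdown(self) -> None:
        # Assumed order: stop the container, then leave the DHT.
        self.events.append("container stopped")
        self.events.append("deregistered from DHT")


def run(server: ToyServer) -> None:
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        server.events.append("interrupt caught")
    finally:
        # Runs on both normal exit and interrupt, so cleanup is never skipped.
        server.shutdown()


server = ToyServer()
run(server)
```

Placing the cleanup in `finally` rather than in the `except` branch means the server also leaves the swarm cleanly when the loop exits for any other reason.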
Usage
The monitoring loop starts automatically when Server.run() is called. Server operators can tune balance checking with the --balance_quality and --mean_balance_check_period CLI flags.
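To give a feel for what a quality threshold controls, here is a simplified sketch. It assumes swarm throughput is limited by the least-covered block, and that a server should move when the current bottleneck falls below `balance_quality` times the best bottleneck it could achieve elsewhere. The function names, the brute-force search, and the value 0.75 are illustrative assumptions, not the actual should_choose_other_blocks() logic.

```python
from typing import List


def place(others: List[float], start: int, span: int, throughput: float) -> List[float]:
    """Per-block throughput after this server serves blocks [start, start + span)."""
    result = list(others)
    for i in range(start, start + span):
        result[i] += throughput
    return result


def should_rebalance(
    others: List[float],     # throughput offered by all other servers, per block
    current_start: int,      # first block this server currently serves
    span: int,               # number of consecutive blocks it serves
    throughput: float,       # this server's own throughput
    balance_quality: float,  # threshold in (0, 1]; lower values tolerate more imbalance
) -> bool:
    current = min(place(others, current_start, span, throughput))
    best = max(
        min(place(others, start, span, throughput))
        for start in range(len(others) - span + 1)
    )
    if best == 0:
        return False  # no placement helps, so moving is pointless
    return current / best < balance_quality


others = [10.0, 10.0, 2.0, 2.0]
move_from_strong_end = should_rebalance(others, 0, 2, 8.0, 0.75)  # True
move_from_weak_end = should_rebalance(others, 2, 2, 8.0, 0.75)    # False
```

In this toy swarm the server at blocks 0-1 leaves the bottleneck at 2, while moving to blocks 2-3 would raise it to 10, so the ratio 0.2 falls below the threshold and a move is warranted.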
Theoretical Basis
Decentralized rebalancing protocol:
# Abstract health monitoring loop
def run(server):
    while True:
        if not container.is_healthy():
            restart_container()
        if time_since_last_check > randomized_period():
            swarm_state = query_dht()
            if should_rebalance(swarm_state, balance_quality):
                shutdown_current_blocks()
                new_blocks = choose_best_blocks(num_blocks, swarm_state)
                start_serving(new_blocks)
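Why randomized intervals: If multiple servers check at the same moment, they all see the same swarm state and may all decide to move to the same blocks. Randomized timing makes such collisions unlikely, because each server usually sees the moves already announced by the others before it decides.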
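The effect can be illustrated with a toy model. It assumes each moving server joins the block with the fewest servers and that a staggered server sees every earlier move before choosing. All names here are hypothetical, and real block selection works on spans and throughputs rather than single-block counts.

```python
from typing import List


def weakest_block(coverage: List[int]) -> int:
    """Index of the block with the fewest servers (ties go to the lowest index)."""
    return coverage.index(min(coverage))


def simultaneous(coverage: List[int], movers: int) -> List[int]:
    """All movers decide from the same stale snapshot."""
    snapshot = list(coverage)
    result = list(coverage)
    for _ in range(movers):
        result[weakest_block(snapshot)] += 1
    return result


def staggered(coverage: List[int], movers: int) -> List[int]:
    """Each mover decides after seeing the previous mover's announcement."""
    result = list(coverage)
    for _ in range(movers):
        result[weakest_block(result)] += 1
    return result


initial = [0, 0, 0, 3]
piled_up = simultaneous(initial, movers=3)  # [3, 0, 0, 3]
spread_out = staggered(initial, movers=3)   # [1, 1, 1, 3]
```

With simultaneous decisions two blocks remain uncovered, whereas staggered decisions cover every block using the same three servers.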