Principle:Bigscience workshop Petals Health Monitoring

Knowledge Sources	Petals; Petals: Collaborative Inference and Fine-tuning of Large Models
Domains	Distributed_Computing, Infrastructure, Monitoring
Last Updated	2026-02-09 14:00 GMT

Overview

A continuous monitoring loop that checks server health, evaluates swarm balance, and triggers automatic rebalancing when the network's block coverage becomes suboptimal.

Description

Health Monitoring ensures the long-term stability of both individual servers and the overall Petals swarm. The monitoring loop runs in Server.run() and performs:

Container health checks:

Verifies that the ModuleContainer thread is alive and responsive
Detects crashes or OOM failures in block serving
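
As a rough illustration of the container check, the sketch below treats a container as healthy only while every worker thread it owns is alive. `ToyContainer` and its `workers` attribute are hypothetical stand-ins, not the Petals `ModuleContainer` API.

```python
import threading


class ToyContainer:
    """Hypothetical stand-in for a container that owns worker threads."""

    def __init__(self, num_workers: int = 2):
        self._stop = threading.Event()
        # Each worker simply blocks until shutdown is requested.
        self.workers = [
            threading.Thread(target=self._stop.wait, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self.workers:
            worker.start()

    def is_healthy(self) -> bool:
        # Healthy only if every worker thread is still running.
        return all(worker.is_alive() for worker in self.workers)

    def shutdown(self) -> None:
        # Signal the workers to exit and wait for them to finish.
        self._stop.set()
        for worker in self.workers:
            worker.join()


container = ToyContainer()
healthy_while_running = container.is_healthy()
container.shutdown()
healthy_after_shutdown = container.is_healthy()
```

A liveness check of this kind notices a thread that has died, but not one that is alive and stuck, so a real responsiveness check would need an additional signal.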

Swarm balance evaluation:

Periodically queries DHT for current block coverage
Calls should_choose_other_blocks() to check if the throughput distribution is imbalanced
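Uses a randomized check interval (mean_balance_check_period, default 120s) so that servers do not all query the DHT at the same moment (a thundering herd)
Adds a random delay (mean_block_selection_delay, default 5s) before choosing blocks to reduce the chance that servers deciding at the same time pick the same blocks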
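
One simple way to randomize an interval around a given mean is to draw it uniformly from zero to twice the mean. The text above only states that the interval is randomized with a mean of `mean_balance_check_period`, so the uniform distribution and the helper name `randomized_period` are assumptions made for this sketch.

```python
import random


def randomized_period(mean_period: float, rng: random.Random) -> float:
    # Uniform on [0, 2 * mean_period), so the expected value is mean_period.
    return rng.random() * 2 * mean_period


rng = random.Random(0)  # seeded so the example is reproducible
samples = [randomized_period(120.0, rng) for _ in range(10_000)]
average = sum(samples) / len(samples)
```

The same helper could be used with a mean of 5 seconds for the block selection delay.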

Rebalancing:

If imbalanced: shuts down current blocks, selects new optimal blocks, restarts serving
If container unhealthy: restarts the entire container with the same or new blocks
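
The imbalance decision can be sketched as a comparison of the swarm's bottleneck throughput before and after a hypothetical move. This is a simplified model, not the Petals implementation: `pick_weakest_window` and `should_move` are hypothetical names, per-block throughput is a plain list, and the 0.75 threshold is only an example value for `balance_quality`.

```python
from typing import List


def pick_weakest_window(num_blocks: int, throughputs: List[float]) -> List[int]:
    # Choose the contiguous run of num_blocks with the lowest total throughput.
    starts = range(len(throughputs) - num_blocks + 1)
    best = min(starts, key=lambda s: sum(throughputs[s:s + num_blocks]))
    return list(range(best, best + num_blocks))


def should_move(throughputs: List[float], my_blocks: List[int],
                my_throughput: float, balance_quality: float) -> bool:
    current_min = min(throughputs)  # bottleneck with the server where it is now
    # Remove this server's contribution, then re-add it at the weakest window.
    moved = list(throughputs)
    for i in my_blocks:
        moved[i] -= my_throughput
    for i in pick_weakest_window(len(my_blocks), moved):
        moved[i] += my_throughput
    new_min = min(moved)  # bottleneck after the hypothetical move
    if new_min <= 0:
        return False
    # Move only if the current bottleneck is clearly worse than the achievable one.
    return current_min / new_min < balance_quality


# Blocks 4-5 are under-served while this server doubles up on blocks 0-1.
imbalanced = should_move([2.0, 2.0, 1.0, 1.0, 0.5, 0.5], [0, 1], 1.0, 0.75)
# After moving to blocks 4-5, no better placement exists.
balanced = should_move([1.0, 1.0, 1.0, 1.0, 1.5, 1.5], [4, 5], 1.0, 0.75)
```

In this model a lower `balance_quality` makes the server tolerate more imbalance before it moves, which trades swarm throughput for fewer disruptive restarts.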

Graceful shutdown:

Catches KeyboardInterrupt for clean shutdown
Calls Server.shutdown() which stops the container and de-registers from DHT
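
The shutdown path follows a common try/except/finally pattern, sketched here with a hypothetical `ToyServer` whose `run()` raises `KeyboardInterrupt` to simulate Ctrl+C. The class and the `serve` wrapper are illustrative and not part of Petals.

```python
class ToyServer:
    """Hypothetical server that records what happened to it."""

    def __init__(self):
        self.events = []

    def run(self):
        self.events.append("run")
        raise KeyboardInterrupt  # simulate the operator pressing Ctrl+C

    def shutdown(self):
        # A real server would stop its container and leave the DHT here.
        self.events.append("shutdown")


def serve(server):
    try:
        server.run()
    except KeyboardInterrupt:
        server.events.append("interrupted")
    finally:
        # Runs whether run() returned, was interrupted, or raised an error.
        server.shutdown()


server = ToyServer()
serve(server)
```

Placing the shutdown call in `finally` means cleanup also happens when `run()` fails with an unexpected exception.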

Usage

Health monitoring is active automatically whenever server.run() is called. Server operators can tune balance checking with the --balance_quality and --mean_balance_check_period CLI flags.

Theoretical Basis

Decentralized rebalancing protocol:

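# Abstract health monitoring loop. The server methods are illustrative names, not the Petals API.
import random
import time

def randomized_period(mean_period):
    # Uniform on [0, 2 * mean), so the average equals mean_period (the distribution is an assumption)
    return random.random() * 2 * mean_period

def run(server, poll_interval=1.0):
    next_check = time.monotonic() + randomized_period(server.mean_balance_check_period)
    while True:
        if not server.container.is_healthy():
            server.restart_container()

        if time.monotonic() >= next_check:
            swarm_state = server.query_dht()
            if server.should_rebalance(swarm_state, server.balance_quality):
                server.shutdown_current_blocks()
                new_blocks = server.choose_best_blocks(server.num_blocks, swarm_state)
                server.start_serving(new_blocks)
            next_check = time.monotonic() + randomized_period(server.mean_balance_check_period)

        time.sleep(poll_interval)  # avoid busy-waiting between checks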

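Why randomized intervals: If many servers checked the swarm at the same moment, they would see the same state and could all decide to move to the same under-served blocks, leaving the swarm imbalanced again. Randomized timing spreads the checks out, so each server is more likely to see the moves that others have already made.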
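
A toy simulation makes the effect concrete. It compares 50 servers that start together and check after a fixed 120 s with 50 servers that each draw their first check time uniformly from 0 to 240 s. The uniform distribution, the server count, and the helper names are assumptions chosen for illustration.

```python
import random
from typing import List


def first_check_times(num_servers: int, mean_period: float,
                      randomized: bool, rng: random.Random) -> List[float]:
    if randomized:
        # Each server draws its own delay with mean mean_period.
        return [rng.random() * 2 * mean_period for _ in range(num_servers)]
    # Fixed period: servers started together all check at the same time.
    return [mean_period] * num_servers


def max_simultaneous(times: List[float], window: float) -> int:
    # Largest number of checks that fall within any span of `window` seconds.
    ordered = sorted(times)
    best = 0
    start = 0
    for end in range(len(ordered)):
        while ordered[end] - ordered[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


rng = random.Random(42)  # seeded so the example is reproducible
fixed = max_simultaneous(first_check_times(50, 120.0, False, rng), window=1.0)
spread = max_simultaneous(first_check_times(50, 120.0, True, rng), window=1.0)
```

With a fixed period all 50 checks land in the same one-second window, while the randomized schedule leaves only a small number of servers checking at nearly the same time.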

Related Pages

Implemented By

Implementation:Bigscience_workshop_Petals_Server_Run
