Implementation:FMInference FlexLLMGen Petals Benchmark

Knowledge Sources	FMInference_FlexLLMGen
Domains	Benchmarking, Decentralized Inference, LLM Inference
Last Updated	2026-02-09 12:00 GMT

Overview

Multi-process benchmark harness for measuring throughput and latency of Petals decentralized inference, using OPT model configurations mapped onto the Petals BLOOM-based client.

Description

The run_opt_requests module benchmarks the Petals decentralized inference framework by spawning multiple client processes that concurrently issue generation requests against a Petals swarm. Because the Petals client is built around the BLOOM architecture, the module includes a _patch_bloom_config() helper that copies OPT model dimensions (hidden_size, num_attention_heads, num_hidden_layers, vocab_size) onto a DistributedBloomConfig, so that OPT-equivalent workloads can be benchmarked.
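The dimension mapping can be pictured as overwriting four shape fields on a config object. The following is a minimal sketch, not the module's actual helper: patch_config_sketch and OPT_DIMS are hypothetical names, a SimpleNamespace stands in for DistributedBloomConfig, and the attribute names are assumed to match the four listed above. The OPT values are taken from the published OPT configurations and should be checked against the source before use.

```python
from types import SimpleNamespace

# Hypothetical lookup table of OPT model shapes (values from published OPT configs).
OPT_DIMS = {
    "facebook/opt-30b": dict(
        hidden_size=7168, num_attention_heads=56,
        num_hidden_layers=48, vocab_size=50272,
    ),
    "facebook/opt-175b": dict(
        hidden_size=12288, num_attention_heads=96,
        num_hidden_layers=96, vocab_size=50272,
    ),
}


def patch_config_sketch(config, model_name):
    """Overwrite the shape fields of `config` with the OPT dimensions."""
    for field, value in OPT_DIMS[model_name].items():
        setattr(config, field, value)
    return config


# Stand-in for a BLOOM-style config object (not the real Petals class);
# the starting values are placeholders that the patch replaces.
config = SimpleNamespace(
    hidden_size=0, num_attention_heads=0, num_hidden_layers=0, vocab_size=0
)
patch_config_sketch(config, "facebook/opt-30b")
```

Only the shape fields change in this sketch; any swarm-related settings on the real config (such as the DHT peers) would be left untouched.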

Key components:

client_process() is the worker function spawned as a separate process. Each worker creates a DistributedBloomForCausalLM model instance connected to the Petals DHT swarm, performs a single-token warmup generation, then synchronizes with other workers via Event objects before executing num_micro_batches rounds of generation. Each round's latency is reported back via a Queue.

run_bench() orchestrates a single benchmark configuration: it spawns num_processes worker processes, waits for all to complete warmup, fires the start signal, and then joins all processes. After completion, it collects latency measurements from the queue, computes total throughput (tokens per second) and mean latency, and appends the results to a tab-separated output file.

main() parses CLI arguments (initial DHT peers, model prefix, batch size, number of micro-batches, number of processes, and output file), populates the DistributedBloomConfig with OPT-equivalent dimensions, and runs benchmarks across sequence lengths 256, 512, and 1024. For the OPT-30B model, it additionally runs a sweep over max_tokens from 0 to 32.
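The warmup-then-start handshake between run_bench() and its workers can be sketched as follows. This is an illustration under stated assumptions, not the module's code: threads stand in for processes so the sketch runs anywhere (threading.Event and queue.Queue expose the same set/wait/put/get methods as their multiprocessing counterparts), time.sleep stands in for Petals generation calls, one finished_warmup event per worker is assumed, and worker and run_sketch are hypothetical names.

```python
import queue
import threading
import time


def worker(finished_warmup, can_start, num_micro_batches, results):
    time.sleep(0.01)           # stand-in for the single-token warmup generation
    finished_warmup.set()      # tell the orchestrator this worker is ready
    can_start.wait()           # block until the orchestrator fires the start signal
    for _ in range(num_micro_batches):
        t0 = time.perf_counter()
        time.sleep(0.01)       # stand-in for one timed generation round
        results.put(time.perf_counter() - t0)


def run_sketch(num_workers=3, num_micro_batches=2):
    can_start = threading.Event()
    results = queue.Queue()
    warmups = [threading.Event() for _ in range(num_workers)]
    workers = [
        threading.Thread(target=worker, args=(ev, can_start, num_micro_batches, results))
        for ev in warmups
    ]
    for w in workers:
        w.start()
    for ev in warmups:
        ev.wait()              # wait until every worker has finished warmup
    t0 = time.perf_counter()
    can_start.set()            # release all workers at once
    for w in workers:
        w.join()
    wall_time = time.perf_counter() - t0
    latencies = [results.get() for _ in range(num_workers * num_micro_batches)]
    return latencies, wall_time


latencies, wall_time = run_sketch()
```

Releasing all workers from a single event keeps warmup cost (model construction, first connection to the swarm) out of the timed window.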

Usage

Run this module as a script to benchmark Petals inference. It requires a running Petals swarm with peers serving the target model. The results are used to compare decentralized inference throughput/latency against FlexLLMGen's centralized offloading approach.

Code Reference

Source Location

Repository: FMInference_FlexLLMGen
File: benchmark/petals/run_opt_requests.py
Lines: 1-133

Signature

def client_process(
    finished_warmup, can_start, config_bloom, num_micro_batches,
    batch_size, sequence_length, max_tokens, process_index, queue
) -> None:
    ...

def run_bench(args, sequence_length, max_tokens, config_bloom):
    ...

def main():
    ...

Import

from benchmark.petals.run_opt_requests import client_process, run_bench, main

I/O Contract

Inputs (client_process)

Name	Type	Required	Description
finished_warmup	multiprocessing.Event	Yes	Event that this worker sets after completing warmup generation.
can_start	multiprocessing.Event	Yes	Event that workers wait on before starting the timed benchmark runs.
config_bloom	DistributedBloomConfig	Yes	Petals model configuration with DHT peer information and model dimensions.
num_micro_batches	int	Yes	Number of generation rounds to execute per worker.
batch_size	int	Yes	Number of prompts per generation call.
sequence_length	int	Yes	Length of the random input sequence in tokens.
max_tokens	int	Yes	Number of new tokens to generate per call.
process_index	int	Yes	Index of this worker process (used for CUDA device assignment).
queue	multiprocessing.Queue	Yes	Queue for reporting per-round latency measurements back to the orchestrator.

Inputs (run_bench)

Name	Type	Required	Description
args	argparse.Namespace	Yes	Parsed CLI arguments containing batch_size, num_micro_batches, num_processes, and output file path.
sequence_length	int	Yes	Input sequence length for this benchmark run.
max_tokens	int	Yes	Number of tokens to generate.
config_bloom	DistributedBloomConfig	Yes	Petals model configuration.

Inputs (main CLI arguments)

Name	Type	Required	Description
--initial_peers	list of str	No	Multiaddrs of Petals DHT peers (e.g., /ip4/203.0.113.1/tcp/31337/p2p/XXXX).
--prefix	str	No	Model prefix/identifier (default: "facebook/opt-175b").
--batch-size	int	No	Batch size per worker (default: 1).
--num-micro-batches	int	No	Rounds per worker (default: 1).
--num-processes	int	No	Number of concurrent client processes (default: 1).
--output	str	Yes	Path to the output file for tab-separated results.
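A parser consistent with the table above might look like the sketch below. The flag names and documented defaults come from the table; the use of nargs="+" for --initial_peers and its empty-list default are assumptions, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Petals benchmark CLI (sketch)")
    # Assumed: one or more multiaddrs, defaulting to an empty list.
    parser.add_argument("--initial_peers", type=str, nargs="+", default=[])
    parser.add_argument("--prefix", type=str, default="facebook/opt-175b")
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--num-micro-batches", type=int, default=1)
    parser.add_argument("--num-processes", type=int, default=1)
    parser.add_argument("--output", type=str, required=True)
    return parser


# argparse converts hyphens in flag names to underscores on the namespace.
args = build_parser().parse_args(["--num-processes", "4", "--output", "results.tsv"])
```

Here args.num_processes is 4 and the remaining fields fall back to the defaults listed in the table.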

Outputs

Name	Type	Description
output file	TSV file	Tab-separated file with columns: batch_size, num_micro_batches, num_processes, sequence_length, max_tokens, throughput (tokens/s), mean_latency (seconds). Results are appended for each benchmark run.
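One way to compute and append a row in this layout is sketched below. The exact formulas are assumptions: throughput is taken here as generated tokens across all workers divided by wall-clock time, the file is assumed to carry no header row, and summarize, append_row, and read_rows are hypothetical helper names.

```python
import csv
import io
from statistics import mean

COLUMNS = [
    "batch_size", "num_micro_batches", "num_processes",
    "sequence_length", "max_tokens", "throughput", "mean_latency",
]


def summarize(latencies, wall_time, batch_size, num_micro_batches,
              num_processes, max_tokens):
    """Return (tokens per second, mean per-round latency in seconds)."""
    # Assumed: only newly generated tokens count toward throughput.
    total_tokens = batch_size * num_micro_batches * num_processes * max_tokens
    return total_tokens / wall_time, mean(latencies)


def append_row(fh, *values):
    """Write one tab-separated result row to an open file handle."""
    csv.writer(fh, delimiter="\t", lineterminator="\n").writerow(values)


def read_rows(fh):
    """Parse header-less TSV rows into dicts keyed by column name."""
    return [dict(zip(COLUMNS, row)) for row in csv.reader(fh, delimiter="\t")]


throughput, latency = summarize(
    [2.0, 4.0], wall_time=4.0,
    batch_size=1, num_micro_batches=1, num_processes=2, max_tokens=32,
)

# An in-memory buffer stands in for a file opened in append mode.
buf = io.StringIO()
append_row(buf, 1, 1, 2, 512, 32, throughput, latency)
buf.seek(0)
rows = read_rows(buf)
```

With a real file, opening it in mode "a" gives the append-per-run behavior described in the table.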

Usage Examples

# Command-line usage: benchmark OPT-30B via Petals with 4 concurrent clients
python benchmark/petals/run_opt_requests.py \
    --initial_peers /ip4/192.168.1.10/tcp/31337/p2p/QmXXXX \
    --prefix facebook/opt-30b \
    --batch-size 1 \
    --num-micro-batches 3 \
    --num-processes 4 \
    --output results/petals_opt30b.tsv

# Benchmark OPT-175B (default prefix)
python benchmark/petals/run_opt_requests.py \
    --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/QmYYYY \
    --num-processes 2 \
    --output results/petals_opt175b.tsv
