Implementation:FMInference FlexLLMGen Petals Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Decentralized Inference, LLM Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Multi-process benchmark harness for measuring throughput and latency of Petals decentralized inference, using OPT model configurations mapped onto the Petals BLOOM-based client.
Description
The run_opt_requests module benchmarks the Petals decentralized inference framework by spawning multiple client processes that concurrently issue generation requests against a Petals swarm. Because Petals natively uses the BLOOM architecture, the module includes a _patch_bloom_config() helper that maps OPT model dimensions (hidden_size, num_attention_heads, num_hidden_layers, vocab_size) onto a DistributedBloomConfig, so that OPT-equivalent workloads can be benchmarked.
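The dimension mapping can be illustrated with a minimal sketch. It is not the module's _patch_bloom_config(): the StandInBloomConfig class, the OPT_DIMS table, and the patch_config name are hypothetical, and the real helper may obtain the dimensions differently. The numeric values are the published OPT-30B and OPT-175B dimensions.

```python
from dataclasses import dataclass

# Published OPT dimensions; the vocabulary size is shared by both models.
OPT_DIMS = {
    "facebook/opt-30b": {
        "hidden_size": 7168, "num_attention_heads": 56, "num_hidden_layers": 48,
    },
    "facebook/opt-175b": {
        "hidden_size": 12288, "num_attention_heads": 96, "num_hidden_layers": 96,
    },
}
OPT_VOCAB_SIZE = 50272


@dataclass
class StandInBloomConfig:
    """Hypothetical stand-in for DistributedBloomConfig (not the Petals class)."""
    hidden_size: int = 0
    num_attention_heads: int = 0
    num_hidden_layers: int = 0
    vocab_size: int = 0


def patch_config(config, model_name):
    """Overwrite the config's dimensions with those of the named OPT model."""
    for field_name, value in OPT_DIMS[model_name].items():
        setattr(config, field_name, value)
    config.vocab_size = OPT_VOCAB_SIZE
    return config
```

With the dimensions overwritten this way, a BLOOM-shaped model has the same layer count and matrix sizes as the OPT model. The benchmark therefore measures an equivalent compute and communication load even though the weights are not OPT weights.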
Key components:
- client_process() is the worker function spawned as a separate process. Each worker creates a DistributedBloomForCausalLM model instance connected to the Petals DHT swarm, performs a single-token warmup generation, then synchronizes with other workers via Event objects before executing num_micro_batches rounds of generation. Each round's latency is reported back via a Queue.
- run_bench() orchestrates a single benchmark configuration: it spawns num_processes worker processes, waits for all to complete warmup, fires the start signal, and then joins all processes. After completion, it collects latency measurements from the queue, computes total throughput (tokens per second) and mean latency, and appends the results to a tab-separated output file.
- main() parses CLI arguments (initial DHT peers, model prefix, batch size, number of micro-batches, number of processes, and output file), configures the BLOOM config with OPT-equivalent dimensions, and runs benchmarks across sequence lengths 256, 512, and 1024. For the OPT-30B model, it additionally runs a sweep over max_tokens from 0 to 32.
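The warmup and start handshake described in the list above can be sketched with standard multiprocessing primitives. This is an illustrative sketch, not the module's code: time.sleep stands in for the Petals generation calls, the worker and run_workers names are hypothetical, one finished_warmup event per worker is an assumption, and the "fork" start method is chosen only so the snippet runs on Unix without a __main__ guard.

```python
import multiprocessing as mp
import time


def worker(finished_warmup, can_start, num_micro_batches, queue):
    time.sleep(0.01)        # stand-in for the single-token warmup generation
    finished_warmup.set()   # tell the orchestrator this worker is ready
    can_start.wait()        # block until the orchestrator fires the start signal
    for _ in range(num_micro_batches):
        start = time.perf_counter()
        time.sleep(0.01)    # stand-in for one timed generation call
        queue.put(time.perf_counter() - start)  # report this round's latency


def run_workers(num_processes=2, num_micro_batches=3):
    ctx = mp.get_context("fork")  # assumption: Unix-only start method
    can_start = ctx.Event()
    queue = ctx.Queue()
    warmups = [ctx.Event() for _ in range(num_processes)]
    procs = [
        ctx.Process(target=worker, args=(ev, can_start, num_micro_batches, queue))
        for ev in warmups
    ]
    for proc in procs:
        proc.start()
    for ev in warmups:
        ev.wait()           # wait until every worker has finished warmup
    can_start.set()         # release all workers at the same moment
    # Drain the queue before joining so no worker blocks on a full pipe.
    expected = num_processes * num_micro_batches
    latencies = [queue.get(timeout=30) for _ in range(expected)]
    for proc in procs:
        proc.join()
    return latencies
```

Holding every worker at can_start.wait() keeps model loading and warmup out of the timed region, so the measured rounds overlap across workers. The sketch drains the queue before joining, which differs from the order described above but is the safer pattern when many items are queued.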
Usage
Run this module as a script to benchmark Petals inference. It requires a running Petals swarm with peers serving the target model. The results are used to compare decentralized inference throughput/latency against FlexLLMGen's centralized offloading approach.
Code Reference
Source Location
- Repository: FMInference_FlexLLMGen
- File: benchmark/petals/run_opt_requests.py
- Lines: 1-133
Signature
def client_process(
    finished_warmup, can_start, config_bloom, num_micro_batches,
    batch_size, sequence_length, max_tokens, process_index, queue
) -> None:
    ...

def run_bench(args, sequence_length, max_tokens, config_bloom):
    ...

def main():
    ...
Import
from benchmark.petals.run_opt_requests import client_process, run_bench, main
I/O Contract
Inputs (client_process)
| Name | Type | Required | Description |
|---|---|---|---|
| finished_warmup | multiprocessing.Event | Yes | Event that this worker sets after completing warmup generation. |
| can_start | multiprocessing.Event | Yes | Event that workers wait on before starting the timed benchmark runs. |
| config_bloom | DistributedBloomConfig | Yes | Petals model configuration with DHT peer information and model dimensions. |
| num_micro_batches | int | Yes | Number of generation rounds to execute per worker. |
| batch_size | int | Yes | Number of prompts per generation call. |
| sequence_length | int | Yes | Length of the random input sequence in tokens. |
| max_tokens | int | Yes | Number of new tokens to generate per call. |
| process_index | int | Yes | Index of this worker process (used for CUDA device assignment). |
| queue | multiprocessing.Queue | Yes | Queue for reporting per-round latency measurements back to the orchestrator. |
Inputs (run_bench)
| Name | Type | Required | Description |
|---|---|---|---|
| args | argparse.Namespace | Yes | Parsed CLI arguments containing batch_size, num_micro_batches, num_processes, and output file path. |
| sequence_length | int | Yes | Input sequence length for this benchmark run. |
| max_tokens | int | Yes | Number of tokens to generate. |
| config_bloom | DistributedBloomConfig | Yes | Petals model configuration. |
Inputs (main CLI arguments)
| Name | Type | Required | Description |
|---|---|---|---|
| --initial_peers | list of str | No | Multiaddrs of Petals DHT peers (e.g., /ip4/203.0.113.1/tcp/31337/p2p/XXXX). |
| --prefix | str | No | Model prefix/identifier (default: "facebook/opt-175b"). |
| --batch-size | int | No | Batch size per worker (default: 1). |
| --num-micro-batches | int | No | Rounds per worker (default: 1). |
| --num-processes | int | No | Number of concurrent client processes (default: 1). |
| --output | str | Yes | Path to the output file for tab-separated results. |
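The argument table above corresponds roughly to the following argparse setup. This is a sketch reconstructed from the table rather than copied from the script: the build_parser name is hypothetical, and nargs="*" with an empty default for --initial_peers is an assumption, since the table only states "list of str".

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Petals benchmark CLI (sketch)")
    # Assumption: zero or more multiaddrs; the table only says "list of str".
    parser.add_argument("--initial_peers", type=str, nargs="*", default=[])
    parser.add_argument("--prefix", type=str, default="facebook/opt-175b")
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--num-micro-batches", type=int, default=1)
    parser.add_argument("--num-processes", type=int, default=1)
    parser.add_argument("--output", type=str, required=True)
    return parser
```

argparse converts hyphens in option names to underscores, so the values are read as args.batch_size, args.num_micro_batches, and args.num_processes.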
Outputs
| Name | Type | Description |
|---|---|---|
| output file | TSV file | Tab-separated file with columns: batch_size, num_micro_batches, num_processes, sequence_length, max_tokens, throughput (tokens/s), mean_latency (seconds). Results are appended for each benchmark run. |
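One way to produce a row in this format is sketched below. The exact throughput formula used by run_bench() is not stated here, so the sketch assumes throughput is the number of generated tokens across all workers divided by the wall-clock time of the timed region. The summarize and append_result names are hypothetical.

```python
import csv
from statistics import mean


def summarize(latencies, batch_size, num_micro_batches, num_processes,
              max_tokens, elapsed):
    """Return (throughput in tokens/s, mean latency in seconds)."""
    # Assumption: only newly generated tokens count toward throughput.
    total_tokens = batch_size * num_micro_batches * num_processes * max_tokens
    return total_tokens / elapsed, mean(latencies)


def append_result(path, row):
    """Append one tab-separated row to the results file."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t", lineterminator="\n").writerow(row)
```

Opening the file in append mode means successive runs, for example the three sequence lengths, accumulate in one file without overwriting earlier rows.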
Usage Examples
# Command-line usage: benchmark OPT-30B via Petals with 4 concurrent clients
# python benchmark/petals/run_opt_requests.py \
# --initial_peers /ip4/192.168.1.10/tcp/31337/p2p/QmXXXX \
# --prefix facebook/opt-30b \
# --batch-size 1 \
# --num-micro-batches 3 \
# --num-processes 4 \
# --output results/petals_opt30b.tsv
# Benchmark OPT-175B (default prefix)
# python benchmark/petals/run_opt_requests.py \
# --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/QmYYYY \
# --num-processes 2 \
# --output results/petals_opt175b.tsv