Implementation:FMInference FlexLLMGen Petals Benchmark

From Leeroopedia
Revision as of 14:56, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FMInference_FlexLLMGen_Petals_Benchmark.md)


Knowledge Sources
Domains: Benchmarking, Decentralized Inference, LLM Inference
Last Updated: 2026-02-09 12:00 GMT

Overview

Multi-process benchmark harness for measuring throughput and latency of Petals decentralized inference, using OPT model configurations mapped onto the Petals BLOOM-based client.

Description

The run_opt_requests module benchmarks the Petals decentralized inference framework by spawning multiple client processes that concurrently issue generation requests against a Petals swarm. Because Petals natively uses the BLOOM architecture, the module includes a _patch_bloom_config() helper that maps OPT model dimensions (hidden_size, num_attention_heads, num_hidden_layers, vocab_size) onto a DistributedBloomConfig, so that the benchmark exercises a workload with the same shape as the corresponding OPT model.
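The dimension mapping described above can be sketched as follows. This is a minimal illustration, not the module's actual helper: StandInBloomConfig is a hypothetical stand-in for DistributedBloomConfig, the BLOOM-style attribute names (n_head, n_layer) are an assumption, and the OPT-30B numbers are illustrative values that should be checked against the published model configuration.

```python
from dataclasses import dataclass


@dataclass
class StandInBloomConfig:
    """Hypothetical stand-in for DistributedBloomConfig (not the Petals class)."""
    hidden_size: int = 14336
    n_head: int = 112
    n_layer: int = 70
    vocab_size: int = 250880


# Illustrative OPT dimensions (assumption: verify against the real OPT config).
OPT_DIMS = {
    "facebook/opt-30b": {
        "hidden_size": 7168,
        "num_attention_heads": 56,
        "num_hidden_layers": 48,
        "vocab_size": 50272,
    },
}


def patch_config(config, dims):
    """Copy OPT-style dimensions onto a BLOOM-style config object."""
    config.hidden_size = dims["hidden_size"]
    config.n_head = dims["num_attention_heads"]
    config.n_layer = dims["num_hidden_layers"]
    config.vocab_size = dims["vocab_size"]
    return config


cfg = patch_config(StandInBloomConfig(), OPT_DIMS["facebook/opt-30b"])
print(cfg)
```

Only the tensor shapes matter for a throughput benchmark, which is why copying the dimensions is enough to approximate an OPT workload on a BLOOM-shaped client.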

Key components:

  • client_process() is the worker function spawned as a separate process. Each worker creates a DistributedBloomForCausalLM model instance connected to the Petals DHT swarm, performs a single-token warmup generation, then synchronizes with other workers via Event objects before executing num_micro_batches rounds of generation. Each round's latency is reported back via a Queue.
  • run_bench() orchestrates a single benchmark configuration: it spawns num_processes worker processes, waits for all to complete warmup, fires the start signal, and then joins all processes. After completion, it collects latency measurements from the queue, computes total throughput (tokens per second) and mean latency, and appends the results to a tab-separated output file.
  • main() parses the CLI arguments (initial DHT peers, model prefix, batch size, number of micro-batches, number of processes, and output file), fills the BLOOM configuration with OPT-equivalent dimensions, and runs benchmarks at sequence lengths 256, 512, and 1024. For the OPT-30B model, it additionally sweeps max_tokens from 0 to 32.
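The warmup-then-start handshake between the workers and the orchestrator can be sketched as below. This is an illustration under stated assumptions rather than the module's code: threads stand in for processes so the sketch runs anywhere (multiprocessing.Event and multiprocessing.Queue expose the same set/wait/put/get calls), the sleep calls stand in for Petals generation, and the names worker and run_workers are hypothetical.

```python
import queue
import threading
import time


def worker(finished_warmup, can_start, num_micro_batches, index, results):
    time.sleep(0.01)           # stand-in for the single-token warmup generation
    finished_warmup.set()      # tell the orchestrator this worker is ready
    can_start.wait()           # block until every worker has warmed up
    for _ in range(num_micro_batches):
        start = time.perf_counter()
        time.sleep(0.01)       # stand-in for one timed generation round
        results.put((index, time.perf_counter() - start))


def run_workers(num_workers, num_micro_batches):
    can_start = threading.Event()
    results = queue.Queue()
    warmups = [threading.Event() for _ in range(num_workers)]
    threads = [
        threading.Thread(
            target=worker,
            args=(warmups[i], can_start, num_micro_batches, i, results),
        )
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for event in warmups:      # wait for every warmup to finish
        event.wait()
    can_start.set()            # fire the shared start signal
    for t in threads:
        t.join()
    latencies = []
    while not results.empty():  # safe here: all producers have exited
        latencies.append(results.get()[1])
    return latencies


latencies = run_workers(num_workers=3, num_micro_batches=2)
print(len(latencies), "latency samples")
```

Keeping warmup outside the timed region means one-time setup costs, such as connecting to the swarm, do not inflate the reported latencies.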

Usage

Run this module as a script to benchmark Petals inference. It requires a running Petals swarm with peers serving the target model. The results are used to compare decentralized inference throughput/latency against FlexLLMGen's centralized offloading approach.

Code Reference

Source Location

Signature

def client_process(
    finished_warmup, can_start, config_bloom, num_micro_batches,
    batch_size, sequence_length, max_tokens, process_index, queue
) -> None:
    ...

def run_bench(args, sequence_length, max_tokens, config_bloom):
    ...

def main():
    ...

Import

from benchmark.petals.run_opt_requests import client_process, run_bench, main

I/O Contract

Inputs (client_process)

Name | Type | Required | Description
finished_warmup | multiprocessing.Event | Yes | Event that this worker sets after completing warmup generation.
can_start | multiprocessing.Event | Yes | Event that workers wait on before starting the timed benchmark runs.
config_bloom | DistributedBloomConfig | Yes | Petals model configuration with DHT peer information and model dimensions.
num_micro_batches | int | Yes | Number of generation rounds to execute per worker.
batch_size | int | Yes | Number of prompts per generation call.
sequence_length | int | Yes | Length of the random input sequence in tokens.
max_tokens | int | Yes | Number of new tokens to generate per call.
process_index | int | Yes | Index of this worker process (used for CUDA device assignment).
queue | multiprocessing.Queue | Yes | Queue for reporting per-round latency measurements back to the orchestrator.

Inputs (run_bench)

Name | Type | Required | Description
args | argparse.Namespace | Yes | Parsed CLI arguments containing batch_size, num_micro_batches, num_processes, and output file path.
sequence_length | int | Yes | Input sequence length for this benchmark run.
max_tokens | int | Yes | Number of tokens to generate.
config_bloom | DistributedBloomConfig | Yes | Petals model configuration.

Inputs (main CLI arguments)

Name | Type | Required | Description
--initial_peers | list of str | No | Multiaddrs of Petals DHT peers (e.g., /ip4/203.0.113.1/tcp/31337/p2p/XXXX).
--prefix | str | No | Model prefix/identifier (default: "facebook/opt-175b").
--batch-size | int | No | Batch size per worker (default: 1).
--num-micro-batches | int | No | Rounds per worker (default: 1).
--num-processes | int | No | Number of concurrent client processes (default: 1).
--output | str | Yes | Path to the output file for tab-separated results.
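A parser matching the arguments listed above might look like the sketch below. The flag names, types, and defaults come from the table; the use of nargs="*" for --initial_peers, the empty-list default, and the build_parser name are assumptions, since the actual argparse setup is not shown here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Petals benchmark CLI (sketch)")
    # Assumption: peers are passed as zero or more space-separated multiaddrs.
    parser.add_argument("--initial_peers", type=str, nargs="*", default=[])
    parser.add_argument("--prefix", type=str, default="facebook/opt-175b")
    parser.add_argument("--batch-size", type=int, default=1)
    parser.add_argument("--num-micro-batches", type=int, default=1)
    parser.add_argument("--num-processes", type=int, default=1)
    parser.add_argument("--output", type=str, required=True)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(
    ["--prefix", "facebook/opt-30b", "--num-processes", "4", "--output", "results.tsv"]
)
print(args.prefix, args.batch_size, args.num_processes, args.output)
```

argparse converts hyphens in option names to underscores, so --batch-size is read back as args.batch_size, which matches the attribute names that run_bench expects.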

Outputs

Name | Type | Description
output file | TSV file | Tab-separated file with columns: batch_size, num_micro_batches, num_processes, sequence_length, max_tokens, throughput (tokens/s), mean_latency (seconds). Results are appended for each benchmark run.
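How one result row could be computed and appended is sketched below. The exact formula used by run_bench is not shown on this page, so the sketch assumes throughput is the number of generated tokens (rounds × batch_size × max_tokens) divided by wall-clock time; summarize, append_row, and the sample numbers are hypothetical.

```python
import csv
import os
import tempfile


def summarize(latencies, batch_size, max_tokens, wall_time):
    """Return (throughput in tokens/s, mean latency in s) for one run."""
    # Assumption: each latency sample is one round of batch_size * max_tokens tokens.
    total_tokens = len(latencies) * batch_size * max_tokens
    throughput = total_tokens / wall_time
    mean_latency = sum(latencies) / len(latencies)
    return throughput, mean_latency


def append_row(path, row):
    """Append one tab-separated row, keeping rows from earlier runs."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(row)


# Made-up samples: 2 processes x 2 micro-batches, each round taking 2 seconds.
latencies = [2.0, 2.0, 2.0, 2.0]
throughput, mean_latency = summarize(latencies, batch_size=1, max_tokens=32, wall_time=4.0)

out_path = os.path.join(tempfile.mkdtemp(), "petals_sketch.tsv")
# Column order follows the table: batch_size, num_micro_batches, num_processes,
# sequence_length, max_tokens, throughput, mean_latency.
append_row(out_path, [1, 2, 2, 256, 32, throughput, mean_latency])
with open(out_path) as f:
    print(f.read().strip())
```

Opening the file in append mode lets successive configurations accumulate in one file, which matches the behavior described for the output.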

Usage Examples

# Command-line usage: benchmark OPT-30B via Petals with 4 concurrent clients
python benchmark/petals/run_opt_requests.py \
    --initial_peers /ip4/192.168.1.10/tcp/31337/p2p/QmXXXX \
    --prefix facebook/opt-30b \
    --batch-size 1 \
    --num-micro-batches 3 \
    --num-processes 4 \
    --output results/petals_opt30b.tsv

# Benchmark OPT-175B (default prefix)
python benchmark/petals/run_opt_requests.py \
    --initial_peers /ip4/10.0.0.1/tcp/31337/p2p/QmYYYY \
    --num-processes 2 \
    --output results/petals_opt175b.tsv
