Principle:FMInference FlexLLMGen Decentralized Inference Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Decentralized Inference, Performance Evaluation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Measuring decentralized inference performance requires synchronized multi-client workload generation with coordinated warmup, barrier-based start alignment, and per-request latency collection.
Description
Decentralized inference systems distribute model layers across a network of volunteer or dedicated peers, with clients locating the peers that serve each layer through a distributed hash table (DHT). Benchmarking such systems raises challenges that centralized inference does not:
1. Client-Side Concurrency: Aggregate throughput in a decentralized system depends on how many clients issue requests concurrently. A meaningful benchmark therefore spawns multiple client processes, each independently issuing generation requests at a realistic concurrency level, and reports total throughput as the aggregate across all clients.
2. Warmup-Barrier Synchronization: Each client process must connect to the DHT, discover peers, and establish inference sessions before the first request. This warmup phase has highly variable latency depending on network conditions and peer availability. To ensure all clients start their timed runs simultaneously, a two-phase synchronization is used:
- Phase 1 (Warmup): Each client performs a single-token generation to establish connections and warm JIT caches. It signals completion via an Event.
- Phase 2 (Barrier): The orchestrator waits for all warmup events, then fires a shared start Event. All clients begin timed generation simultaneously.
3. Architecture Adaptation: When the target model uses a different architecture than the decentralized framework natively supports, model configuration must be adapted. For example, mapping OPT model dimensions onto BLOOM configuration objects allows benchmarking OPT-equivalent workloads on BLOOM-based infrastructure. This adaptation preserves the computational profile (hidden size, number of attention heads, number of layers, vocabulary size) while using the target framework's serving infrastructure.
4. Latency Collection via Queues: Each client process measures per-request wall-clock latency and pushes it to a shared inter-process queue. The orchestrator drains the queue after all processes complete, enabling both mean latency and aggregate throughput computation without per-process file I/O.
5. Multi-Dimensional Sweep: A thorough evaluation varies both the input sequence (prompt) length and the generation length to characterize the system's prefill-vs-decode performance tradeoff. Longer input sequences stress the prefill path (network bandwidth for transferring hidden states through many peers), while longer generation lengths stress the decode path (round-trip latency per token across the peer chain).
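The warmup-barrier synchronization (item 2) and queue-based latency collection (item 4) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the benchmark script itself: `fake_generate` is a hypothetical stand-in for a real generation call, and `client`, `run_benchmark`, and all parameter names are invented for this example. The `use_threads` flag swaps in `multiprocessing.dummy` (thread-backed objects with the same interface) so the control flow can be tried without spawning processes.

```python
import multiprocessing as mp
import multiprocessing.dummy as mp_threads  # thread-backed Process/Event/Queue
import time


def fake_generate(num_tokens):
    """Hypothetical stand-in for a generation request sent to the peers."""
    time.sleep(0.001 * num_tokens)


def client(warmup_done, start, latencies, requests_per_client, gen_len):
    # Phase 1 (warmup): one single-token generation, excluded from timing.
    fake_generate(1)
    warmup_done.set()
    # Phase 2 (barrier): block until the orchestrator releases all clients.
    start.wait()
    for _ in range(requests_per_client):
        t0 = time.perf_counter()
        fake_generate(gen_len)
        latencies.put(time.perf_counter() - t0)  # per-request wall-clock latency


def run_benchmark(num_clients, requests_per_client, gen_len,
                  use_threads=False, timeout_s=60.0):
    # "spawn" is assumed here because CUDA state does not survive fork().
    ctx = mp_threads if use_threads else mp.get_context("spawn")
    start = ctx.Event()
    latencies = ctx.Queue()
    warmups = [ctx.Event() for _ in range(num_clients)]
    workers = [
        ctx.Process(target=client,
                    args=(done, start, latencies, requests_per_client, gen_len))
        for done in warmups
    ]
    for w in workers:
        w.start()
    for done in warmups:
        done.wait()              # wait for every client to finish warmup
    t0 = time.perf_counter()
    start.set()                  # release all clients at the same moment
    # Read the expected number of results before join(): a process that has
    # put items on a multiprocessing.Queue may not exit until they are read.
    expected = num_clients * requests_per_client
    results = [latencies.get(timeout=timeout_s) for _ in range(expected)]
    wall_clock_s = time.perf_counter() - t0
    for w in workers:
        w.join()
    return results, wall_clock_s
```

When run with real processes, the call to `run_benchmark` should sit under an `if __name__ == "__main__":` guard, since the `spawn` start method re-imports the main module in each child. The sketch reads the queue before joining the workers rather than after; for a handful of small floats either order works, but reading first avoids a documented pitfall with larger payloads.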
Usage
Apply this principle when benchmarking any decentralized or federated inference system where multiple clients share a pool of distributed model-serving peers. The coordinated warmup, barrier synchronization, and queue-based latency collection pattern generalizes beyond any specific framework.
Theoretical Basis
Throughput and Latency Definitions
For P client processes, each issuing M micro-batches (one request per micro-batch) of batch size B, and generating T tokens per sequence:
total_tokens = P * M * B * T
throughput = total_tokens / wall_clock_time
mean_latency = (1 / (P * M)) * sum(per_request_latencies)
Note that throughput uses wall-clock time (measuring aggregate system capacity), while mean_latency averages individual request durations (measuring per-request responsiveness). These are complementary metrics: a system can have high throughput but high latency if it processes many requests in parallel, or low latency but low throughput if it has limited concurrency.
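The definitions above translate directly into a small helper. The following is a sketch only: the function name `summarize` and its parameter names are assumptions made for this example, and the input numbers are illustrative rather than measured.

```python
def summarize(latencies, num_clients, micro_batches, batch_size, gen_len,
              wall_clock_s):
    """Compute total_tokens, throughput, and mean_latency as defined above."""
    expected = num_clients * micro_batches          # P * M requests in total
    if len(latencies) != expected:
        raise ValueError(f"expected {expected} latencies, got {len(latencies)}")
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    total_tokens = num_clients * micro_batches * batch_size * gen_len  # P*M*B*T
    return {
        "total_tokens": total_tokens,
        "throughput": total_tokens / wall_clock_s,  # tokens per second
        "mean_latency": sum(latencies) / expected,  # seconds per request
    }


# Illustrative numbers: 4 clients, 2 requests each, every request takes 2 s,
# so the timed window lasts about 4 s.
stats = summarize([2.0] * 8, num_clients=4, micro_batches=2,
                  batch_size=8, gen_len=32, wall_clock_s=4.0)
print(stats)
```

In this example each request takes 2 s, yet the aggregate rate is 512 tokens per second because four clients run in parallel, which is the high-throughput, high-latency case described above.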
Warmup Necessity
The first inference request in a decentralized system incurs one-time costs:
- DHT discovery: Finding peers that serve each layer of the model.
- Session establishment: Opening gRPC/TCP connections to each peer in the inference chain.
- CUDA warmup: JIT compilation of kernels on both client and server GPUs.
These costs can be 10-100x the steady-state per-request latency. Excluding them from timed measurements is critical for representative results.
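A short worked example shows how much a single cold request can distort the mean. The numbers are assumptions chosen for illustration: a steady-state latency of 0.5 s and a first request that costs 50x as much, a factor picked from within the range quoted above.

```python
def mean(values):
    return sum(values) / len(values)


steady_s = 0.5                  # assumed steady-state latency per request
cold_first_s = 50 * steady_s    # assumed one-time cost on the first request

# Ten timed requests with and without the cold start inside the timed window.
without_warmup_phase = [cold_first_s] + [steady_s] * 9
with_warmup_phase = [steady_s] * 10

mean_cold = mean(without_warmup_phase)
mean_warm = mean(with_warmup_phase)
print(f"cold start included: {mean_cold:.2f} s, excluded: {mean_warm:.2f} s")
```

Under these assumptions the reported mean latency is inflated by nearly 6x when the cold request is timed, even though nine of the ten requests ran at steady state.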
Configuration Mapping
When adapting model configurations across architectures, the key equivalences are:
| Source Architecture (OPT) | Target Architecture (BLOOM) | Semantic Meaning |
|---|---|---|
| hidden_size | hidden_size | Dimension of hidden state vectors |
| num_attention_heads | n_head | Number of attention heads per layer |
| num_hidden_layers | n_layer | Number of transformer layers |
| vocab_size | vocab_size | Size of the token vocabulary |
These four parameters dominate the computational cost per layer and the total memory footprint, which makes them sufficient for workload-equivalent benchmarking even when architectural details (e.g., attention bias, layer norm placement) differ.
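The table can be expressed as a key-renaming step. The sketch below works on plain dictionaries so that it stays independent of any particular library; `OPT_TO_BLOOM` and `opt_to_bloom_kwargs` are hypothetical names, and the input values are illustrative. Passing the result to a BLOOM-style configuration constructor is assumed to be the next step and is not shown.

```python
# Field names taken from the table above: OPT name -> BLOOM name.
OPT_TO_BLOOM = {
    "hidden_size": "hidden_size",
    "num_attention_heads": "n_head",
    "num_hidden_layers": "n_layer",
    "vocab_size": "vocab_size",
}


def opt_to_bloom_kwargs(opt_config):
    """Rename the four workload-defining fields; all other fields are dropped."""
    missing = [key for key in OPT_TO_BLOOM if key not in opt_config]
    if missing:
        raise KeyError(f"OPT config is missing fields: {missing}")
    bloom = {dst: opt_config[src] for src, dst in OPT_TO_BLOOM.items()}
    # Sanity check: the hidden state must split evenly across attention heads.
    if bloom["hidden_size"] % bloom["n_head"] != 0:
        raise ValueError("hidden_size must be divisible by n_head")
    return bloom


# Illustrative OPT-style dimensions (not read from a real checkpoint).
opt_like = {
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "vocab_size": 50272,
    "ffn_dim": 16384,  # has no counterpart in the table, so it is not mapped
}
bloom_kwargs = opt_to_bloom_kwargs(opt_like)
print(bloom_kwargs)
```

Fields without an entry in the table, such as the feed-forward width in this example, are silently dropped, so the mapping only reproduces the workload if the target architecture derives them in a compatible way.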