Principle:FMInference FlexLLMGen Baseline Benchmark Orchestration
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, LLM Inference, Performance Evaluation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Systematic comparison against established inference frameworks requires structured test suites that vary model size, sequence length, and offloading strategy across a controlled grid of configurations.
Description
When evaluating a novel inference system, it is essential to establish fair baselines against existing frameworks under identical hardware and workload conditions. The principle of baseline benchmark orchestration addresses this by defining a structured approach to benchmark design:
1. Configuration as Data: Each benchmark run is encoded as a typed record (a dataclass or similar structure) containing all parameters: model identifier, inference library, prompt length, generation length, batch size, and device placement. This makes benchmark configurations inspectable, serializable, and composable.
2. Suite Composition: Individual configurations are grouped into named suites that vary along controlled dimensions. A well-designed suite grid covers:
- Model scale axis: Small (6.7B), medium (30B), and large (175B) models to show scaling behavior.
- Sequence length axis: Short (256), medium (512), and long (1024) prompts to test memory and compute scaling.
- Framework axis: Multiple inference libraries (e.g., native HuggingFace vs. DeepSpeed) to isolate framework overhead.
3. Latency Projection: For very large models where full generation is prohibitively expensive, a truncated generation length is used and the measured per-token decode latency is extrapolated linearly to the target length. This technique produces approximate but directionally correct comparisons without requiring hours per data point.
4. Device Placement Strategy: The benchmark grid includes runs with weights and caches placed on GPU, CPU, and disk. This tests the full spectrum of offloading scenarios, matching the diverse deployment environments that large model inference systems must support.
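The first two points above can be sketched as a small Python module. This is a minimal illustration, not the project's actual code: the class name `BenchmarkCase`, the field names, the suite name, the model labels, and the batch sizes are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark run, fully described by its fields (hypothetical schema)."""
    model: str        # model identifier, e.g. a 6.7B-parameter checkpoint
    library: str      # inference library under test
    prompt_len: int   # number of prompt tokens
    gen_len: int      # number of generated tokens
    batch_size: int   # sequences per batch
    placement: str    # where weights and cache live: "gpu", "cpu", or "disk"


# Named suites group cases that vary along one controlled dimension.
# All values below are illustrative placeholders, not measured settings.
SUITES = {
    "small_model_frameworks": [
        BenchmarkCase("model-6.7b", "huggingface", 256, 32, 4, "gpu"),
        BenchmarkCase("model-6.7b", "deepspeed", 256, 32, 4, "gpu"),
    ],
}


def suite_to_json(name: str) -> str:
    """Serialize a suite so the exact configuration can be logged or diffed."""
    return json.dumps([asdict(case) for case in SUITES[name]], indent=2)


if __name__ == "__main__":
    print(suite_to_json("small_model_frameworks"))
```

Making the record frozen keeps a configuration from being mutated between definition and execution, which is one simple way to keep logged configurations trustworthy.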
Usage
Apply this principle when designing benchmark suites for any inference system that needs to be compared against existing baselines. The structured grid approach makes runs reproducible and gives systematic coverage of the model-size, sequence-length, and framework dimensions.
Theoretical Basis
Controlled Variable Design
A rigorous benchmark comparison requires holding all variables constant except the one being evaluated. The grid structure achieves this:
    For each model_size in {6.7B, 30B, 175B}:
        For each seq_len in {256, 512, 1024}:
            For each framework in {HuggingFace, DeepSpeed}:
                run_benchmark(model_size, seq_len, framework)
This yields 3 x 3 x 2 = 18 configurations per full sweep. By holding two dimensions fixed and varying the third, we can isolate the effect of each factor on throughput and latency.
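As a rough sketch of how the grid and a one-factor slice of it might be enumerated, the following Python uses `itertools.product`. The helper names `build_grid` and `slice_grid` and the dictionary keys are assumptions made for illustration only.

```python
from itertools import product

MODEL_SIZES = ["6.7B", "30B", "175B"]
SEQ_LENS = [256, 512, 1024]
FRAMEWORKS = ["HuggingFace", "DeepSpeed"]


def build_grid():
    """Enumerate every (model_size, seq_len, framework) combination."""
    return [
        {"model_size": m, "seq_len": s, "framework": f}
        for m, s, f in product(MODEL_SIZES, SEQ_LENS, FRAMEWORKS)
    ]


def slice_grid(grid, **fixed):
    """Keep only the configurations that match every fixed dimension."""
    return [
        cfg for cfg in grid
        if all(cfg[key] == value for key, value in fixed.items())
    ]


grid = build_grid()

# Hold model size and framework fixed; only sequence length varies.
seq_len_sweep = slice_grid(grid, model_size="30B", framework="DeepSpeed")

if __name__ == "__main__":
    print(len(grid))
    for cfg in seq_len_sweep:
        print(cfg)
```

Filtering the full grid, rather than writing each sweep by hand, keeps every one-factor comparison a subset of the same 18 configurations.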
Latency Projection for Large Models
When full generation is too slow for practical benchmarking, a truncated run of k tokens (where k << gen_len) is executed. The per-token decode latency is measured and projected:
projected_decode_latency = measured_per_token_latency * (gen_len - 1)
total_latency = prefill_latency + projected_decode_latency
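This approximation is valid when per-token latency is roughly constant after the prefill phase. The KV cache grows by a fixed amount per generated token, so attention cost per step rises only gradually; the projection is most reliable when gen_len is small relative to the prompt length or when weight loading and I/O dominate each decode step, and it tends to underestimate latency otherwise.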
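The projection can be written as a short function. This is a sketch under the assumption that the first generated token is produced by the prefill pass, so a run of k tokens contains k - 1 decode steps; the function and parameter names are hypothetical.

```python
def project_total_latency(
    prefill_latency: float,
    truncated_decode_latency: float,
    truncated_gen_len: int,
    target_gen_len: int,
) -> float:
    """Extrapolate total latency from a truncated generation run.

    truncated_decode_latency is the time spent on decode steps only
    (prefill excluded) while generating truncated_gen_len tokens.
    """
    if truncated_gen_len < 2:
        raise ValueError("need at least one decode step to measure per-token latency")
    if target_gen_len < truncated_gen_len:
        raise ValueError("target length must not be shorter than the truncated run")

    # The first token comes from prefill, so k tokens mean k - 1 decode steps.
    per_token_latency = truncated_decode_latency / (truncated_gen_len - 1)
    projected_decode_latency = per_token_latency * (target_gen_len - 1)
    return prefill_latency + projected_decode_latency


if __name__ == "__main__":
    # Illustrative numbers only: 2 s prefill, 4 s for the 4 decode steps of a 5-token run.
    print(project_total_latency(2.0, 4.0, truncated_gen_len=5, target_gen_len=32))
```

With the illustrative inputs above, the per-token latency is 1 s, the projected decode latency is 31 s, and the projected total is 33 s.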
Batch Size Selection
The predefined suites use different batch sizes depending on model size and device placement. Larger models with disk offloading use smaller batch sizes (1-2) because I/O latency dominates, while smaller models on GPU can use larger batches (up to 32) to maximize GPU utilization. The batch size is chosen to approximately maximize throughput given the memory constraints of each configuration.
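One way to approximate this selection is to try a small set of candidate batch sizes and keep the fastest one that fits in memory. The sketch below is an assumption about how such a search could look, not a description of how the predefined suites were tuned; `fits_in_memory` and `measure_throughput` are hypothetical callbacks supplied by the caller.

```python
from typing import Callable, Iterable


def pick_batch_size(
    candidates: Iterable[int],
    fits_in_memory: Callable[[int], bool],
    measure_throughput: Callable[[int], float],
) -> int:
    """Return the candidate batch size with the highest measured throughput."""
    best_size = None
    best_throughput = float("-inf")
    for batch_size in sorted(candidates):
        if not fits_in_memory(batch_size):
            continue  # skip configurations that would run out of memory
        throughput = measure_throughput(batch_size)
        if throughput > best_throughput:
            best_size, best_throughput = batch_size, throughput
    if best_size is None:
        raise ValueError("no candidate batch size fits in memory")
    return best_size


if __name__ == "__main__":
    # Toy cost model (illustrative only): fixed 1.0 s overhead plus 0.1 s per sequence,
    # with memory for at most 16 sequences.
    chosen = pick_batch_size(
        [1, 2, 4, 8, 16, 32],
        fits_in_memory=lambda b: b <= 16,
        measure_throughput=lambda b: b / (1.0 + 0.1 * b),
    )
    print(chosen)
```

In the toy model, throughput rises with batch size, so the search settles on the largest batch that fits. A real measurement could peak earlier, which is why the search compares measured throughput instead of simply taking the largest feasible size.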