Implementation:FMInference FlexLLMGen Bench HF Suite
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, LLM Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Benchmark harness that defines and runs test suites for HuggingFace and DeepSpeed baseline inference across OPT model sizes and sequence lengths.
Description
The bench_hf module provides a structured benchmark framework for evaluating HuggingFace (HF) and DeepSpeed (DS) inference baselines against FlexLLMGen. It defines a Case dataclass that encapsulates all parameters for a single benchmark run (model name, library, prompt length, generation length, batch size, device placement, and node/GPU counts).
The module provides two key functions and a set of predefined suites:
- run_huggingface() constructs and executes the shell command that launches a benchmark run. For DeepSpeed, it uses the deepspeed launcher; for plain HuggingFace, it uses python. The command is assembled with flags for model path, prompt/generation lengths, batch size, and optional CPU/disk offloading and dummy weights.
- bench_one_case() translates a Case instance into the parameters expected by run_huggingface(), handling device-to-flag mapping (gpu, cpu, disk) and library selection (hf vs. ds). Models larger than 6.7B use a truncated generation length (cut_gen_len=5), so only a few tokens are generated and the latency of the full run is projected from them.
- The 18 predefined suites cover three model sizes (6.7B, 30B, 175B), three sequence lengths (256, 512, 1024), and two libraries (HF, DS). They are aggregated into named collections in the suites dictionary for convenient batch execution from the command line.
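The mapping performed by bench_one_case() can be sketched roughly as follows. This is an illustrative sketch, not the module's actual code: case_to_kwargs and model_size_billions are hypothetical helpers, parsing the model size from the model name is an assumption, and so is the idea that each device value sets exactly one offloading flag.

```python
import re
from dataclasses import dataclass


@dataclass
class Case:
    # Mirrors the Case fields documented on this page.
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1


def model_size_billions(model):
    """Hypothetical helper: read the size from a name such as 'facebook/opt-30b'."""
    match = re.search(r"(\d+(?:\.\d+)?)b$", model.lower())
    if match is None:
        raise ValueError(f"cannot infer model size from {model!r}")
    return float(match.group(1))


def case_to_kwargs(case):
    """Hypothetical helper: turn a Case into keyword arguments for run_huggingface()."""
    if case.library not in ("hf", "ds"):
        raise ValueError(f"unknown library: {case.library!r}")
    if case.device not in ("gpu", "cpu", "disk"):
        raise ValueError(f"unknown device: {case.device!r}")
    # Models larger than 6.7B generate only 5 tokens; smaller ones run in full.
    cut_gen_len = 5 if model_size_billions(case.model) > 6.7 else None
    return dict(
        model=case.model,
        prompt_len=case.prompt_len,
        gen_len=case.gen_len,
        cut_gen_len=cut_gen_len,
        batch_size=case.batch_size,
        num_nodes=case.num_nodes,
        num_gpus_per_node=case.num_gpus_per_node,
        use_ds=(case.library == "ds"),
        cpu=(case.device == "cpu"),    # assumption: "cpu" sets only the CPU flag
        disk=(case.device == "disk"),  # assumption: "disk" sets only the disk flag
    )
```

Keeping the translation in one pure function makes the device and library handling easy to check without launching a benchmark.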
Usage
Run this module as a script with one or more suite names as arguments to execute a batch of baseline benchmarks. It is used to generate comparison data for evaluating FlexLLMGen's performance against standard HuggingFace and DeepSpeed inference.
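As a rough illustration of how 18 base suites could be folded into named collections and selected by name, consider the sketch below. Only the collection names hf_s512, 6b7, and 30b come from the usage examples on this page. The base-suite key format, the make_suite, collect, and select_cases helpers, the placeholder case values, and the use of the sequence length as the prompt length are all assumptions, and the real module may organize its suites differently.

```python
from itertools import product

# Short tags for the three OPT sizes (tag format is an assumption).
MODELS = {
    "6b7": "facebook/opt-6.7b",
    "30b": "facebook/opt-30b",
    "175b": "facebook/opt-175b",
}
SEQ_LENS = (256, 512, 1024)
LIBRARIES = ("hf", "ds")


def make_suite(library, model, seq_len):
    """Hypothetical: one placeholder case per suite, stored as a plain dict."""
    return [dict(model=model, library=library, prompt_len=seq_len,
                 gen_len=32, batch_size=2, device="gpu")]


# 2 libraries x 3 model sizes x 3 sequence lengths = 18 base suites.
base_suites = {
    f"{lib}_{tag}_s{seq}": make_suite(lib, MODELS[tag], seq)
    for lib, tag, seq in product(LIBRARIES, MODELS, SEQ_LENS)
}


def collect(predicate):
    """Flatten every base suite whose name satisfies the predicate."""
    return [case
            for name, suite in base_suites.items() if predicate(name)
            for case in suite]


# Named collections that a command-line argument could refer to.
suites = {
    "hf_s512": collect(lambda n: n.startswith("hf_") and n.endswith("_s512")),
    "6b7": collect(lambda n: "_6b7_" in n),
    "30b": collect(lambda n: "_30b_" in n),
}


def select_cases(suite_names):
    """Concatenate the cases of all requested collections, in order."""
    cases = []
    for name in suite_names:
        cases.extend(suites[name])
    return cases
```

In a script entry point, select_cases(sys.argv[1:]) would yield the cases to hand to bench_one_case() one at a time.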
Code Reference
Source Location
- Repository: FMInference_FlexLLMGen
- File: benchmark/hf_ds/bench_hf.py
- Lines: 1-164
Signature
@dataclass
class Case:
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1

def run_huggingface(model, prompt_len, gen_len, cut_gen_len, batch_size,
                    num_nodes, num_gpus_per_node,
                    use_ds, cpu, disk, dummy, log_file=None, pkl_file=None):
    ...

def bench_one_case(case):
    ...
Import
from benchmark.hf_ds.bench_hf import Case, run_huggingface, bench_one_case, suites
I/O Contract
Inputs (Case dataclass)
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model identifier (e.g., "facebook/opt-6.7b", "facebook/opt-30b", "facebook/opt-175b"). |
| library | str | Yes | Inference library to use: "hf" for HuggingFace or "ds" for DeepSpeed. |
| prompt_len | int | Yes | Length of the input prompt in tokens. |
| gen_len | int | Yes | Number of tokens to generate. |
| batch_size | int | Yes | Number of prompts per batch. |
| device | str | Yes | Device placement strategy: "gpu", "cpu", or "disk". |
| num_nodes | int | No | Number of nodes (default 1; must be 1 currently). |
| num_gpus_per_node | int | No | Number of GPUs per node (default 1). |
Inputs (run_huggingface)
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model identifier. |
| prompt_len | int | Yes | Prompt token length. |
| gen_len | int | Yes | Generation token length. |
| cut_gen_len | int or None | Yes | Truncated generation length for latency projection (None for full generation). |
| batch_size | int | Yes | Batch size. |
| num_nodes | int | Yes | Number of nodes (must be 1). |
| num_gpus_per_node | int | Yes | Number of GPUs per node. |
| use_ds | bool | Yes | Whether to use the DeepSpeed launcher. |
| cpu | bool | Yes | Enable CPU offloading. |
| disk | bool | Yes | Enable disk offloading. |
| dummy | bool | Yes | Use dummy (random) weights instead of real model weights. |
| log_file | str | No | Path to log file for results. |
| pkl_file | str | No | Path to pickle file for serialized results. |
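To make the role of each parameter concrete, here is a minimal sketch of how such a command string might be assembled. It is not the module's implementation: build_command is a hypothetical helper, the target script name hf_opt.py and every script flag (--model, --prompt-len, --cut-gen-len, and so on) are assumed names, and only the deepspeed launcher's --num_gpus option is taken from DeepSpeed itself. How the GPU count reaches a plain python run is not described here, so the sketch omits it.

```python
import shlex


def build_command(model, prompt_len, gen_len, cut_gen_len, batch_size,
                  num_nodes, num_gpus_per_node,
                  use_ds, cpu, disk, dummy, log_file=None, pkl_file=None,
                  script="hf_opt.py"):
    """Hypothetical helper: build the launch command as a single shell string."""
    if num_nodes != 1:
        raise ValueError("only single-node runs are supported")

    # Pick the launcher: DeepSpeed's CLI or the plain Python interpreter.
    if use_ds:
        argv = ["deepspeed", "--num_gpus", str(num_gpus_per_node), script]
    else:
        argv = ["python", script]

    # Flag names below are assumptions made for illustration.
    argv += ["--model", model,
             "--prompt-len", str(prompt_len),
             "--gen-len", str(gen_len),
             "--batch-size", str(batch_size)]

    # Optional settings are appended only when requested.
    if cut_gen_len is not None:
        argv += ["--cut-gen-len", str(cut_gen_len)]
    if cpu:
        argv.append("--cpu")
    if disk:
        argv.append("--disk")
    if dummy:
        argv.append("--dummy")
    if log_file is not None:
        argv += ["--log-file", log_file]
    if pkl_file is not None:
        argv += ["--pkl-file", pkl_file]

    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(argv)
```

Building an argument list first and joining it at the end avoids quoting mistakes when a path contains spaces.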
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Launches a subprocess running the HuggingFace/DeepSpeed benchmark via run_cmd(). Results are printed to stdout and optionally written to log/pickle files. |
Usage Examples
# Command-line usage: run all HuggingFace suites at sequence length 512
# python benchmark/hf_ds/bench_hf.py hf_s512
# Run a specific combination of suites
# python benchmark/hf_ds/bench_hf.py 6b7 30b
# Programmatic usage:
from benchmark.hf_ds.bench_hf import Case, bench_one_case
case = Case(
    model="facebook/opt-6.7b",
    library="hf",
    prompt_len=512,
    gen_len=32,
    batch_size=2,
    device="gpu",
)
bench_one_case(case)