Implementation:FMInference FlexLLMGen Bench HF Suite

Knowledge Sources	FMInference_FlexLLMGen
Domains	Benchmarking, LLM Inference
Last Updated	2026-02-09 12:00 GMT

Overview

Benchmark harness that defines named test suites and runs them against the HuggingFace and DeepSpeed inference baselines across OPT model sizes and sequence lengths.

Description

The bench_hf module provides a structured benchmark framework for evaluating HuggingFace (HF) and DeepSpeed (DS) inference baselines against FlexLLMGen. It defines a Case dataclass that encapsulates all parameters for a single benchmark run (model name, library, prompt length, generation length, batch size, device placement, and node/GPU counts).

The module has three key components, two functions and a set of predefined suites:

run_huggingface() constructs and executes the appropriate shell command to launch a benchmark run. For DeepSpeed, it uses the deepspeed launcher; for plain HuggingFace, it uses python. The command is assembled with flags for model path, prompt/generation lengths, batch size, and optional CPU/disk offloading and dummy weights.
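The command assembly described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: build_command is a hypothetical helper, and the script name hf_opt.py and the spelling of every flag after it are assumptions. Only the deepspeed-versus-python launcher choice and the kinds of options passed come from the description.

```python
import shlex


def build_command(model, prompt_len, gen_len, cut_gen_len, batch_size,
                  num_gpus_per_node, use_ds, cpu, disk, dummy,
                  script="hf_opt.py"):
    """Assemble an argv list for one baseline run.

    The script name and the flag spellings are assumptions made for
    illustration; check the real benchmark script before reusing them.
    """
    if use_ds:
        # DeepSpeed runs go through the `deepspeed` launcher.
        argv = ["deepspeed", "--num_gpus", str(num_gpus_per_node), script]
    else:
        # Plain HuggingFace runs call the Python interpreter directly.
        argv = ["python", script, "--num-gpus", str(num_gpus_per_node)]

    argv += [
        "--model", model,
        "--prompt-len", str(prompt_len),
        "--gen-len", str(gen_len),
        "--batch-size", str(batch_size),
    ]

    # Optional switches are appended only when requested.
    if cut_gen_len is not None:
        argv += ["--cut-gen-len", str(cut_gen_len)]
    if cpu:
        argv.append("--cpu")
    if disk:
        argv.append("--disk")
    if dummy:
        argv.append("--dummy")
    return argv


ds_cmd = build_command("facebook/opt-30b", 512, 32, 5, 8,
                       num_gpus_per_node=1, use_ds=True,
                       cpu=True, disk=False, dummy=True)
hf_cmd = build_command("facebook/opt-6.7b", 512, 32, None, 2,
                       num_gpus_per_node=1, use_ds=False,
                       cpu=False, disk=False, dummy=True)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(ds_cmd))
print(shlex.join(hf_cmd))
```

Building an argv list rather than a single formatted string avoids quoting problems if a model path ever contains spaces; the sketch makes that choice for safety, and the original module may do it differently.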

bench_one_case() translates a Case instance into the parameters expected by run_huggingface(), handling device-to-flag mapping (gpu, cpu, disk) and library selection (hf vs. ds). Models larger than 6.7B use a truncated generation length (cut_gen_len=5) for latency projection.
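A minimal sketch of that translation step is shown below. The helper names device_to_flags and choose_cut_gen_len are hypothetical, and two details are assumptions: that the cpu and disk flags are mutually exclusive, and that "larger than 6.7B" is decided by comparing the model identifier against "facebook/opt-6.7b".

```python
def device_to_flags(device):
    """Map a Case.device string to (cpu, disk) offloading flags.

    Assumes the two flags are mutually exclusive.
    """
    mapping = {
        "gpu": (False, False),   # everything stays on the GPU
        "cpu": (True, False),    # offload to host memory
        "disk": (False, True),   # offload to disk
    }
    if device not in mapping:
        raise ValueError(f"unknown device: {device!r}")
    return mapping[device]


def choose_cut_gen_len(model):
    """Return None for full generation, or 5 for the larger models.

    Assumes the 6.7B model is the only one that runs the full
    generation length; every other model is treated as larger.
    """
    return None if model == "facebook/opt-6.7b" else 5


def library_to_use_ds(library):
    """Translate the Case.library string into the use_ds boolean."""
    if library not in ("hf", "ds"):
        raise ValueError(f"unknown library: {library!r}")
    return library == "ds"


cpu, disk = device_to_flags("disk")
print(cpu, disk, choose_cut_gen_len("facebook/opt-30b"), library_to_use_ds("ds"))
```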

The module defines 18 predefined suites covering three model sizes (6.7B, 30B, 175B), three sequence lengths (256, 512, 1024), and two libraries (HF, DS). These are aggregated into named collections in the suites dictionary for convenient batch execution from the command line.
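The count of 18 follows from 3 model sizes x 3 sequence lengths x 2 libraries. The sketch below enumerates that grid and groups it into named collections. It is an illustration only: the key naming scheme is inferred from names such as hf_s512 and 6b7, the sequence length is assumed to map to prompt_len, and the gen_len, batch_size, and device values are placeholders rather than the settings the module actually uses.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Case:
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1


MODELS = {
    "6b7": "facebook/opt-6.7b",
    "30b": "facebook/opt-30b",
    "175b": "facebook/opt-175b",
}
SEQ_LENS = (256, 512, 1024)
LIBRARIES = ("hf", "ds")


def make_suite(model, library, seq_len):
    # Placeholder settings: the real suites choose batch size and
    # device per model; these values are for illustration only.
    return [Case(model, library, seq_len, gen_len=32,
                 batch_size=1, device="gpu")]


# One base suite per (model, sequence length, library) combination.
base_suites = {
    f"{lib}_{tag}_s{seq}": make_suite(name, lib, seq)
    for (tag, name), seq, lib in product(MODELS.items(), SEQ_LENS, LIBRARIES)
}

# Aggregate base suites into named collections.
suites = {}
for key, cases in base_suites.items():
    lib, tag, seq = key.split("_")
    suites.setdefault(f"{lib}_{seq}", []).extend(cases)  # e.g. "hf_s512"
    suites.setdefault(tag, []).extend(cases)             # e.g. "6b7"

print(len(base_suites), sorted(suites))
```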

Usage

Run this module as a script with one or more suite names as arguments to execute a batch of baseline benchmarks. It is used to generate comparison data for evaluating FlexLLMGen's performance against standard HuggingFace and DeepSpeed inference.
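The script-style entry point might look roughly like the sketch below. The function names select_cases and main, the positional argument name, and the error handling are assumptions; only the idea of passing one or more suite names and running every case they contain comes from the description above.

```python
import argparse


def select_cases(names, suites):
    """Concatenate the cases of every requested suite, in order."""
    cases = []
    for name in names:
        if name not in suites:
            raise KeyError(f"unknown suite: {name!r}")
        cases.extend(suites[name])
    return cases


def main(argv, suites, run_case):
    """Parse suite names from argv and run each selected case."""
    parser = argparse.ArgumentParser(description="Run baseline suites")
    parser.add_argument("suite", nargs="+", help="one or more suite names")
    args = parser.parse_args(argv)
    for case in select_cases(args.suite, suites):
        run_case(case)


# Toy demonstration with stand-in suites and a runner that only prints.
toy_suites = {"a": ["case-1", "case-2"], "b": ["case-3"]}
main(["a", "b"], toy_suites, print)
```

In the real module the runner would be bench_one_case and the dictionary would be suites; both are passed in as parameters here so the sketch stays self-contained.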

Code Reference

Source Location

Repository: FMInference_FlexLLMGen
File: benchmark/hf_ds/bench_hf.py
Lines: 1-164

Signature

@dataclass
class Case:
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1

def run_huggingface(model, prompt_len, gen_len, cut_gen_len, batch_size,
                    num_nodes, num_gpus_per_node,
                    use_ds, cpu, disk, dummy, log_file=None, pkl_file=None):
    ...

def bench_one_case(case):
    ...

Import

from benchmark.hf_ds.bench_hf import Case, run_huggingface, bench_one_case, suites

I/O Contract

Inputs (Case dataclass)

Name	Type	Required	Description
model	str	Yes	HuggingFace model identifier (e.g., "facebook/opt-6.7b", "facebook/opt-30b", "facebook/opt-175b").
library	str	Yes	Inference library to use: "hf" for HuggingFace or "ds" for DeepSpeed.
prompt_len	int	Yes	Length of the input prompt in tokens.
gen_len	int	Yes	Number of tokens to generate.
batch_size	int	Yes	Number of prompts per batch.
device	str	Yes	Device placement strategy: "gpu", "cpu", or "disk".
num_nodes	int	No	Number of nodes (default 1; must be 1 currently).
num_gpus_per_node	int	No	Number of GPUs per node (default 1).

Inputs (run_huggingface)

Name	Type	Required	Description
model	str	Yes	HuggingFace model identifier.
prompt_len	int	Yes	Prompt token length.
gen_len	int	Yes	Generation token length.
cut_gen_len	int or None	Yes	Truncated generation length for latency projection (None for full generation).
batch_size	int	Yes	Batch size.
num_nodes	int	Yes	Number of nodes (must be 1).
num_gpus_per_node	int	Yes	Number of GPUs per node.
use_ds	bool	Yes	Whether to use the DeepSpeed launcher.
cpu	bool	Yes	Enable CPU offloading.
disk	bool	Yes	Enable disk offloading.
dummy	bool	Yes	Use dummy (random) weights instead of real model weights.
log_file	str	No	Path to log file for results.
pkl_file	str	No	Path to pickle file for serialized results.

Outputs

Name	Type	Description
(side effect)	None	Launches a subprocess running the HuggingFace/DeepSpeed benchmark via run_cmd(). Results are printed to stdout and optionally written to log/pickle files.

Usage Examples

# Command-line usage: run all HuggingFace suites at sequence length 512
# python benchmark/hf_ds/bench_hf.py hf_s512

# Run a specific combination of suites
# python benchmark/hf_ds/bench_hf.py 6b7 30b

# Programmatic usage:
from benchmark.hf_ds.bench_hf import Case, bench_one_case

case = Case(
    model="facebook/opt-6.7b",
    library="hf",
    prompt_len=512,
    gen_len=32,
    batch_size=2,
    device="gpu",
)
bench_one_case(case)
