Implementation:FMInference FlexLLMGen Bench HF Suite

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, LLM Inference
Last Updated: 2026-02-09 12:00 GMT

Overview

Benchmark harness that defines and executes predefined test suites for HuggingFace and DeepSpeed baseline inference across OPT model sizes and sequence lengths.

Description

The bench_hf module provides a structured benchmark framework for evaluating HuggingFace (HF) and DeepSpeed (DS) inference baselines against FlexLLMGen. It defines a Case dataclass that encapsulates all parameters for a single benchmark run (model name, library, prompt length, generation length, batch size, device placement, and node/GPU counts).

The module has three key components, two functions and a set of predefined suites:

  • run_huggingface() constructs and executes the appropriate shell command to launch a benchmark run. For DeepSpeed, it uses the deepspeed launcher; for plain HuggingFace, it uses python. The command is assembled with flags for model path, prompt/generation lengths, batch size, and optional CPU/disk offloading and dummy weights.
  • bench_one_case() translates a Case instance into the parameters expected by run_huggingface(), handling device-to-flag mapping (gpu, cpu, disk) and library selection (hf vs. ds). Models larger than 6.7B use a truncated generation length (cut_gen_len=5) for latency projection.
  • The module defines 18 predefined suites covering three model sizes (6.7B, 30B, 175B), three sequence lengths (256, 512, 1024), and two libraries (HF, DS). These are aggregated into named collections in the suites dictionary for convenient batch execution from the command line.
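The translation that bench_one_case() performs can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names case_to_params and model_size_in_billions are hypothetical, the Case class is a local stand-in mirroring the signature documented on this page, and the exact device-to-flag pairs (in particular whether "disk" also enables CPU offloading) are assumptions. The dummy flag is left out because the description does not say how it is chosen.

```python
import re
from dataclasses import dataclass


@dataclass
class Case:
    # Local stand-in mirroring the Case dataclass documented on this page.
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1


def model_size_in_billions(model: str) -> float:
    """Hypothetical helper: read the size suffix of a name like 'facebook/opt-6.7b'."""
    match = re.search(r"-(\d+(?:\.\d+)?)b$", model.lower())
    if match is None:
        raise ValueError(f"cannot infer model size from {model!r}")
    return float(match.group(1))


def case_to_params(case: Case) -> dict:
    """Translate a Case into keyword arguments shaped like run_huggingface()'s."""
    # Assumed mapping from device name to (cpu, disk) offloading flags.
    device_flags = {
        "gpu": (False, False),
        "cpu": (True, False),
        "disk": (False, True),
    }
    if case.device not in device_flags:
        raise ValueError(f"unknown device: {case.device!r}")
    if case.library not in ("hf", "ds"):
        raise ValueError(f"unknown library: {case.library!r}")

    cpu, disk = device_flags[case.device]
    # Models larger than 6.7B run a truncated generation (cut_gen_len=5).
    cut_gen_len = 5 if model_size_in_billions(case.model) > 6.7 else None

    return {
        "model": case.model,
        "prompt_len": case.prompt_len,
        "gen_len": case.gen_len,
        "cut_gen_len": cut_gen_len,
        "batch_size": case.batch_size,
        "num_nodes": case.num_nodes,
        "num_gpus_per_node": case.num_gpus_per_node,
        "use_ds": case.library == "ds",
        "cpu": cpu,
        "disk": disk,
    }
```

Returning a plain dictionary keeps the translation step easy to inspect before any subprocess is launched.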

Usage

Run this module as a script with one or more suite names as arguments to execute a batch of baseline benchmarks. It is used to generate comparison data for evaluating FlexLLMGen's performance against standard HuggingFace and DeepSpeed inference.
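One way such a suite registry and name-based selection could be organized is sketched below. It is an assumption-laden illustration rather than the module's code: the base suite names (for example "hf_6b7_s512"), the helpers build_suites and select_cases, and the placeholder gen_len, batch_size, and device values are all hypothetical, and the sketch assumes the suite's sequence length maps to prompt_len. The aggregate keys follow the naming pattern of the command-line examples on this page ("hf_s512", "6b7", "30b").

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Case:
    # Local stand-in mirroring the Case dataclass documented on this page.
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1


MODELS = {
    "6b7": "facebook/opt-6.7b",
    "30b": "facebook/opt-30b",
    "175b": "facebook/opt-175b",
}
SEQ_LENS = (256, 512, 1024)
LIBRARIES = ("hf", "ds")


def build_suites(gen_len=32, batch_size=1, device="gpu"):
    """Build 3 x 3 x 2 = 18 base suites plus aggregated collections."""
    suites = {}
    for (size_key, model), seq_len, lib in product(MODELS.items(), SEQ_LENS, LIBRARIES):
        # Assumption: the suite's sequence length is used as the prompt length.
        case = Case(model, lib, seq_len, gen_len, batch_size, device)
        suites[f"{lib}_{size_key}_s{seq_len}"] = [case]          # base suite
        suites.setdefault(f"{lib}_s{seq_len}", []).append(case)  # per library and length
        suites.setdefault(size_key, []).append(case)             # per model size
    return suites


def select_cases(suites, names):
    """Collect the cases for the suite names given on the command line."""
    cases = []
    for name in names:
        if name not in suites:
            raise KeyError(f"unknown suite: {name!r}")
        cases.extend(suites[name])
    return cases
```

A script entry point would then pass the suite names from its arguments to select_cases and call bench_one_case on each returned case.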

Code Reference

Source Location

Signature

@dataclass
class Case:
    model: str
    library: str
    prompt_len: int
    gen_len: int
    batch_size: int
    device: str
    num_nodes: int = 1
    num_gpus_per_node: int = 1

def run_huggingface(model, prompt_len, gen_len, cut_gen_len, batch_size,
                    num_nodes, num_gpus_per_node,
                    use_ds, cpu, disk, dummy, log_file=None, pkl_file=None):
    ...

def bench_one_case(case):
    ...

Import

from benchmark.hf_ds.bench_hf import Case, run_huggingface, bench_one_case, suites

I/O Contract

Inputs (Case dataclass)

| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model identifier (e.g., "facebook/opt-6.7b", "facebook/opt-30b", "facebook/opt-175b"). |
| library | str | Yes | Inference library to use: "hf" for HuggingFace or "ds" for DeepSpeed. |
| prompt_len | int | Yes | Length of the input prompt in tokens. |
| gen_len | int | Yes | Number of tokens to generate. |
| batch_size | int | Yes | Number of prompts per batch. |
| device | str | Yes | Device placement strategy: "gpu", "cpu", or "disk". |
| num_nodes | int | No | Number of nodes (default 1; only 1 is currently supported). |
| num_gpus_per_node | int | No | Number of GPUs per node (default 1). |

Inputs (run_huggingface)

| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | HuggingFace model identifier. |
| prompt_len | int | Yes | Prompt length in tokens. |
| gen_len | int | Yes | Number of tokens to generate. |
| cut_gen_len | int or None | Yes | Truncated generation length for latency projection (None for full generation). |
| batch_size | int | Yes | Batch size. |
| num_nodes | int | Yes | Number of nodes (must be 1). |
| num_gpus_per_node | int | Yes | Number of GPUs per node. |
| use_ds | bool | Yes | Whether to use the DeepSpeed launcher. |
| cpu | bool | Yes | Enable CPU offloading. |
| disk | bool | Yes | Enable disk offloading. |
| dummy | bool | Yes | Use dummy (random) weights instead of real model weights. |
| log_file | str | No | Path to log file for results. |
| pkl_file | str | No | Path to pickle file for serialized results. |
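The cut_gen_len parameter implies that only a few decoding steps are measured and the full-length latency is extrapolated. This page does not state the formula used, so the sketch below is only one plausible approach: a linear extrapolation that assumes the first token comes from the prefill pass and that every later token costs the same. The function name and its arguments are hypothetical.

```python
def project_latency(prefill_latency, decode_latency_cut, gen_len, cut_gen_len):
    """Linearly extrapolate total latency from a truncated run.

    prefill_latency:    seconds spent on the prompt (assumed to yield token 1)
    decode_latency_cut: seconds spent on the remaining cut_gen_len - 1 tokens
    """
    if cut_gen_len < 2:
        raise ValueError("cut_gen_len must be at least 2 to measure a decode step")
    if gen_len < cut_gen_len:
        raise ValueError("gen_len must not be smaller than cut_gen_len")
    # Assumption: every decode step costs the same as the measured average.
    per_token = decode_latency_cut / (cut_gen_len - 1)
    return prefill_latency + per_token * (gen_len - 1)


def projected_throughput(batch_size, gen_len, total_latency):
    """Generated tokens per second for the whole batch."""
    return batch_size * gen_len / total_latency
```

For example, with a 2.0 s prefill and 0.4 s spent on four decode steps (cut_gen_len=5), the projected latency for gen_len=32 is 2.0 + 0.1 × 31 = 5.1 s.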

Outputs

| Name | Type | Description |
|---|---|---|
| (side effect) | None | Launches a subprocess running the HuggingFace/DeepSpeed benchmark via run_cmd(). Results are printed to stdout and optionally written to log/pickle files. |
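The command assembly described for run_huggingface() might look roughly like the sketch below. Only the choice between the deepspeed launcher and python comes from this page. The function name build_command, the script name baseline_script.py, and every flag passed to that script are placeholders that have not been checked against the repository. The launcher's --num_gpus option is assumed from DeepSpeed's command-line interface.

```python
import shlex


def build_command(model, prompt_len, gen_len, cut_gen_len, batch_size,
                  num_nodes, num_gpus_per_node,
                  use_ds, cpu, disk, dummy, log_file=None, pkl_file=None,
                  script="baseline_script.py"):  # placeholder script name
    """Return the benchmark command as an argument list (hypothetical flags)."""
    if num_nodes != 1:
        raise ValueError("only single-node runs are supported")

    # DeepSpeed runs go through the deepspeed launcher, plain HF through python.
    if use_ds:
        cmd = ["deepspeed", "--num_gpus", str(num_gpus_per_node), script]
    else:
        cmd = ["python", script]

    # Flag names below are illustrative placeholders.
    cmd += ["--model", model,
            "--prompt-len", str(prompt_len),
            "--gen-len", str(gen_len),
            "--batch-size", str(batch_size)]
    if cut_gen_len is not None:
        cmd += ["--cut-gen-len", str(cut_gen_len)]
    if cpu:
        cmd.append("--cpu-offload")
    if disk:
        cmd.append("--disk-offload")
    if dummy:
        cmd.append("--dummy")
    if log_file is not None:
        cmd += ["--log-file", log_file]
    if pkl_file is not None:
        cmd += ["--pkl-file", pkl_file]
    return cmd


# Render the list as a single shell-safe string for logging.
print(shlex.join(build_command("facebook/opt-30b", 512, 32, 5, 4,
                               1, 1, True, True, False, True)))
```

Building the command as a list and quoting it only for display avoids shell-escaping problems if a model path or log path contains spaces.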

Usage Examples

# Command-line usage: run all HuggingFace suites at sequence length 512
# python benchmark/hf_ds/bench_hf.py hf_s512

# Run a specific combination of suites
# python benchmark/hf_ds/bench_hf.py 6b7 30b

# Programmatic usage:
from benchmark.hf_ds.bench_hf import Case, bench_one_case

case = Case(
    model="facebook/opt-6.7b",
    library="hf",
    prompt_len=512,
    gen_len=32,
    batch_size=2,
    device="gpu",
)
bench_one_case(case)
