Workflow:FMInference FlexLLMGen HELM Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Benchmarking, Model_Evaluation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for evaluating OPT language models on Stanford's HELM benchmark scenarios using FlexLLMGen as the inference backend, enabling evaluation of large models on limited GPU hardware.
Description
This workflow integrates FlexLLMGen with the HELM (Holistic Evaluation of Language Models) benchmark framework (v0.2.1) to run standardized NLP evaluation scenarios. FlexLLMGen serves as the inference backend, replacing HELM's default API-based inference with local offloaded generation. This enables running HELM evaluations on models up to OPT-175B using a single commodity GPU. The integration covers scenario instantiation, adapter-based prompt construction, batched generation, metric computation, and result persistence.
Usage
Execute this workflow when you need to evaluate an OPT model's performance on standardized NLP benchmarks (such as MMLU, WikiFact, synthetic reasoning, or XSUM summarization) but have limited GPU resources. Suitable for research evaluation, model comparison, and quality assessment of large language models without requiring expensive multi-GPU infrastructure.
Execution Steps
Step 1: Install Dependencies
Install the HELM benchmark package (crfm-helm) alongside FlexLLMGen. The HELM framework provides scenario definitions, adapters for prompt formatting, and metrics for evaluation.
Key considerations:
- Run pip install crfm-helm to install HELM v0.2.1
- HELM provides standardized scenarios, adapters, and metrics
- Only a subset of HELM scenarios has been tested with FlexLLMGen
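After installing, a quick sanity check (illustrative, not part of either package) can confirm that both packages are importable from the current environment:

```python
import importlib.util

# Check that the HELM framework ("helm") and FlexLLMGen ("flexgen")
# are importable before attempting a run.
availability = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ("helm", "flexgen")
}
for pkg, ok in availability.items():
    print(f"{pkg}: {'found' if ok else 'MISSING'}")
```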
Step 2: Select and Configure Scenario
Choose a HELM benchmark scenario to evaluate (e.g., MMLU with a specific subject) and configure the evaluation parameters including the model, offloading policy, batch sizes, sequence length padding, and maximum evaluation instances.
Key considerations:
- The --description flag specifies the HELM scenario (e.g., mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical)
- Use --pad-to-seq-len to set uniform sequence lengths for efficient batching
- Configure --gpu-batch-size and --num-gpu-batches for throughput optimization
- Use --max-eval-instance to cap the number of evaluated instances, which keeps test runs short
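A typical invocation can be assembled as below. The flag names follow the descriptions above; the module path flexgen.apps.helm_run, the --percent offloading flag, and all values are illustrative examples, and defaults may differ in your FlexLLMGen version:

```python
# Illustrative sketch: assemble a helm_run command line.
argv = [
    "python3", "-m", "flexgen.apps.helm_run",
    "--description",
    "mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical",
    "--model", "facebook/opt-30b",
    "--percent", "20", "80", "0", "100", "0", "100",  # offloading policy (assumed flag)
    "--gpu-batch-size", "32",
    "--num-gpu-batches", "4",
    "--pad-to-seq-len", "512",
    "--max-eval-instance", "100",
]
print(" ".join(argv))
```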
Step 3: Initialize Model and Tokenizer
Set up the FlexLLMGen OptLM model with the specified offloading policy and initialize a custom OptTokenizer wrapper that bridges HuggingFace tokenization with HELM's tokenization interface. The wrapper translates between HELM's TokenizationRequest/DecodeRequest protocol and the underlying AutoTokenizer.
Key considerations:
- The OptTokenizer class adapts AutoTokenizer to HELM's TokenizationRequest interface
- Supports both encoding (text to tokens) and decoding (tokens to text) operations
- Handles truncation when required by HELM scenarios
- Special handling for Galactica models (custom pad/eos tokens)
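The adapter pattern behind the tokenizer wrapper can be sketched as follows. TokenizerAdapter and the toy whitespace tokenizer are stand-ins chosen for illustration; they are not the actual OptTokenizer or AutoTokenizer APIs:

```python
class TokenizerAdapter:
    """Bridges a tokenizer's encode/decode methods to a HELM-style
    tokenize/decode interface (illustrative sketch)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tokenize(self, text, truncation=False, max_length=None):
        ids = self.tokenizer.encode(text)
        if truncation and max_length is not None:
            ids = ids[:max_length]  # truncate when the scenario requires it
        return ids

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)


class ToyWhitespaceTokenizer:
    """Stand-in for a real HuggingFace tokenizer."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


adapter = TokenizerAdapter(ToyWhitespaceTokenizer())
print(adapter.tokenize("the quick brown fox", truncation=True, max_length=2))
# → ['the', 'quick']
print(adapter.decode(["hello", "world"]))  # → hello world
```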
Step 4: Instantiate Scenario and Adapter
Use HELM's RunSpec to create the scenario (which loads the evaluation dataset), the adapter (which formats instances into prompts), and the metrics (which score model outputs). The scenario provides instances with input/output pairs, and the adapter constructs properly formatted prompts including few-shot examples.
Key considerations:
- Scenarios are resolved from HELM's RunEntry descriptions
- The adapter handles prompt construction, including multiple-choice formatting
- A DataPreprocessor computes instance-level features
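Few-shot prompt construction can be illustrated with a small function. This is a simplified stand-in for what the adapter does, not HELM's real Adapter API, and the field labels are invented for the example:

```python
def build_prompt(instructions, train_examples, eval_input):
    """Assemble a few-shot prompt: instructions first, then worked
    examples, then the evaluation input awaiting a completion."""
    parts = [instructions] if instructions else []
    for question, answer in train_examples:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    parts.append(f"Question: {eval_input}\nAnswer:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Answer the multiple-choice question.",
    [("2 + 2 = ? A. 3 B. 4", "B")],
    "3 + 3 = ? A. 6 B. 7",
)
print(prompt)
```

The model's completion is expected to follow the final "Answer:", mirroring how the adapter leaves the evaluation instance open-ended.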
Step 5: Run Batched Generation
Process all evaluation instances through FlexLLMGen's generation pipeline. Instances are grouped into batches, padded to uniform length, and processed through the offloaded inference engine. For each instance, the adapter constructs a prompt, and the model generates a completion that is then truncated and scored.
Key considerations:
- Instances are batched by the effective batch size (gpu_batch_size * num_gpu_batches)
- Each batch is padded to the maximum sequence length in that batch or the --pad-to-seq-len value
- The model is reinitialized for each batch to accommodate different sequence lengths
- Stop sequences and token limits from HELM's request specification are respected
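The batching and padding logic above can be sketched as follows. The function name and details are illustrative; pad_id=1 matches OPT's pad token, and prompts are left-padded, as is typical for batched generation with decoder-only models:

```python
def batch_and_pad(token_lists, gpu_batch_size, num_gpu_batches,
                  pad_to_seq_len=None, pad_id=1):
    """Group tokenized prompts into effective batches and left-pad
    each batch to a uniform length (illustrative sketch)."""
    effective_bs = gpu_batch_size * num_gpu_batches
    batches = []
    for i in range(0, len(token_lists), effective_bs):
        chunk = token_lists[i:i + effective_bs]
        # Pad to the fixed --pad-to-seq-len if given, otherwise to the
        # longest prompt in this batch.
        target = pad_to_seq_len or max(len(t) for t in chunk)
        batches.append([[pad_id] * (target - len(t)) + t for t in chunk])
    return batches


batches = batch_and_pad([[5, 6], [7], [8, 9, 10]],
                        gpu_batch_size=2, num_gpu_batches=1)
print(batches)  # → [[[5, 6], [1, 7]], [[8, 9, 10]]]
```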
Step 6: Compute Metrics and Save Results
Evaluate the generated completions against HELM's metric suite, which may include accuracy, F1, ROUGE, calibration, and other task-specific metrics. Results are aggregated into a ScenarioState and persisted as JSON files for analysis.
Key considerations:
- Metrics are computed per-instance and aggregated across the scenario
- Results include per-instance stats and overall scenario scores
- Output is saved as scenario_state.json and run_spec.json in the run directory
- The HELM framework handles all metric computation through its MetricSpec system
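A minimal sketch of this final step, using exact-match accuracy and JSON persistence. HELM's actual MetricSpec machinery is far richer; only the scenario_state.json file name below comes from the workflow, and everything else is illustrative:

```python
import json
import tempfile
from pathlib import Path


def exact_match_accuracy(completions, references):
    """Per-instance exact match, aggregated into a scenario-level score."""
    matches = [c.strip() == r.strip() for c, r in zip(completions, references)]
    return {
        "per_instance": matches,
        "accuracy": sum(matches) / len(matches),
    }


stats = exact_match_accuracy(["B", "A ", "C"], ["B", "A", "D"])
run_dir = Path(tempfile.mkdtemp())
(run_dir / "scenario_state.json").write_text(json.dumps(stats, indent=2))
print(stats["accuracy"])  # → 0.6666666666666666
```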