Workflow:FMInference FlexLLMGen HELM Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Benchmarking, Model_Evaluation |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for evaluating OPT language models on Stanford's HELM benchmark scenarios using FlexLLMGen as the inference backend, enabling evaluation of large models on limited GPU hardware.
Description
This workflow integrates FlexLLMGen with the HELM (Holistic Evaluation of Language Models) benchmark framework (v0.2.1) to run standardized NLP evaluation scenarios. FlexLLMGen serves as the inference backend, replacing HELM's default API-based inference with local offloaded generation. This enables running HELM evaluations on models up to OPT-175B using a single commodity GPU. The integration covers scenario instantiation, adapter-based prompt construction, batched generation, metric computation, and result persistence.
Usage
Execute this workflow when you need to evaluate an OPT model's performance on standardized NLP benchmarks (such as MMLU, WikiFact, synthetic reasoning, or XSUM summarization) but have limited GPU resources. Suitable for research evaluation, model comparison, and quality assessment of large language models without requiring expensive multi-GPU infrastructure.
Execution Steps
Step 1: Install Dependencies
Install the HELM benchmark package (crfm-helm) alongside FlexLLMGen. The HELM framework provides scenario definitions, adapters for prompt formatting, and metrics for evaluation.
Key considerations:
- Run pip install crfm-helm to install HELM v0.2.1
- HELM provides standardized scenarios, adapters, and metrics
- Only a subset of HELM scenarios has been tested with FlexLLMGen
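After installing, a quick sanity check (illustrative, not part of either package) can confirm that both packages are importable from the current environment:

```python
import importlib.util

# Check that the HELM framework ("helm") and FlexLLMGen ("flexgen")
# are importable before attempting a run.
availability = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ("helm", "flexgen")
}
for pkg, ok in availability.items():
    print(f"{pkg}: {'found' if ok else 'MISSING'}")
```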
Step 2: Select and Configure Scenario
Choose a HELM benchmark scenario to evaluate (e.g., MMLU with a specific subject) and configure the evaluation parameters including the model, offloading policy, batch sizes, sequence length padding, and maximum evaluation instances.
Key considerations:
- The --description flag specifies the HELM scenario (e.g., mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical)
- Use --pad-to-seq-len to set uniform sequence lengths for efficient batching
- Configure --gpu-batch-size and --num-gpu-batches for throughput optimization
- Use --max-eval-instance to cap the number of evaluated instances, which keeps test runs short
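A typical invocation can be assembled as below. The flag names follow the descriptions above; the module path flexgen.apps.helm_run, the --percent offloading flag, and all values are illustrative examples, and defaults may differ in your FlexLLMGen version:

```python
# Illustrative sketch: assemble a helm_run command line.
argv = [
    "python3", "-m", "flexgen.apps.helm_run",
    "--description",
    "mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical",
    "--model", "facebook/opt-30b",
    "--percent", "20", "80", "0", "100", "0", "100",  # offloading policy (assumed flag)
    "--gpu-batch-size", "32",
    "--num-gpu-batches", "4",
    "--pad-to-seq-len", "512",
    "--max-eval-instance", "100",
]
print(" ".join(argv))
```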
Step 3: Initialize Model and Tokenizer
Set up the FlexLLMGen OptLM model with the specified offloading policy and initialize a custom OptTokenizer wrapper that bridges HuggingFace tokenization with HELM's tokenization interface. The wrapper translates between HELM's TokenizationRequest/DecodeRequest protocol and the underlying AutoTokenizer.
Key considerations:
- The OptTokenizer class adapts AutoTokenizer to HELM's TokenizationRequest interface
- Supports both encoding (text to tokens) and decoding (tokens to text) operations
- Handles truncation when required by HELM scenarios
- Special handling for Galactica models (custom pad/eos tokens)
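The adapter pattern behind the tokenizer wrapper can be sketched as follows. TokenizerAdapter and the toy whitespace tokenizer are stand-ins chosen for illustration; they are not the actual OptTokenizer or AutoTokenizer APIs:

```python
class TokenizerAdapter:
    """Bridges a tokenizer's encode/decode methods to a HELM-style
    tokenize/decode interface (illustrative sketch)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tokenize(self, text, truncation=False, max_length=None):
        ids = self.tokenizer.encode(text)
        if truncation and max_length is not None:
            ids = ids[:max_length]  # truncate when the scenario requires it
        return ids

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)


class ToyWhitespaceTokenizer:
    """Stand-in for a real HuggingFace tokenizer."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


adapter = TokenizerAdapter(ToyWhitespaceTokenizer())
print(adapter.tokenize("the quick brown fox", truncation=True, max_length=2))
# → ['the', 'quick']
print(adapter.decode(["hello", "world"]))  # → hello world
```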
Step 4: Instantiate Scenario and Adapter
Use HELM's RunSpec to create the scenario (which loads the evaluation dataset), the adapter (which formats instances into prompts), and the metrics (which score model outputs). The scenario provides instances with input/output pairs, and the adapter constructs properly formatted prompts including few-shot examples.
Key considerations:
- Scenarios are resolved from HELM's RunEntry descriptions
- The adapter handles prompt construction, including multiple-choice formatting
- A DataPreprocessor computes instance-level features
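Few-shot prompt construction can be illustrated with a small function. This is a simplified stand-in for what the adapter does, not HELM's real Adapter API, and the field labels are invented for the example:

```python
def build_prompt(instructions, train_examples, eval_input):
    """Assemble a few-shot prompt: instructions first, then worked
    examples, then the evaluation input awaiting a completion."""
    parts = [instructions] if instructions else []
    for question, answer in train_examples:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    parts.append(f"Question: {eval_input}\nAnswer:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Answer the multiple-choice question.",
    [("2 + 2 = ? A. 3 B. 4", "B")],
    "3 + 3 = ? A. 6 B. 7",
)
print(prompt)
```

The model's completion is expected to follow the final "Answer:", mirroring how the adapter leaves the evaluation instance open-ended.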
Step 5: Run Batched Generation
Process all evaluation instances through FlexLLMGen's generation pipeline. Instances are grouped into batches, padded to uniform length, and processed through the offloaded inference engine. For each instance, the adapter constructs a prompt, and the model generates a completion that is then truncated and scored.
Key considerations:
- Instances are batched by the effective batch size (gpu_batch_size * num_gpu_batches)
- Each batch is padded to the maximum sequence length in that batch or the --pad-to-seq-len value
- The model is reinitialized for each batch to accommodate different sequence lengths
- Stop sequences and token limits from HELM's request specification are respected
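The batching and padding logic above can be sketched as follows. The function name and details are illustrative; pad_id=1 matches OPT's pad token, and prompts are left-padded, as is typical for batched generation with decoder-only models:

```python
def batch_and_pad(token_lists, gpu_batch_size, num_gpu_batches,
                  pad_to_seq_len=None, pad_id=1):
    """Group tokenized prompts into effective batches and left-pad
    each batch to a uniform length (illustrative sketch)."""
    effective_bs = gpu_batch_size * num_gpu_batches
    batches = []
    for i in range(0, len(token_lists), effective_bs):
        chunk = token_lists[i:i + effective_bs]
        # Pad to the fixed --pad-to-seq-len if given, otherwise to the
        # longest prompt in this batch.
        target = pad_to_seq_len or max(len(t) for t in chunk)
        batches.append([[pad_id] * (target - len(t)) + t for t in chunk])
    return batches


batches = batch_and_pad([[5, 6], [7], [8, 9, 10]],
                        gpu_batch_size=2, num_gpu_batches=1)
print(batches)  # → [[[5, 6], [1, 7]], [[8, 9, 10]]]
```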
Step 6: Compute Metrics and Save Results
Evaluate the generated completions against HELM's metric suite, which may include accuracy, F1, ROUGE, calibration, and other task-specific metrics. Results are aggregated into a ScenarioState and persisted as JSON files for analysis.
Key considerations:
- Metrics are computed per-instance and aggregated across the scenario
- Results include per-instance stats and overall scenario scores
- Output is saved as scenario_state.json and run_spec.json in the run directory
- The HELM framework handles all metric computation through its MetricSpec system
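A minimal sketch of this final step, using exact-match accuracy and JSON persistence. HELM's actual MetricSpec machinery is far richer; only the scenario_state.json file name below comes from the workflow, and everything else is illustrative:

```python
import json
import tempfile
from pathlib import Path


def exact_match_accuracy(completions, references):
    """Per-instance exact match, aggregated into a scenario-level score."""
    matches = [c.strip() == r.strip() for c, r in zip(completions, references)]
    return {
        "per_instance": matches,
        "accuracy": sum(matches) / len(matches),
    }


stats = exact_match_accuracy(["B", "A ", "C"], ["B", "A", "D"])
run_dir = Path(tempfile.mkdtemp())
(run_dir / "scenario_state.json").write_text(json.dumps(stats, indent=2))
print(stats["accuracy"])  # → 0.6666666666666666
```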