Implementation: Allenai Open-instruct Benchmark Generators
| Knowledge Sources | Details |
|---|---|
| Domains | Benchmarking, Performance |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
A concrete tool for benchmarking vLLM generator performance under GRPO-like workloads, measuring tokens per second alongside two utilization metrics: Model FLOPs Utilization (MFU) and Memory Bandwidth Utilization (MBU).
Description
The benchmark_generators.py module profiles vLLM inference engine performance by simulating GRPO training generation workloads. It loads datasets using the same pipeline as grpo_fast.py, sets up Ray-based vLLM engines, and streams batches through them to measure performance. Key metrics include tokens per second, Model FLOPs Utilization (MFU), and Memory Bandwidth Utilization (MBU). Results are saved to CSV with git commit hashes for longitudinal tracking. The module also simulates weight synchronization between generation batches to profile the full GRPO training loop overhead.
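As a rough sketch of what the two utilization metrics mean (these are the standard definitions, with illustrative function and variable names that are not the module's own API; the module itself derives FLOP and byte counts from its model and hardware dimensions):

# Standard MFU/MBU definitions, sketched for illustration only.
def mfu(tokens_per_sec: float, flops_per_token: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by hardware peak FLOP/s."""
    return tokens_per_sec * flops_per_token / peak_flops

def mbu(tokens_per_sec: float, bytes_per_token: float, peak_bandwidth: float) -> float:
    """Memory Bandwidth Utilization: achieved bytes/s divided by peak bytes/s."""
    return tokens_per_sec * bytes_per_token / peak_bandwidth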
Usage
Use this module for performance profiling of the vLLM generation pipeline. It helps identify bottlenecks between generation and weight sync, tune batch sizes, and validate MFU calculations for capacity planning of large-scale GRPO training runs.
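For instance, the generation-versus-sync split can be read straight off the per-batch timings. A minimal sketch, assuming each result dict exposes generation_time and weight_sync_time keys (the key names are hypothetical, not the module's documented schema):

# Hypothetical per-batch timings; run_benchmark() returns the real dicts.
results = [
    {"generation_time": 41.8, "weight_sync_time": 2.3},
    {"generation_time": 40.2, "weight_sync_time": 2.5},
]
gen_total = sum(r["generation_time"] for r in results)
sync_total = sum(r["weight_sync_time"] for r in results)
overall = gen_total + sync_total
print(f"generation: {gen_total / overall:.1%}, weight sync: {sync_total / overall:.1%}")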
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: open_instruct/benchmark_generators.py
- Lines: 1-735
Signature
def save_completion_lengths(batch_results: list[dict], timestamp: int, batch_idx: int) -> None:
    """Save completion lengths to CSV file."""

def save_config(args, tokenizer_config, model_config, streaming_config, timestamp: int) -> None:
    """Save benchmark configuration to JSON file."""

def save_benchmark_results_to_csv(results: list[dict[str, Any]], total_time: float,
                                  streaming_config, model_config) -> None:
    """Save results to CSV with git commit, batch sizes, and timing breakdowns."""

def free_all_gpu_memory(device: int | str = 0) -> None:
    """Aggressively clear PyTorch GPU caches before starting vLLM."""

def setup_dataset(args, streaming_config, tokenizer_config) -> datasets.Dataset:
    """Load dataset using same pipeline as grpo_fast.py."""

def setup_vllm_engines(...) -> tuple[list, ray_queue.Queue, ray_queue.Queue, ActorHandle]:
    """Create Ray actors with vLLM engines, reward config, and queues."""

def simulate_weight_sync(actor_manager, vllm_engines, args) -> float:
    """Simulate weight synchronization between batches, return elapsed time."""

def run_benchmark(dataset, vllm_engines, param_prompt_Q, inference_results_Q,
                  actor_manager, ...) -> list[dict[str, Any]]:
    """Run the full benchmark: warmup, stream batches, collect results."""

def aggregate_results(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-batch results into summary statistics."""

def print_summary(results, total_time, streaming_config, model_config, model_dims) -> None:
    """Display summary with percentiles and utilization metrics."""

def main() -> None:
    """CLI entry point for the benchmark."""
Import
# CLI script, run directly:
# python -m open_instruct.benchmark_generators --config_path <config.yaml>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config_path | str | Yes | Path to GRPO experiment config YAML |
| num_batches | int | No | Number of batches to benchmark (default: 10) |
| batch_size | int | No | Prompts per batch |
| dataset | datasets.Dataset | Auto | Loaded from config using GRPO pipeline |
Outputs
| Name | Type | Description |
|---|---|---|
| CSV results | File | Per-batch and aggregate performance metrics |
| JSON config | File | Benchmark configuration snapshot |
| Console summary | stdout | Percentile statistics, MFU, MBU, tokens/sec |
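Since each CSV row carries a git commit hash, runs can be compared longitudinally across commits. A sketch of one way to do that with pandas, assuming hypothetical file and column names (git_commit, tokens_per_second); check the emitted CSV for the real schema:

import pandas as pd

# File name and column names here are assumptions for illustration.
df = pd.read_csv("benchmark_results.csv")
per_commit = df.groupby("git_commit")["tokens_per_second"].agg(["mean", "std"])
print(per_commit.sort_values("mean", ascending=False))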
Usage Examples
Running the Benchmark
# Run vLLM generator benchmark with a GRPO config
python -m open_instruct.benchmark_generators \
--config_path configs/train_configs/grpo/default.yaml \
--num_batches 20