
Implementation:Allenai Open instruct Benchmark Generators

From Leeroopedia


Knowledge Sources
Domains Benchmarking, Performance
Last Updated 2026-02-07 02:00 GMT

Overview

A concrete tool for benchmarking vLLM generator performance in GRPO-style workloads, measuring throughput metrics including tokens per second, Model FLOPs Utilization (MFU), and Memory Bandwidth Utilization (MBU).

Description

The benchmark_generators.py module profiles vLLM inference engine performance by simulating GRPO training generation workloads. It loads datasets using the same pipeline as grpo_fast.py, sets up Ray-based vLLM engines, and streams batches through them to measure performance. Key metrics include tokens per second, Model FLOPs Utilization (MFU), and Memory Bandwidth Utilization (MBU). Results are saved to CSV with git commit hashes for longitudinal tracking. The module also simulates weight synchronization between generation batches to profile the full GRPO training loop overhead.
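The MFU and MBU figures follow the standard definitions: achieved FLOPs (or bytes moved) divided by the hardware peak. A minimal sketch of these calculations, assuming illustrative A100 peak numbers and the common ~2 FLOPs-per-parameter-per-token decoding estimate; none of these constants or function names come from the module itself:

```python
# Hedged sketch of MFU/MBU arithmetic; constants are illustrative assumptions.

def mfu(tokens_per_sec: float, n_params: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOPs / peak hardware FLOPs."""
    achieved = 2 * n_params * tokens_per_sec  # ~2 FLOPs per param per token
    return achieved / peak_flops

def mbu(tokens_per_sec: float, n_params: float,
        bytes_per_param: float, peak_bandwidth: float) -> float:
    """Memory Bandwidth Utilization: weights streamed once per decoded token
    (valid for batch-1 decoding)."""
    achieved = n_params * bytes_per_param * tokens_per_sec
    return achieved / peak_bandwidth

# Example: 7B model decoding at 50 tok/s on an A100 (312 TFLOP/s BF16, 2.0 TB/s HBM)
print(round(mfu(50, 7e9, 312e12), 4))     # → 0.0022 (decoding is memory bound)
print(round(mbu(50, 7e9, 2, 2.0e12), 4))  # → 0.35
```

Single-stream decoding is memory bound, which is why a healthy MBU can coexist with a very low MFU; batched GRPO generation shifts both numbers upward.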

Usage

Use this module for performance profiling of the vLLM generation pipeline. It helps identify bottlenecks between generation and weight sync, tune batch sizes, and validate MFU calculations for capacity planning of large-scale GRPO training runs.

Code Reference

Source Location

Signature

def save_completion_lengths(batch_results: list[dict], timestamp: int, batch_idx: int) -> None:
    """Save completion lengths to CSV file."""

def save_config(args, tokenizer_config, model_config, streaming_config, timestamp: int) -> None:
    """Save benchmark configuration to JSON file."""

def save_benchmark_results_to_csv(results: list[dict[str, Any]], total_time: float,
    streaming_config, model_config) -> None:
    """Save results to CSV with git commit, batch sizes, and timing breakdowns."""

def free_all_gpu_memory(device: int | str = 0) -> None:
    """Aggressively clear PyTorch GPU caches before starting vLLM."""

def setup_dataset(args, streaming_config, tokenizer_config) -> datasets.Dataset:
    """Load dataset using same pipeline as grpo_fast.py."""

def setup_vllm_engines(...) -> tuple[list, ray_queue.Queue, ray_queue.Queue, ActorHandle]:
    """Create Ray actors with vLLM engines, reward config, and queues."""

def simulate_weight_sync(actor_manager, vllm_engines, args) -> float:
    """Simulate weight synchronization between batches, return elapsed time."""

def run_benchmark(dataset, vllm_engines, param_prompt_Q, inference_results_Q,
    actor_manager, ...) -> list[dict[str, Any]]:
    """Run the full benchmark: warmup, stream batches, collect results."""

def aggregate_results(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-batch results into summary statistics."""

def print_summary(results, total_time, streaming_config, model_config, model_dims) -> None:
    """Display summary with percentiles and utilization metrics."""

def main() -> None:
    """CLI entry point for the benchmark."""

Import

# CLI script, run directly:
# python -m open_instruct.benchmark_generators --config_path <config.yaml>

I/O Contract

Inputs

Name         Type              Required  Description
config_path  str               Yes       Path to GRPO experiment config YAML
num_batches  int               No        Number of batches to benchmark (default: 10)
batch_size   int               No        Prompts per batch
dataset      datasets.Dataset  Auto      Loaded from config using the GRPO pipeline

Outputs

Name             Type    Description
CSV results      File    Per-batch and aggregate performance metrics
JSON config      File    Benchmark configuration snapshot
Console summary  stdout  Percentile statistics, MFU, MBU, tokens/sec
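The CSV output tags each row with the current git commit so runs can be compared over time. An illustrative sketch of that pattern (not the module's actual code; the column names are assumptions):

```python
# Hedged sketch: append benchmark rows to CSV, tagged with the git commit.
import csv
import subprocess

def current_commit() -> str:
    """Short git commit hash, or "unknown" outside a git checkout."""
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def append_results(path: str, rows: list[dict]) -> None:
    """Append rows, writing the header only when the file is empty."""
    commit = current_commit()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["commit", "batch_idx", "tokens_per_second"]
        )
        if f.tell() == 0:
            writer.writeheader()
        for row in rows:
            writer.writerow({"commit": commit, **row})
```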

Usage Examples

Running the Benchmark

# Run vLLM generator benchmark with a GRPO config
python -m open_instruct.benchmark_generators \
  --config_path configs/train_configs/grpo/default.yaml \
  --num_batches 20
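For the longitudinal tracking the CSV output enables, a small reader can compare mean throughput across commits. The file path and column names below are illustrative assumptions, not the module's actual schema:

```python
# Hedged sketch: compare mean tokens/sec across commits from benchmark CSVs.
import csv
from collections import defaultdict

def mean_tps_by_commit(path: str) -> dict[str, float]:
    """Group rows by commit hash and average their throughput."""
    by_commit: dict[str, list[float]] = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            by_commit[row["commit"]].append(float(row["tokens_per_second"]))
    return {commit: sum(v) / len(v) for commit, v in by_commit.items()}
```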
