Implementation: Allenai Open-instruct Benchmark Generators
| Knowledge Sources | Details |
|---|---|
| Domains | Benchmarking, Performance |
| Last Updated | 2026-02-07 02:00 GMT |
Overview
A concrete tool for benchmarking vLLM generator performance under GRPO-like workloads, measuring tokens per second alongside two utilization metrics: Model FLOPs Utilization (MFU) and Memory Bandwidth Utilization (MBU).
Description
The benchmark_generators.py module profiles vLLM inference engine performance by simulating GRPO training generation workloads. It loads datasets using the same pipeline as grpo_fast.py, sets up Ray-based vLLM engines, and streams batches through them to measure performance. Key metrics include tokens per second, Model FLOPs Utilization (MFU), and Memory Bandwidth Utilization (MBU). Results are saved to CSV with git commit hashes for longitudinal tracking. The module also simulates weight synchronization between generation batches to profile the full GRPO training loop overhead.
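As a rough sketch of what the two utilization metrics mean (these are the standard definitions, with illustrative function and variable names that are not the module's own API; the module itself derives FLOP and byte counts from its model and hardware dimensions):

# Standard MFU/MBU definitions, sketched for illustration only.
def mfu(tokens_per_sec: float, flops_per_token: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by hardware peak FLOP/s."""
    return tokens_per_sec * flops_per_token / peak_flops

def mbu(tokens_per_sec: float, bytes_per_token: float, peak_bandwidth: float) -> float:
    """Memory Bandwidth Utilization: achieved bytes/s divided by peak bytes/s."""
    return tokens_per_sec * bytes_per_token / peak_bandwidth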
Usage
Use this module for performance profiling of the vLLM generation pipeline. It helps identify bottlenecks between generation and weight sync, tune batch sizes, and validate MFU calculations for capacity planning of large-scale GRPO training runs.
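For instance, the generation-versus-sync split can be read straight off the per-batch timings. A minimal sketch, assuming each result dict exposes generation_time and weight_sync_time keys (the key names are hypothetical, not the module's documented schema):

# Hypothetical per-batch timings; run_benchmark() returns the real dicts.
results = [
    {"generation_time": 41.8, "weight_sync_time": 2.3},
    {"generation_time": 40.2, "weight_sync_time": 2.5},
]
gen_total = sum(r["generation_time"] for r in results)
sync_total = sum(r["weight_sync_time"] for r in results)
overall = gen_total + sync_total
print(f"generation: {gen_total / overall:.1%}, weight sync: {sync_total / overall:.1%}")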
Code Reference
Source Location
- Repository: Allenai_Open_instruct
- File: open_instruct/benchmark_generators.py
- Lines: 1-735
Signature
def save_completion_lengths(batch_results: list[dict], timestamp: int, batch_idx: int) -> None:
    """Save completion lengths to CSV file."""

def save_config(args, tokenizer_config, model_config, streaming_config, timestamp: int) -> None:
    """Save benchmark configuration to JSON file."""

def save_benchmark_results_to_csv(results: list[dict[str, Any]], total_time: float,
                                  streaming_config, model_config) -> None:
    """Save results to CSV with git commit, batch sizes, and timing breakdowns."""

def free_all_gpu_memory(device: int | str = 0) -> None:
    """Aggressively clear PyTorch GPU caches before starting vLLM."""

def setup_dataset(args, streaming_config, tokenizer_config) -> datasets.Dataset:
    """Load dataset using same pipeline as grpo_fast.py."""

def setup_vllm_engines(...) -> tuple[list, ray_queue.Queue, ray_queue.Queue, ActorHandle]:
    """Create Ray actors with vLLM engines, reward config, and queues."""

def simulate_weight_sync(actor_manager, vllm_engines, args) -> float:
    """Simulate weight synchronization between batches, return elapsed time."""

def run_benchmark(dataset, vllm_engines, param_prompt_Q, inference_results_Q,
                  actor_manager, ...) -> list[dict[str, Any]]:
    """Run the full benchmark: warmup, stream batches, collect results."""

def aggregate_results(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-batch results into summary statistics."""

def print_summary(results, total_time, streaming_config, model_config, model_dims) -> None:
    """Display summary with percentiles and utilization metrics."""

def main() -> None:
    """CLI entry point for the benchmark."""
Import
# CLI script, run directly:
# python -m open_instruct.benchmark_generators --config_path <config.yaml>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config_path | str | Yes | Path to GRPO experiment config YAML |
| num_batches | int | No | Number of batches to benchmark (default: 10) |
| batch_size | int | No | Prompts per batch |
| dataset | datasets.Dataset | Auto | Loaded from config using GRPO pipeline |
Outputs
| Name | Type | Description |
|---|---|---|
| CSV results | File | Per-batch and aggregate performance metrics |
| JSON config | File | Benchmark configuration snapshot |
| Console summary | stdout | Percentile statistics, MFU, MBU, tokens/sec |
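Since each CSV row carries a git commit hash, runs can be compared longitudinally across commits. A sketch of one way to do that with pandas, assuming hypothetical file and column names (git_commit, tokens_per_second); check the emitted CSV for the real schema:

import pandas as pd

# File name and column names here are assumptions for illustration.
df = pd.read_csv("benchmark_results.csv")
per_commit = df.groupby("git_commit")["tokens_per_second"].agg(["mean", "std"])
print(per_commit.sort_values("mean", ascending=False))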
Usage Examples
Running the Benchmark
# Run vLLM generator benchmark with a GRPO config
python -m open_instruct.benchmark_generators \
--config_path configs/train_configs/grpo/default.yaml \
--num_batches 20