Implementation:Huggingface Open r1 Run Benchmark Jobs
Metadata
| Field | Value |
|---|---|
| Sources | Repo: huggingface/open-r1 |
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-08 00:00 GMT |
Principle:Huggingface_Open_r1_Benchmark_Evaluation
Overview
Concrete tool, provided by Open-R1, for submitting LightEval benchmark evaluation jobs to a Slurm cluster with automatic GPU allocation.
Description
The evaluation system consists of three functions:
- register_lighteval_task: Populates the LIGHTEVAL_TASKS registry with benchmark configurations. Each benchmark entry specifies the eval suite, task name, task list, and number of few-shot examples.
- run_lighteval_job: Constructs and submits a Slurm sbatch command for a single benchmark. It determines the GPU count via get_gpu_count_for_vllm and get_param_count_from_repo_id, builds the command-line arguments for LightEval, and submits the job.
- run_benchmark_jobs: Iterates over all requested benchmarks (from training_args.benchmarks) and submits each by calling run_lighteval_job.
Pre-registered benchmarks include:
| Benchmark | Suite | Description |
|---|---|---|
| math_500 | lighteval | MATH-500 mathematical reasoning |
| aime24 | lighteval | AIME 2024 competition problems |
| aime25 | lighteval | AIME 2025 competition problems |
| gpqa | lighteval | Graduate-level science questions |
| lcb | extended | LiveCodeBench coding problems |
| lcb_v4 | extended | LiveCodeBench v4 coding problems |
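The registration step described above can be illustrated with a minimal sketch. It assumes that each registry value is a comma-separated string of LightEval task specifiers of the form `suite|task|num_fewshot|0`; that format, the name `register_task_sketch`, and the task lists passed in below are assumptions made for illustration, not a copy of the Open-R1 source.

```python
from typing import Dict


def register_task_sketch(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    """Store one benchmark entry in the registry (illustrative only).

    Assumption: a task specifier looks like "suite|task|num_fewshot|0".
    """
    # Expand a comma-separated task list into one specifier per task.
    specs = [f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")]
    configs[task_name] = ",".join(specs)


# Hypothetical registry mirroring two rows of the table above.
TASKS_SKETCH: Dict[str, str] = {}
register_task_sketch(TASKS_SKETCH, "lighteval", "math_500", "math_500")
register_task_sketch(TASKS_SKETCH, "lighteval", "aime24", "aime24")

# A benchmark name can map to several tasks (task names here are made up).
register_task_sketch(TASKS_SKETCH, "extended", "demo", "task_a,task_b", num_fewshot=2)

for name, spec in TASKS_SKETCH.items():
    print(f"{name}: {spec}")
```

Keeping the registry as a plain dictionary keyed by benchmark name means a list such as training_args.benchmarks can be resolved with simple lookups.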
GPU allocation uses get_gpu_count_for_vllm and get_param_count_from_repo_id to determine tensor parallelism needs based on the model's parameter count and attention head configuration.
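As a rough illustration of this kind of allocation logic, the sketch below picks the largest GPU count that evenly divides the number of attention heads, and uses a parameter-count threshold to decide whether tensor parallelism is needed. The function names, the additional divisibility check against 64, and the 30-billion-parameter threshold are assumptions made for this example; the actual helpers in Open-R1 read the model configuration from the Hub and may apply different rules.

```python
def pick_gpu_count_sketch(num_attention_heads: int, max_gpus: int = 8) -> int:
    """Return the largest GPU count <= max_gpus compatible with the head count.

    Assumption: a valid count must divide both the number of attention
    heads and 64, so that heads can be sharded evenly across GPUs.
    """
    num_gpus = max_gpus
    while num_gpus > 1 and (num_attention_heads % num_gpus != 0 or 64 % num_gpus != 0):
        num_gpus -= 1
    return num_gpus


def needs_tensor_parallel_sketch(param_count: int, threshold: int = 30_000_000_000) -> bool:
    """Decide whether a model is large enough to shard (threshold is hypothetical)."""
    return param_count >= threshold


# 32 heads divide evenly across 8 GPUs; 28 heads fall back to 4.
print(pick_gpu_count_sketch(32))
print(pick_gpu_count_sketch(28))
print(needs_tensor_parallel_sketch(7_000_000_000))
```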
Usage
Import when you need to evaluate a trained model on standard benchmarks via Slurm. Typically invoked either directly after training completes or from a training callback for per-checkpoint evaluation.
Code Reference
Source Location
| Property | Value |
|---|---|
| Repository | open-r1 |
| File | src/open_r1/utils/evaluation.py |
| Lines | L27-118 |
Signature
def run_benchmark_jobs(
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def run_lighteval_job(
    benchmark: str,
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def register_lighteval_task(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    ...
Import
from open_r1.utils.evaluation import run_benchmark_jobs, run_lighteval_job
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| training_args | SFTConfig or GRPOConfig | Yes | Training configuration containing benchmarks (list of benchmark names to run), hub_model_id (Hub model identifier), hub_model_revision (model revision/checkpoint), and system_prompt (optional system prompt for evaluation). |
| model_args | ModelConfig | Yes | Model configuration containing model_name_or_path (used for GPU count computation) and trust_remote_code (whether to trust remote model code). |
Outputs
| Output | Type | Description |
|---|---|---|
| Side effect | Slurm jobs | One Slurm job is submitted for each benchmark in training_args.benchmarks. |
| Side effect | vLLM evaluation | Each Slurm job runs LightEval with vLLM-accelerated inference on the specified benchmark. |
| Side effect | Hub upload | Evaluation results are uploaded to the HuggingFace Hub under the model's repository. |
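The first side effect, one submitted job per benchmark, can be pictured with a dry-run sketch that only builds the command lines and never calls Slurm. The options `--job-name` and `--gres` are standard sbatch options, but the script name `evaluate.slurm`, the positional argument order, and both function names are hypothetical placeholders rather than the layout Open-R1 actually uses.

```python
from typing import List


def build_sbatch_command_sketch(
    benchmark: str, model_id: str, revision: str, num_gpus: int
) -> List[str]:
    """Build (but do not run) an sbatch command for one benchmark."""
    model_name = model_id.split("/")[-1]
    return [
        "sbatch",
        f"--job-name=eval-{benchmark}-{model_name}-{revision}",
        f"--gres=gpu:{num_gpus}",
        "evaluate.slurm",  # hypothetical script name
        benchmark,
        model_id,
        revision,
    ]


def plan_benchmark_jobs_sketch(
    benchmarks: List[str], model_id: str, revision: str, num_gpus: int
) -> List[List[str]]:
    """Return one command per requested benchmark, mirroring the fan-out."""
    return [
        build_sbatch_command_sketch(name, model_id, revision, num_gpus)
        for name in benchmarks
    ]


# Hypothetical model identifier and revision.
commands = plan_benchmark_jobs_sketch(
    ["math_500", "aime24"], "my-org/my-model", "main", num_gpus=2
)
for command in commands:
    print(" ".join(command))
```

Building the argument list separately from submission makes the fan-out easy to inspect or unit test before any job reaches the cluster.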
Usage Example
The following shows calling run_benchmark_jobs from a training callback for per-checkpoint evaluation:
from transformers import TrainerCallback

from open_r1.utils.evaluation import run_benchmark_jobs


class BenchmarkEvalCallback(TrainerCallback):
    """Callback that submits benchmark evaluation jobs after each checkpoint save."""

    def __init__(self, training_args, model_args):
        self.training_args = training_args
        self.model_args = model_args

    def on_save(self, args, state, control, **kwargs):
        # Update the revision to point to the current checkpoint
        self.training_args.hub_model_revision = f"checkpoint-{state.global_step}"
        # Submit evaluation jobs for all configured benchmarks
        run_benchmark_jobs(
            training_args=self.training_args,
            model_args=self.model_args,
        )