Implementation:Huggingface Open r1 Run Benchmark Jobs

Metadata

Field	Value
Sources	Repo: huggingface/open-r1
Domains	NLP, Evaluation
Last Updated	2026-02-08 00:00 GMT

Principle:Huggingface_Open_r1_Benchmark_Evaluation

Overview

Concrete tool, provided by Open-R1, for submitting LightEval benchmark evaluation jobs to a Slurm cluster with automatic GPU allocation.

Description

The evaluation system consists of three functions:

register_lighteval_task — Populates the LIGHTEVAL_TASKS registry with benchmark configurations. Each benchmark entry specifies the eval suite, task name, task list, and number of few-shot examples.
run_lighteval_job — Constructs and submits a Slurm sbatch command for a single benchmark. It determines GPU count via get_gpu_count_for_vllm and get_param_count_from_repo_id, builds the command-line arguments for LightEval, and submits the job.
run_benchmark_jobs — Iterates over all requested benchmarks (from training_args.benchmarks) and submits each by calling run_lighteval_job.
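The registry pattern behind register_lighteval_task can be sketched as follows. This is a minimal illustration, not the repository's code: the `suite|task|num_fewshot|0` entry format is an assumption about how LightEval task strings are written, and the `demo` benchmark with its `demo_a`/`demo_b` tasks is hypothetical.

```python
from typing import Dict

# Stand-in for the LIGHTEVAL_TASKS registry described above.
LIGHTEVAL_TASKS: Dict[str, str] = {}


def register_lighteval_task(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    """Store a benchmark under `task_name` as a comma-separated task string."""
    # Assumed entry format, one per task: "<suite>|<task>|<num_fewshot>|0".
    formatted = ",".join(
        f"{eval_suite}|{task.strip()}|{num_fewshot}|0"
        for task in task_list.split(",")
    )
    configs[task_name] = formatted


# One single-task benchmark and one hypothetical two-task benchmark.
register_lighteval_task(LIGHTEVAL_TASKS, "lighteval", "math_500", "math_500", 0)
register_lighteval_task(LIGHTEVAL_TASKS, "extended", "demo", "demo_a,demo_b", 0)
```

Keeping the registry as a plain dictionary means a benchmark name can be validated with a simple membership test before any job is submitted.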

Pre-registered benchmarks include:

Benchmark	Suite	Description
`math_500`	lighteval	MATH-500 mathematical reasoning
`aime24`	lighteval	AIME 2024 competition problems
`aime25`	lighteval	AIME 2025 competition problems
`gpqa`	lighteval	Graduate-level science questions
`lcb`	extended	LiveCodeBench coding problems
`lcb_v4`	extended	LiveCodeBench v4 coding problems

GPU allocation relies on get_gpu_count_for_vllm and get_param_count_from_repo_id: the model's attention head configuration and parameter count together determine how many GPUs are requested and whether tensor parallelism is needed.
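One plausible way to derive a GPU count from the attention head configuration is to step down from the maximum until the head count divides evenly across GPUs. The sketch below is an assumption-laden illustration: the function name `pick_gpu_count`, the `max_gpus` default of 8, and the extra divisibility check against 64 are all hypothetical here, and the real get_gpu_count_for_vllm presumably reads the head count from the model's configuration rather than taking it as an argument.

```python
def pick_gpu_count(
    num_attention_heads: int,
    max_gpus: int = 8,
    divisor_target: int = 64,
) -> int:
    """Return the largest GPU count <= max_gpus that evenly divides both
    the attention head count and `divisor_target` (assumed constraint)."""
    num_gpus = max_gpus
    # Step down until both divisibility conditions hold; 1 always qualifies.
    while num_gpus > 1 and (
        num_attention_heads % num_gpus != 0 or divisor_target % num_gpus != 0
    ):
        num_gpus -= 1
    return num_gpus


for heads in (32, 28, 14, 7):
    print(heads, "heads ->", pick_gpu_count(heads), "GPU(s)")
```

Tensor parallelism splits attention heads across devices, which is why a head count that does not divide evenly forces a smaller GPU count.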

Usage

Import these functions when you need to evaluate a trained model on standard benchmarks via Slurm. They are typically invoked either directly after training completes or from a training callback for per-checkpoint evaluation.

Code Reference

Source Location

Property	Value
Repository	open-r1
File	`src/open_r1/utils/evaluation.py`
Lines	L27-118

Signature

from typing import Dict, Union

def run_benchmark_jobs(
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def run_lighteval_job(
    benchmark: str,
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def register_lighteval_task(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    ...

Import

from open_r1.utils.evaluation import run_benchmark_jobs, run_lighteval_job

I/O Contract

Inputs

Parameter	Type	Required	Description
`training_args`	`SFTConfig` or `GRPOConfig`	Yes	Training configuration containing `benchmarks` (list of benchmark names to run), `hub_model_id` (Hub model identifier), `hub_model_revision` (model revision/checkpoint), and `system_prompt` (optional system prompt for evaluation).
`model_args`	`ModelConfig`	Yes	Model configuration containing `model_name_or_path` (used for GPU count computation) and `trust_remote_code` (whether to trust remote model code).

Outputs

Output	Type	Description
Side effect	Slurm jobs	One Slurm job is submitted for each benchmark in `training_args.benchmarks`.
Side effect	vLLM evaluation	Each Slurm job runs LightEval with vLLM-accelerated inference on the specified benchmark.
Side effect	Hub upload	Evaluation results are uploaded to the Hugging Face Hub under the model's repository.
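The one-job-per-benchmark contract can be sketched as a loop that builds an sbatch command for each requested benchmark. This is a hedged illustration only: `DemoTrainingArgs`, `DemoModelArgs`, `build_sbatch_command`, `submit_benchmarks`, the `evaluate.slurm` script name, the job-name pattern, and the argument order are all hypothetical stand-ins, and the `submit` callable is injected so the example runs without a Slurm cluster.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DemoTrainingArgs:
    """Hypothetical stand-in for SFTConfig / GRPOConfig."""
    benchmarks: List[str] = field(default_factory=list)
    hub_model_id: str = "my-org/my-model"
    hub_model_revision: str = "main"
    system_prompt: Optional[str] = None


@dataclass
class DemoModelArgs:
    """Hypothetical stand-in for ModelConfig."""
    model_name_or_path: str = "my-org/my-model"
    trust_remote_code: bool = False


def build_sbatch_command(
    benchmark: str,
    task_list: str,
    training_args: DemoTrainingArgs,
    model_args: DemoModelArgs,
    num_gpus: int,
) -> List[str]:
    """Assemble (but do not run) an sbatch invocation for one benchmark."""
    model_short = training_args.hub_model_id.split("/")[-1]
    cmd = [
        "sbatch",
        f"--gres=gpu:{num_gpus}",
        f"--job-name=eval_{benchmark}_{model_short}_{training_args.hub_model_revision}",
        "evaluate.slurm",  # hypothetical batch script name
        benchmark,
        task_list,
        training_args.hub_model_id,
        training_args.hub_model_revision,
        str(model_args.trust_remote_code),
    ]
    if training_args.system_prompt is not None:
        cmd.append(training_args.system_prompt)
    return cmd


def submit_benchmarks(
    training_args: DemoTrainingArgs,
    model_args: DemoModelArgs,
    registry: Dict[str, str],
    submit: Callable[[List[str]], None],
    num_gpus: int = 8,
) -> List[List[str]]:
    """Submit one job per requested benchmark; reject unknown names."""
    submitted = []
    for benchmark in training_args.benchmarks:
        if benchmark not in registry:
            raise ValueError(f"Unknown benchmark {benchmark!r}")
        cmd = build_sbatch_command(
            benchmark, registry[benchmark], training_args, model_args, num_gpus
        )
        submit(cmd)
        submitted.append(cmd)
    return submitted


# Illustrative registry entries; the task-string format is an assumption.
registry = {"math_500": "lighteval|math_500|0|0", "aime24": "lighteval|aime24|0|0"}
args = DemoTrainingArgs(benchmarks=["math_500", "aime24"])
commands = submit_benchmarks(
    args, DemoModelArgs(), registry, submit=lambda cmd: None, num_gpus=4
)
for cmd in commands:
    print(" ".join(cmd))
```

On a real cluster, `submit` could be replaced with `lambda cmd: subprocess.run(cmd, check=True)`; injecting it keeps the command construction testable in isolation.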

Usage Example

The following example calls run_benchmark_jobs from a training callback to evaluate each saved checkpoint:

from transformers import TrainerCallback
from open_r1.utils.evaluation import run_benchmark_jobs


class BenchmarkEvalCallback(TrainerCallback):
    """Callback that submits benchmark evaluation jobs after each checkpoint save."""

    def __init__(self, training_args, model_args):
        self.training_args = training_args
        self.model_args = model_args

    def on_save(self, args, state, control, **kwargs):
        # Update the revision to point to the current checkpoint
        self.training_args.hub_model_revision = f"checkpoint-{state.global_step}"

        # Submit evaluation jobs for all configured benchmarks
        run_benchmark_jobs(
            training_args=self.training_args,
            model_args=self.model_args,
        )
