
Implementation:Huggingface Open r1 Run Benchmark Jobs

From Leeroopedia


Metadata

Field | Value
Sources | Repo: huggingface/open-r1
Domains | NLP, Evaluation
Last Updated | 2026-02-08 00:00 GMT

Principle:Huggingface_Open_r1_Benchmark_Evaluation

Overview

A concrete tool, provided by Open-R1, for submitting LightEval benchmark evaluation jobs to a Slurm cluster with automatic GPU allocation.

Description

The evaluation system consists of three functions:

  1. register_lighteval_task — Populates the LIGHTEVAL_TASKS registry with benchmark configurations. Each benchmark entry specifies the eval suite, task name, task list, and number of few-shot examples.
  2. run_lighteval_job — Constructs and submits a Slurm sbatch command for a single benchmark. It determines GPU count via get_gpu_count_for_vllm and get_param_count_from_repo_id, builds the command-line arguments for LightEval, and submits the job.
  3. run_benchmark_jobs — Iterates over all requested benchmarks (from training_args.benchmarks) and submits each by calling run_lighteval_job.
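
How the three functions relate can be sketched as follows. This is a minimal, self-contained illustration rather than the Open-R1 source: the stand-in registry and its placeholder values, the `submit` parameter, and the decision to raise on unknown names are assumptions made so the example runs without Slurm.

```python
from types import SimpleNamespace
from typing import Callable, Dict, List

# Stand-in registry; in Open-R1 this role is played by LIGHTEVAL_TASKS.
# The values are placeholders, not real LightEval task strings.
TASKS: Dict[str, str] = {
    "math_500": "placeholder-task-spec",
    "aime24": "placeholder-task-spec",
}


def run_benchmark_jobs_sketch(
    training_args,
    model_args,
    submit: Callable[[str], None],
) -> List[str]:
    """Call `submit` once per requested benchmark and return the names submitted."""
    submitted = []
    for benchmark in training_args.benchmarks:
        if benchmark not in TASKS:
            # Assumption: unknown names are rejected rather than silently skipped.
            raise ValueError(f"Unknown benchmark {benchmark}")
        # `submit` stands in for run_lighteval_job(benchmark, training_args, model_args).
        submit(benchmark)
        submitted.append(benchmark)
    return submitted


args = SimpleNamespace(benchmarks=["math_500", "aime24"])
calls: List[str] = []
print(run_benchmark_jobs_sketch(args, model_args=None, submit=calls.append))
```

Passing the submission step in as a callable keeps the loop testable on a machine with no scheduler; on a cluster the callable would be the real job submitter.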

Pre-registered benchmarks include:

Benchmark | Suite | Description
math_500 | lighteval | MATH-500 mathematical reasoning
aime24 | lighteval | AIME 2024 competition problems
aime25 | lighteval | AIME 2025 competition problems
gpqa | lighteval | Graduate-level science questions
lcb | extended | LiveCodeBench coding problems
lcb_v4 | extended | LiveCodeBench v4 coding problems
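
A registry like the one listed above could be populated along these lines. This is a hedged sketch: the function name `register_task_sketch` is hypothetical, and the `suite|task|num_fewshot|0` string layout is an assumption about how LightEval task specifications are encoded, not something this page confirms.

```python
from typing import Dict


def register_task_sketch(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    """Store a task specification string in `configs` under `task_name`."""
    # Assumption: each comma-separated task is encoded as "suite|task|num_fewshot|0".
    specs = [f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")]
    configs[task_name] = ",".join(specs)


registry: Dict[str, str] = {}
register_task_sketch(registry, "lighteval", "math_500", "math_500")
register_task_sketch(registry, "lighteval", "aime24", "aime24")
print(sorted(registry))
```

Keeping the registry as a plain dictionary means adding a benchmark is a single call, and the job-submission code only needs to look up a name.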

GPU allocation uses get_gpu_count_for_vllm and get_param_count_from_repo_id to determine tensor parallelism needs based on the model's parameter count and attention head configuration.
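
One plausible reading of this allocation logic is sketched below. The function names, the divisibility check against the constant 64, and the 30-billion-parameter cutoff are all assumptions for illustration; the real helpers read the attention head count and parameter count from the model repository, which this sketch replaces with plain integer arguments.

```python
def pick_gpu_count(num_attention_heads: int, max_gpus: int = 8) -> int:
    """Return the largest GPU count <= max_gpus that passes both divisibility checks."""
    if max_gpus < 1:
        raise ValueError("max_gpus must be at least 1")
    num_gpus = max_gpus
    # Assumption: both the head count and the constant 64 must be divisible by
    # the GPU count. The loop always terminates because 1 divides everything.
    while num_attention_heads % num_gpus != 0 or 64 % num_gpus != 0:
        num_gpus -= 1
    return num_gpus


def needs_tensor_parallel(param_count: int, threshold: int = 30_000_000_000) -> bool:
    """Hypothetical size cutoff above which the model is sharded across GPUs."""
    return param_count >= threshold


print(pick_gpu_count(32))  # 8: 32 heads split evenly across 8 GPUs
print(pick_gpu_count(28))  # 4: 8, 7, 6 and 5 each fail one of the checks
print(needs_tensor_parallel(7_000_000_000))  # False
```

Stepping down from the maximum rather than up from 1 yields the largest usable GPU count, which matters when attention heads must be split evenly across devices.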

Usage

Import these functions when you need to evaluate a trained model on standard benchmarks via Slurm. They are typically invoked either directly after training completes or from a training callback for per-checkpoint evaluation.

Code Reference

Source Location

Property | Value
Repository | open-r1
File | src/open_r1/utils/evaluation.py
Lines | L27-118

Signature

from typing import Dict, Union


def run_benchmark_jobs(
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def run_lighteval_job(
    benchmark: str,
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    ...

def register_lighteval_task(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
) -> None:
    ...

Import

from open_r1.utils.evaluation import run_benchmark_jobs, run_lighteval_job

I/O Contract

Inputs

Parameter | Type | Required | Description
training_args | SFTConfig or GRPOConfig | Yes | Training configuration containing benchmarks (list of benchmark names to run), hub_model_id (Hub model identifier), hub_model_revision (model revision/checkpoint), and system_prompt (optional system prompt for evaluation).
model_args | ModelConfig | Yes | Model configuration containing model_name_or_path (used for GPU count computation) and trust_remote_code (whether to trust remote model code).
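
To show how these fields might end up in a submission command, here is a hedged sketch that builds an sbatch argument list without running it. The function name, the `evaluate.slurm` script path, the job-name pattern, the positional argument order, and the base64 encoding of the system prompt are all assumptions; only `sbatch`, `--gres`, and `--job-name` are standard Slurm.

```python
import base64
from typing import List, Optional


def build_sbatch_command_sketch(
    benchmark: str,
    task_list: str,
    hub_model_id: str,
    hub_model_revision: str,
    num_gpus: int,
    trust_remote_code: bool,
    system_prompt: Optional[str] = None,
    script: str = "evaluate.slurm",  # hypothetical script path
) -> List[str]:
    """Assemble an sbatch argument list for one benchmark; nothing is submitted."""
    # Hypothetical job-name pattern: benchmark, short model name, revision.
    job_name = f"eval_{benchmark}_{hub_model_id.split('/')[-1]}_{hub_model_revision}"
    cmd = [
        "sbatch",
        f"--gres=gpu:{num_gpus}",
        f"--job-name={job_name}",
        script,
        benchmark,
        task_list,
        hub_model_id,
        hub_model_revision,
        str(trust_remote_code),
    ]
    if system_prompt is not None:
        # Base64 keeps quotes and newlines in the prompt intact when passed through a shell script.
        cmd.append(base64.b64encode(system_prompt.encode()).decode())
    return cmd


cmd = build_sbatch_command_sketch(
    benchmark="math_500",
    task_list="placeholder-task-spec",
    hub_model_id="my-org/my-model",  # placeholder repository id
    hub_model_revision="main",
    num_gpus=2,
    trust_remote_code=False,
    system_prompt='Think "step by step".',
)
print(" ".join(cmd[:4]))
```

Returning the argument list instead of submitting it makes the command easy to inspect; on a cluster it could be handed to `subprocess.run(cmd, check=True)`.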

Outputs

Output | Type | Description
Side effect | Slurm jobs | One Slurm job is submitted for each benchmark in training_args.benchmarks.
Side effect | vLLM evaluation | Each Slurm job runs LightEval with vLLM-accelerated inference on the specified benchmark.
Side effect | Hub upload | Evaluation results are uploaded to the HuggingFace Hub under the model's repository.

Usage Example

The following shows calling run_benchmark_jobs from a training callback for per-checkpoint evaluation:

from transformers import TrainerCallback
from open_r1.utils.evaluation import run_benchmark_jobs


class BenchmarkEvalCallback(TrainerCallback):
    """Callback that submits benchmark evaluation jobs after each checkpoint save."""

    def __init__(self, training_args, model_args):
        self.training_args = training_args
        self.model_args = model_args

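    def on_save(self, args, state, control, **kwargs):
        # Submit from the main process only so each checkpoint is evaluated once
        if not state.is_world_process_zero:
            return

        # Point the revision at the current checkpoint; this assumes the checkpoint
        # has already been pushed to the Hub under the same revision name
        self.training_args.hub_model_revision = f"checkpoint-{state.global_step}"

        # Submit evaluation jobs for all configured benchmarks
        run_benchmark_jobs(
            training_args=self.training_args,
            model_args=self.model_args,
        )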
