Workflow:Huggingface Open r1 Model Evaluation

Knowledge Sources	Open R1, LightEval Documentation, vLLM Documentation
Domains	LLMs, Evaluation, Reasoning
Last Updated	2026-02-08 00:00 GMT

Overview

End-to-end process for evaluating trained reasoning models on standard benchmarks (AIME 2024, MATH-500, GPQA Diamond, LiveCodeBench), using LightEval with a vLLM backend to estimate pass@1 accuracy.

Description

This workflow evaluates trained models against the same benchmarks used to assess DeepSeek-R1, enabling direct comparison. It uses LightEval integrated with vLLM for high-throughput, sampling-based evaluation. Multiple responses are generated per query (4 to 64, depending on the benchmark) to estimate pass@1 accuracy, matching DeepSeek's evaluation methodology. The workflow supports single-GPU evaluation, data-parallel scaling across GPUs, and tensor-parallel sharding for large models.
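The sampling-based pass@1 estimate described above can be sketched as follows. This is a minimal illustration, not LightEval's implementation: it assumes each response has already been scored as correct (1) or incorrect (0), and the function name `pass_at_1` is hypothetical.

```python
from statistics import mean


def pass_at_1(correct_per_problem):
    """Estimate pass@1 from per-problem lists of 0/1 correctness flags.

    Each inner list holds one flag per sampled response. The per-problem
    pass rate is the fraction of correct samples, and the benchmark-level
    pass@1 is the mean of those rates over all problems.
    """
    per_problem_rates = [mean(flags) for flags in correct_per_problem]
    return mean(per_problem_rates)


# Toy input (made-up flags): 3 problems, 4 sampled responses each.
toy_flags = [
    [1, 1, 0, 1],  # 3 of 4 correct
    [0, 0, 0, 0],  # 0 of 4 correct
    [1, 1, 1, 1],  # 4 of 4 correct
]
print(f"pass@1 estimate: {pass_at_1(toy_flags):.3f}")
```

Averaging over several samples per problem is what makes the estimate less sensitive to any single lucky or unlucky generation.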

Goal: Benchmark scores for the trained model on standard reasoning benchmarks, comparable to the results reported for DeepSeek-R1.

Scope: From a trained model on the Hub to evaluation metrics on AIME 2024, MATH-500, GPQA Diamond, and LiveCodeBench.

Strategy: Uses LightEval with a vLLM backend for efficient evaluation, with configurable parallelism strategies and sampling parameters that match DeepSeek's methodology.

Usage

Execute this workflow after completing SFT distillation or GRPO training to measure model performance against established benchmarks. It can be run standalone via CLI commands, triggered automatically by training callbacks (per-checkpoint evaluation), or launched as Slurm jobs. It is essential for comparing trained models against DeepSeek-R1 baselines and for tracking training progress.

Execution Steps

Step 1: Benchmark_Selection

Choose which benchmarks to evaluate from the supported set: AIME 2024 (math competition), MATH-500 (general math), GPQA Diamond (graduate-level science), and LiveCodeBench (code generation). Each benchmark has a recommended number of responses per query for reliable pass@1 estimation: AIME 2024 uses 64 responses, MATH-500 uses 4, GPQA Diamond uses 8, and LiveCodeBench uses 16.

Key considerations:

More responses per query reduce variance but increase compute cost
AIME has only 30 problems, making high-sample evaluation critical for stability
Custom benchmarks can be registered via the LightEval task registry
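To make the compute trade-off concrete, the sketch below tabulates the responses per query listed in this step and multiplies by the problem count. The dictionary and function names are hypothetical; the problem counts used are the 30 AIME problems mentioned above and the 500 problems implied by the MATH-500 name.

```python
# Responses sampled per query, as listed in this step.
SAMPLES_PER_QUERY = {
    "AIME 2024": 64,
    "MATH-500": 4,
    "GPQA Diamond": 8,
    "LiveCodeBench": 16,
}


def total_generations(benchmark, num_problems):
    """Return how many responses must be generated for one full evaluation."""
    return SAMPLES_PER_QUERY[benchmark] * num_problems


print(total_generations("AIME 2024", 30))   # 64 * 30 = 1920
print(total_generations("MATH-500", 500))   # 4 * 500 = 2000
```

Despite having far fewer problems, AIME 2024 at 64 samples per query needs roughly as many generations as MATH-500 at 4 samples per query.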

Step 2: Parallelism_Configuration

Configure the evaluation parallelism strategy based on model size and available hardware. Small models (1.5B-7B) can run on a single GPU. Data parallelism splits the evaluation across multiple GPUs for throughput. Tensor parallelism shards the model across GPUs for large models (32B-70B) that do not fit in a single GPU's memory.

Key considerations:

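Tensor parallelism requires setting the environment variable VLLM_WORKER_MULTIPROC_METHOD=spawn
Data-parallel evaluation scales throughput roughly linearly with the number of GPUs, since each model replica handles its own share of the prompts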
Models with 30B+ parameters automatically use tensor parallelism in the evaluation framework
GPU memory utilization is typically set to 0.8 to avoid OOM
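One way to encode these choices is sketched below. It is an assumption-laden illustration rather than the evaluation framework's own logic: the helper `parallelism_args` is hypothetical, and the key names `tensor_parallel_size`, `data_parallel_size`, and `gpu_memory_utilization` are assumed to follow vLLM-style naming. Only the 30B threshold, the 0.8 memory fraction, and the environment variable come from the text above.

```python
import os

# "30B+" threshold from the considerations above.
TENSOR_PARALLEL_THRESHOLD = 30_000_000_000


def parallelism_args(num_params, num_gpus, gpu_memory_utilization=0.8):
    """Pick a parallelism setting from model size and GPU count (hypothetical helper)."""
    args = {"gpu_memory_utilization": gpu_memory_utilization}
    if num_params >= TENSOR_PARALLEL_THRESHOLD:
        # Large model: shard the weights across all GPUs.
        args["tensor_parallel_size"] = num_gpus
    elif num_gpus > 1:
        # Small model, several GPUs: replicate the model and split the prompts.
        args["data_parallel_size"] = num_gpus
    return args


# Set the worker start method before launching vLLM, unless already set.
os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "spawn")

print(parallelism_args(7_000_000_000, num_gpus=1))
print(parallelism_args(7_000_000_000, num_gpus=8))
print(parallelism_args(32_000_000_000, num_gpus=8))
```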

Step 3: Model_and_Generation_Setup

Configure the model arguments for LightEval, including the model name or path, dtype (bfloat16), maximum model length (32768 tokens), and generation parameters (temperature 0.6, top-p 0.95, max_new_tokens 32768). The model's chat template is applied so that prompts are formatted correctly. These parameters match DeepSeek's evaluation configuration for comparable results.

Key considerations:

A max_model_length of 32768 tokens accommodates long reasoning traces
Temperature 0.6 and top-p 0.95 balance diversity and quality in sampling
Chat template must match the model's training format for correct evaluation
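The settings in this step are typically passed as a single comma-separated argument string. The sketch below assembles one from the values named above. The helper `build_model_args` and the model ID are hypothetical, and the exact key names (`model_name`, `generation_parameters`) are assumptions that may differ between LightEval versions, so check them against the installed release.

```python
def build_model_args(model, extra=None):
    """Assemble a comma-separated model-args string (hypothetical helper)."""
    # Values taken from this step; key names are assumed.
    fields = {
        "model_name": model,
        "dtype": "bfloat16",
        "max_model_length": 32768,
    }
    fields.update(extra or {})
    generation = {"max_new_tokens": 32768, "temperature": 0.6, "top_p": 0.95}

    body = ",".join(f"{key}={value}" for key, value in fields.items())
    gen = ",".join(f"{key}:{value}" for key, value in generation.items())
    # Triple braces emit literal '{' and '}' around the generation settings.
    return f"{body},generation_parameters={{{gen}}}"


# "my-org/my-model" is a placeholder model ID.
print(build_model_args("my-org/my-model", {"gpu_memory_utilization": 0.8}))
```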

Step 4: Evaluation_Execution

Run LightEval with the configured model, task, and parallelism settings. The evaluation generates responses for each benchmark query, scores them against ground truth, and computes pass@1 accuracy. Results are saved to the output directory with detailed per-example scores. For Slurm clusters, evaluation jobs are submitted via the benchmark runner script.

Key considerations:

Evaluation can be launched via CLI (lighteval vllm), Makefile (make evaluate), or Python script
Slurm-based evaluation is integrated with the training callback system
Per-checkpoint evaluation enables tracking performance throughout training
Output directory structure follows data/evals/{model_name}/
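A CLI launch can be sketched as an argument list built in Python. Treat every specific here as an assumption: the helper `build_eval_command` is hypothetical, the flags `--use-chat-template` and `--output-dir` and the `suite|task|few-shot|truncation` task string follow one LightEval version and may have changed, and the model ID is a placeholder. Only `lighteval vllm` and the `data/evals/{model_name}/` layout come from the text above.

```python
import shlex


def build_eval_command(model_args, task, output_dir):
    """Build the argv list for one evaluation run (hypothetical helper)."""
    return [
        "lighteval", "vllm",
        model_args,
        task,
        "--use-chat-template",       # assumed flag name
        "--output-dir", output_dir,  # assumed flag name
    ]


model = "my-org/my-model"  # placeholder model ID
cmd = build_eval_command(
    model_args=f"model_name={model},dtype=bfloat16",  # shortened for the example
    task="lighteval|aime24|0|0",                      # assumed task spec format
    output_dir=f"data/evals/{model}",
)

# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```

Building the command as a list avoids quoting mistakes with characters such as `|` and `{`; the list can be handed to `subprocess.run(cmd, check=True)` once the flags have been confirmed against the installed LightEval version.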

Step 5: Results_Analysis

Collect the evaluation results and compare them against DeepSeek-R1 baselines. The evaluation produces detailed results that can be uploaded to the Hugging Face Hub for public comparison. Small differences from reported results are expected due to sampling variance and typically fall within 1-3 standard deviations.

Key considerations:

Upload evaluation details to the Hub using the upload_details script
Run-to-run variance in results is normal for sampling-based evaluation
Multiple evaluation runs can be aggregated for more stable estimates
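Aggregating repeated runs and checking them against a reference score can be sketched as below. The function names are hypothetical, and the scores in the usage example are made-up placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return the mean and sample standard deviation of repeated-run scores."""
    return mean(scores), stdev(scores)


def within_k_std(scores, reference, k=3.0):
    """Check whether a reference score lies within k standard deviations of the mean."""
    mu, sd = summarize_runs(scores)
    return abs(mu - reference) <= k * sd


# Made-up placeholder scores from four hypothetical runs of one benchmark.
run_scores = [50.0, 53.3, 46.7, 50.0]
mu, sd = summarize_runs(run_scores)
print(f"mean={mu:.1f}, std={sd:.2f}")
print(within_k_std(run_scores, reference=52.0))
```

At least two runs are needed, since the sample standard deviation is undefined for a single score.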
