Workflow:Huggingface Open r1 Model Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, Reasoning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
End-to-end process for evaluating trained reasoning models on standard benchmarks (AIME 2024, MATH-500, GPQA Diamond, LiveCodeBench), using LightEval with a vLLM backend to estimate pass@1 accuracy.
Description
This workflow evaluates trained models against the same benchmarks used to assess DeepSeek-R1, enabling direct comparison. It uses LightEval integrated with vLLM for high-throughput inference with sampling-based evaluation. Multiple responses are generated per query (4-64 depending on benchmark) to estimate pass@1 accuracy, matching DeepSeek's evaluation methodology. The workflow supports single-GPU evaluation, data-parallel scaling across GPUs, and tensor-parallel sharding for large models.
Goal: Benchmark scores for the trained model on standard reasoning benchmarks, comparable to DeepSeek-R1 reported results.
Scope: From a trained model on the Hub to evaluation metrics on AIME 2024, MATH-500, GPQA Diamond, and LiveCodeBench.
Strategy: Uses LightEval with a vLLM backend for efficient evaluation, with configurable parallelism strategies and sampling parameters that match DeepSeek's methodology.
Usage
Execute this workflow after completing SFT distillation or GRPO training to measure model performance against established benchmarks. It can be run standalone via CLI commands, triggered automatically via training callbacks (per-checkpoint evaluation), or launched as Slurm jobs. It is essential for comparing trained models against DeepSeek-R1 baselines and for tracking progress during training.
Execution Steps
Step 1: Benchmark_Selection
Choose which benchmarks to evaluate from the supported set: AIME 2024 (math competition), MATH-500 (general math), GPQA Diamond (graduate-level science), and LiveCodeBench (code generation). Each benchmark has a recommended number of responses per query for reliable pass@1 estimation: AIME uses 64 responses, MATH-500 uses 4, GPQA uses 8, and LiveCodeBench uses 16.
Key considerations:
- More responses per query reduce variance but increase compute cost
- AIME has only 30 problems, making high-sample evaluation critical for stability
- Custom benchmarks can be registered via the LightEval task registry
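The sampling budgets above can be kept in a small lookup table, and pass@1 can be estimated as the per-problem fraction of correct samples averaged over all problems. The following is a minimal sketch of that idea, not the LightEval implementation; the names `RESPONSES_PER_QUERY` and `estimate_pass_at_1`, the dictionary keys, and the toy correctness flags are all hypothetical.

```python
from statistics import mean

# Recommended number of sampled responses per query, as listed above.
# The keys are illustrative labels, not official LightEval task names.
RESPONSES_PER_QUERY = {
    "aime_2024": 64,
    "math_500": 4,
    "gpqa_diamond": 8,
    "livecodebench": 16,
}


def estimate_pass_at_1(correct_flags):
    """Estimate pass@1 from per-problem lists of 0/1 correctness flags.

    Each inner list holds one flag per sampled response. The fraction of
    correct samples is computed per problem, then averaged over problems.
    """
    if not correct_flags:
        raise ValueError("need at least one problem")
    per_problem = [mean(flags) for flags in correct_flags]
    return mean(per_problem)


# Toy example: 2 problems with 4 samples each (made-up data).
toy_flags = [[1, 1, 0, 1], [0, 0, 1, 0]]
print(estimate_pass_at_1(toy_flags))  # 0.5
```

With only 30 AIME problems, a single sample per problem moves the score in steps of about 3.3 points, which is why averaging over many samples per problem matters most there.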
Step 2: Parallelism_Configuration
Configure the evaluation parallelism strategy based on model size and available hardware. Small models (1.5B-7B) can run on a single GPU. Data parallelism splits the evaluation across multiple GPUs for throughput. Tensor parallelism shards the model across GPUs for large models (32B-70B) that do not fit in a single GPU's memory.
Key considerations:
- Tensor parallelism requires VLLM_WORKER_MULTIPROC_METHOD=spawn
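- Data parallel evaluation scales throughput roughly linearly with GPU count, since each GPU holds a full model replica and processes its own share of the prompts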
- Models with 30B+ parameters automatically use tensor parallelism in the evaluation framework
- GPU memory utilization is typically set to 0.8 to avoid OOM
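The selection logic described above can be sketched as a small helper. This is an illustration under stated assumptions rather than the evaluation framework's own code: `choose_parallelism` is a hypothetical name, and the keys `data_parallel_size` and `tensor_parallel_size` are assumed to be the model-argument names accepted by the installed LightEval/vLLM versions.

```python
# Parameter-count threshold above which tensor parallelism is used (30B+).
TENSOR_PARALLEL_THRESHOLD = 30_000_000_000


def choose_parallelism(num_params, num_gpus, gpu_memory_utilization=0.8):
    """Return a hypothetical parallelism plan for one evaluation run."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")

    plan = {"gpu_memory_utilization": gpu_memory_utilization, "env": {}}
    if num_params >= TENSOR_PARALLEL_THRESHOLD:
        # Large model: shard the weights across all available GPUs.
        plan["strategy"] = "tensor_parallel"
        plan["tensor_parallel_size"] = num_gpus  # assumed key name
        plan["env"]["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    elif num_gpus > 1:
        # Small model, several GPUs: one full replica per GPU.
        plan["strategy"] = "data_parallel"
        plan["data_parallel_size"] = num_gpus  # assumed key name
    else:
        plan["strategy"] = "single_gpu"
    return plan


print(choose_parallelism(num_params=7_000_000_000, num_gpus=8)["strategy"])
print(choose_parallelism(num_params=32_000_000_000, num_gpus=8)["strategy"])
```

The sketch only sets the `spawn` environment variable for tensor parallelism because that is the requirement stated above; setting it in other modes may also be appropriate depending on the vLLM version.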
Step 3: Model_and_Generation_Setup
Configure the model arguments for LightEval, including model name/path, dtype (bfloat16), maximum model length (32768), and generation parameters (temperature 0.6, top-p 0.95, max_new_tokens 32768). The model's chat template is applied so that prompts are formatted the way the model saw them during training. These parameters match DeepSeek's evaluation configuration for comparable results.
Key considerations:
- max_model_length of 32768 accommodates long reasoning traces
- Temperature 0.6 and top-p 0.95 balance diversity and quality in sampling
- Chat template must match the model's training format for correct evaluation
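One way to keep these settings in a single place is to assemble the model-argument string programmatically. The sketch below assumes LightEval accepts a comma-separated `key=value` string with a nested `generation_parameters={...}` entry and a `model_name` key; these names and the helper `build_model_args` are assumptions to verify against the installed LightEval version, and `my-org/my-model` is a placeholder.

```python
def build_model_args(
    model,
    dtype="bfloat16",
    max_model_length=32768,
    gpu_memory_utilization=0.8,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    extra=None,
):
    """Assemble a comma-separated model-argument string (assumed format)."""
    parts = [
        f"model_name={model}",  # assumed key; some versions use another name
        f"dtype={dtype}",
        f"max_model_length={max_model_length}",
        f"gpu_memory_utilization={gpu_memory_utilization}",
    ]
    # Optional extras, e.g. a parallelism setting chosen in Step 2.
    for key, value in (extra or {}).items():
        parts.append(f"{key}={value}")
    generation = (
        f"{{max_new_tokens:{max_new_tokens},"
        f"temperature:{temperature},top_p:{top_p}}}"
    )
    parts.append(f"generation_parameters={generation}")
    return ",".join(parts)


print(build_model_args("my-org/my-model"))
```

Keeping the defaults in the function signature makes it easy to see at a glance which values deviate from the DeepSeek-matching configuration.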
Step 4: Evaluation_Execution
Run LightEval with the configured model, task, and parallelism settings. The evaluation generates responses for each benchmark query, scores them against ground truth, and computes pass@1 accuracy. Results are saved to the output directory with detailed per-example scores. For Slurm clusters, evaluation jobs are submitted via the benchmark runner script.
Key considerations:
- Evaluation can be launched via CLI (lighteval vllm), Makefile (make evaluate), or Python script
- Slurm-based evaluation is integrated with the training callback system
- Per-checkpoint evaluation enables tracking performance throughout training
- Output directory structure follows data/evals/{model_name}/
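A CLI launch can be sketched by building the argument list in Python before handing it to a process runner. The block below only constructs and prints the command; it does not run LightEval. The flags `--use-chat-template` and `--output-dir`, the task string format `lighteval|aime24|0|0`, and the helper name `build_eval_command` are assumptions that differ between LightEval versions, so check `lighteval vllm --help` before relying on them.

```python
import shlex
from pathlib import PurePosixPath


def build_eval_command(model, model_args, task, output_root="data/evals"):
    """Build a `lighteval vllm` argument list (flags are assumptions)."""
    # Output layout follows data/evals/{model_name}/ as described above.
    output_dir = PurePosixPath(output_root) / model
    return [
        "lighteval",
        "vllm",
        model_args,
        task,
        "--use-chat-template",  # assumed flag; not present in every version
        "--output-dir",         # assumed flag
        str(output_dir),
    ]


cmd = build_eval_command(
    model="my-org/my-model",  # placeholder model id
    model_args="model_name=my-org/my-model,dtype=bfloat16",
    task="lighteval|aime24|0|0",  # assumed task string format
)
# Print a shell-safe rendering; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```

Building the command as a list rather than a string avoids quoting problems with the `|` and `{}` characters that appear in task and model-argument strings.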
Step 5: Results_Analysis
Collect and compare evaluation results against DeepSeek-R1 baselines. The evaluation produces detailed results that can be uploaded to the HuggingFace Hub for public comparison. Because evaluation is sampling-based, small differences from the reported results are expected, typically within 1-3 standard deviations.
Key considerations:
- Upload evaluation details to the Hub using the upload_details script
- Results variance is normal for sampling-based evaluation
- Multiple evaluation runs can be aggregated for more stable estimates
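Aggregating several runs and checking how far the mean sits from a reported baseline can be sketched as follows. This uses the sample standard deviation across runs as the yardstick, which is one reasonable reading of the "1-3 standard deviations" guideline rather than a documented procedure; the function names and all scores are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


def deviations_from_reference(scores, reference):
    """Distance between the run mean and a reference, in std-dev units."""
    avg, spread = summarize_runs(scores)
    if spread == 0:
        return 0.0 if avg == reference else float("inf")
    return abs(avg - reference) / spread


# Made-up pass@1 scores (in percent) from three runs of one benchmark.
run_scores = [70.0, 72.0, 74.0]
reported = 73.0  # made-up reference value, not a published number

avg, spread = summarize_runs(run_scores)
gap = deviations_from_reference(run_scores, reported)
print(f"mean={avg:.1f} std={spread:.1f} gap={gap:.2f} std devs")
```

A gap of up to about 3 under this measure would be consistent with ordinary sampling noise; a much larger gap suggests a configuration mismatch worth investigating.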