Principle: Allenai Open-Instruct RLVR Data Loading
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
RLVR data loading is the process of preparing prompt-only datasets with associated ground truth labels for use in Reinforcement Learning from Verifiable Rewards (RLVR) training.
Description
Unlike supervised fine-tuning (SFT), where each example contains both a prompt and a target completion, RLVR training uses prompt-only datasets. Each example consists of:
- A prompt (the question or task) that the model will generate completions for.
- A ground truth label used by a verifier to score the model's generated responses.
- A verifier source identifier that determines which verification function to apply (e.g., math equivalence checking, code execution, instruction-following evaluation).
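The three components above can be illustrated with a minimal sketch. The field names and the toy verifier here are assumptions for illustration only; the actual column names and verification functions in open-instruct may differ.

```python
# A single RLVR training example: prompt only, plus verification metadata.
# (Field names are illustrative, not the library's actual schema.)
example = {
    # The prompt, in chat format, that the policy will generate completions for.
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    # Ground truth the verifier compares generated responses against.
    "ground_truth": "391",
    # Verifier source identifier: selects which verification function to apply.
    "verifier_source": "math",
}

def verify_math(response: str, ground_truth: str) -> float:
    """Toy stand-in for math equivalence checking: reward 1.0 if the
    ground truth string appears in the model's response, else 0.0."""
    return 1.0 if ground_truth in response else 0.0

reward = verify_math("17 * 23 = 391", example["ground_truth"])
```

Note that there is no target completion anywhere in the example: the reward pipeline scores whatever the model generates against `ground_truth`.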
The data loading pipeline must handle several concerns:
- Dataset mixing: Multiple datasets can be combined with configurable proportions (e.g., 50% math, 30% code, 20% instruction-following).
- Tokenization: Prompts must be tokenized using the model's chat template, optionally injecting system prompts and tool definitions.
- Length filtering: Prompts exceeding a maximum token length are filtered out to prevent wasted computation during generation.
- Caching: Processed datasets are cached (locally or on HuggingFace Hub) to avoid reprocessing on subsequent runs.
- Shuffling: The training dataset is shuffled with a deterministic seed for reproducibility.
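Two of these concerns, mixing and length filtering, can be sketched with plain Python lists. This is a simplified illustration under stated assumptions (the real pipeline operates on HuggingFace `Dataset` objects and supports caching and chat-template tokenization); the function names here are hypothetical.

```python
import random

def mix_datasets(datasets, proportions, total, seed=42):
    """Sample from several datasets according to target proportions.

    datasets: dict mapping name -> list of examples.
    proportions: dict mapping name -> fraction of the mix (should sum to 1).
    total: desired size of the mixed dataset.
    """
    rng = random.Random(seed)
    mixed = []
    for name, frac in proportions.items():
        n = int(total * frac)
        mixed.extend(rng.sample(datasets[name], n))
    rng.shuffle(mixed)  # deterministic shuffle for reproducibility
    return mixed

def filter_by_length(examples, tokenize, max_prompt_token_length):
    """Drop prompts whose tokenized length exceeds the cap, so no
    generation compute is wasted on over-long prompts."""
    return [
        ex for ex in examples
        if len(tokenize(ex["prompt"])) <= max_prompt_token_length
    ]
```

Because both the mixing sample and the final shuffle are driven by a fixed seed, two runs with the same configuration produce the same training order.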
Usage
RLVR data loading is the first step in any GRPO training run. It is called once during initialization to produce the training and evaluation datasets that feed into the generation and reward pipeline. This approach is appropriate whenever the reward signal can be computed by comparing model outputs against ground truth labels, rather than requiring a learned reward model.
Theoretical Basis
RLVR stands apart from RLHF (Reinforcement Learning from Human Feedback) in that it uses verifiable rewards rather than learned reward models. The theoretical advantage is that verifiable rewards provide:
- Resistance to reward hacking: The reward signal is derived from objective correctness checks (e.g., mathematical equivalence), so the model cannot exploit the weaknesses of a learned reward model.
- Scalability: No human annotation is required for reward computation.
- Stability: The reward function is stationary, unlike a learned reward model that can drift during training.
The dataset preparation pipeline follows this structure:
```
for each dataset in the mixer:
    1. load raw examples from HuggingFace
    2. apply the chat template to convert prompts to token sequences
    3. filter examples exceeding max_prompt_token_length
    4. attach ground_truth and verifier_source metadata
    5. cache the processed dataset

train_dataset = shuffle(concatenate(processed_datasets), seed)
eval_dataset = load_and_process(eval_mixer)
```
The separation of train and eval datasets with independent mixers allows evaluation on a different distribution from training (e.g., training on GSM8K but evaluating on MATH).
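The per-mixer pipeline above can be condensed into a runnable sketch. All helpers here (`load_and_process`, the `(load_fn, n)` mixer entries, `tokenize`) are hypothetical stand-ins for the real HuggingFace-backed machinery, and the caching step is omitted for brevity.

```python
import random

def load_and_process(mixer, tokenize, max_prompt_token_length, seed=1):
    """Sketch of the dataset preparation loop (hypothetical helpers).

    mixer: list of (load_fn, n_examples) pairs standing in for
    HuggingFace dataset names and per-dataset mixing counts.
    """
    processed = []
    for load_fn, n in mixer:
        raw = load_fn()[:n]  # 1. load raw examples (ground_truth and
                             #    verifier_source already attached per example)
        for ex in raw:
            # 2. tokenize the prompt (stands in for the chat template)
            ex["input_ids"] = tokenize(ex["prompt"])
        # 3. filter out prompts exceeding the length cap
        processed.extend(
            ex for ex in raw
            if len(ex["input_ids"]) <= max_prompt_token_length
        )
    # concatenate + deterministic shuffle for reproducibility
    random.Random(seed).shuffle(processed)
    return processed

# Train and eval use independent mixers, so they can draw from
# different distributions (hypothetical loaders shown).
def load_math():
    return [
        {"prompt": "solve one plus two", "ground_truth": "3",
         "verifier_source": "math"},
        {"prompt": "2+2", "ground_truth": "4", "verifier_source": "math"},
    ]

train_dataset = load_and_process(
    [(load_math, 2)], tokenize=str.split, max_prompt_token_length=2
)
```

In this toy run the four-token prompt is filtered out and only the short one survives, mirroring how over-long prompts are dropped before generation.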