# Principle: Alibaba ROLL RLVR Dataset Preparation
| Knowledge Sources | Details |
|---|---|
| Domains | Data_Processing, NLP |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A data preprocessing principle for transforming raw instruction-response datasets into tokenized, domain-tagged batches suitable for reinforcement learning training.
## Description
RLVR Dataset Preparation handles the conversion of raw text datasets (JSON format with prompts and optional responses) into tokenized sequences ready for policy generation and reward computation. The process involves applying chat templates (e.g., Qwen2.5, ChatML) to format prompts correctly, tokenizing with the model's tokenizer, filtering by sequence length, and creating domain-aware batched dataloaders using stratified sampling.
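The chat-template step can be illustrated with a minimal sketch. The template string below mirrors the ChatML format used by Qwen2.5-style models; the function name `encode_with_chat_template` follows the pseudo-code later in this page, and the system prompt is a placeholder. In a real pipeline you would call the model tokenizer's `apply_chat_template` rather than hand-rolling the string.

```python
# Hypothetical sketch: wrap a raw prompt in a ChatML-style chat template.
# A production pipeline would use tokenizer.apply_chat_template instead.

CHATML_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def encode_with_chat_template(example, system="You are a helpful assistant."):
    """Format a raw prompt for the model; tokenization happens afterwards."""
    example["formatted_prompt"] = CHATML_TEMPLATE.format(
        system=system, prompt=example["prompt"]
    )
    return example

sample = {"prompt": "What is 2 + 2?"}
encoded = encode_with_chat_template(sample)
```

The trailing `<|im_start|>assistant\n` leaves the sequence open so the policy's generation continues directly as the assistant turn.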
The key challenge this addresses is multi-domain training: datasets from different domains (math, code, general reasoning) must be sampled according to configurable interleave probabilities while maintaining batch consistency for distributed training.
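One way to turn interleave probabilities into a concrete per-batch allocation is largest-remainder rounding: give each domain the floor of its proportional share, then hand the leftover slots to the domains with the largest fractional remainders. This helper is an illustrative assumption, not ROLL's actual allocation code.

```python
def sample_domain_counts(domain_probs, batch_size):
    """Allocate batch slots to domains proportionally to interleave
    probabilities, using largest-remainder rounding so counts sum exactly
    to batch_size. Illustrative sketch, not ROLL's implementation."""
    raw = {d: p * batch_size for d, p in domain_probs.items()}
    counts = {d: int(v) for d, v in raw.items()}
    leftover = batch_size - sum(counts.values())
    # Give remaining slots to the domains with the largest fractional parts.
    for d in sorted(raw, key=lambda d: raw[d] - counts[d], reverse=True)[:leftover]:
        counts[d] += 1
    return counts

counts = sample_domain_counts({"math": 0.5, "code": 0.3, "general": 0.2}, 16)
# counts == {"math": 8, "code": 5, "general": 3}
```

Exact per-batch counts (rather than independent random draws per sample) keep batch composition deterministic across distributed workers, which matters for the batch consistency mentioned above.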
## Usage
Use this principle when:
- Preparing multi-domain training data for RLVR pipelines
- Converting raw JSON datasets to tokenized format with chat templates
- Creating dataloaders with domain-stratified batching for balanced multi-domain training
## Theoretical Basis
The preprocessing pipeline follows a standard NLP data pipeline with RL-specific additions:
- Chat Template Application: Raw prompts are wrapped in model-specific chat templates (system prompt + user message format) to match the model's expected input format
- Tokenization: Template-formatted text is tokenized to produce input_ids and attention_mask tensors
- Length Filtering: Sequences exceeding the maximum prompt length or below minimum length are filtered out
- Domain-Stratified Batching: A BatchStratifiedSampler ensures each batch contains samples from multiple domains according to configured interleave probabilities
Pseudo-code:
```python
# Abstract data preparation flow: process each domain independently,
# then merge and batch with domain-stratified sampling.
domain_datasets = []
for domain in domains:
    dataset = load_json(domain.data_path)
    dataset = dataset.map(encode_with_chat_template)
    dataset = dataset.filter(lambda x: min_len < len(x["input_ids"]) <= max_len)
    domain_datasets.append(dataset)

merged_dataset = concatenate(domain_datasets)
dataloader = DataLoader(merged_dataset, sampler=BatchStratifiedSampler(domain_probs))
```
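The stratified sampler in the flow above can be sketched in plain Python. This is a deliberately simplified, hypothetical version: it assumes a `domain_indices` map from domain name to index lists in the merged dataset, fixes a per-batch quota per domain via floor-plus-remainder rounding, and stops when any domain pool can no longer fill its quota. A real implementation would typically subclass `torch.utils.data.Sampler` and account for distributed ranks.

```python
import random

class BatchStratifiedSampler:
    """Simplified sketch of a batch-stratified sampler: each yielded batch
    contains a fixed number of indices per domain, derived from the
    configured interleave probabilities."""

    def __init__(self, domain_indices, domain_probs, batch_size, seed=0):
        self.domain_indices = domain_indices
        self.rng = random.Random(seed)
        # Per-batch quota per domain; leftover slots go to the largest domain.
        quotas = {d: int(p * batch_size) for d, p in domain_probs.items()}
        quotas[max(domain_probs, key=domain_probs.get)] += (
            batch_size - sum(quotas.values())
        )
        self.quotas = quotas

    def __iter__(self):
        # Shuffle each domain pool once per epoch, then draw quota-sized
        # slices until some domain runs out.
        pools = {d: self.rng.sample(idx, len(idx))
                 for d, idx in self.domain_indices.items()}
        while all(len(pools[d]) >= q for d, q in self.quotas.items()):
            batch = []
            for d, q in self.quotas.items():
                batch.extend(pools[d][:q])
                pools[d] = pools[d][q:]
            self.rng.shuffle(batch)  # avoid fixed domain ordering within a batch
            yield batch

indices = {"math": list(range(0, 40)), "code": list(range(40, 60))}
sampler = BatchStratifiedSampler(indices, {"math": 0.75, "code": 0.25}, batch_size=8)
batches = list(sampler)  # each batch holds 6 math and 2 code indices
```

Dropping the final partial batch (rather than padding it) keeps every batch's domain mix identical, at the cost of discarding a few tail samples per epoch.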
## Related Pages
### Implemented By
### Related Heuristics
The following heuristics inform this principle: