
Principle:Alibaba ROLL RLVR Dataset Preparation

From Leeroopedia


Knowledge Sources
Domains Data_Processing, NLP
Last Updated 2026-02-07 20:00 GMT

Overview

A data preprocessing principle for transforming raw instruction-response datasets into tokenized, domain-tagged batches suitable for reinforcement learning with verifiable rewards (RLVR) training.

Description

RLVR Dataset Preparation handles the conversion of raw text datasets (JSON format with prompts and optional responses) into tokenized sequences ready for policy generation and reward computation. The process involves applying chat templates (e.g., Qwen2.5, ChatML) to format prompts correctly, tokenizing with the model's tokenizer, filtering by sequence length, and creating domain-aware batched dataloaders using stratified sampling.

The key challenge this addresses is multi-domain training: datasets from different domains (math, code, general reasoning) must be sampled according to configurable interleave probabilities while maintaining batch consistency for distributed training.
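Interleave probabilities can be illustrated with a simple per-sample domain draw. The probability values below are made-up examples, and `sample_domains` is an illustrative helper rather than part of ROLL's API.

```python
import random

# Hypothetical interleave probabilities for three domains; they must sum to 1.
domain_probs = {"math": 0.5, "code": 0.3, "general": 0.2}

def sample_domains(n: int, probs: dict[str, float], seed: int = 0) -> list[str]:
    """Draw a domain label for each of n training samples, weighted by probs."""
    rng = random.Random(seed)
    domains, weights = zip(*probs.items())
    return rng.choices(domains, weights=weights, k=n)

labels = sample_domains(1000, domain_probs)
```

Over a long run, each domain's share of sampled data converges to its configured probability, which is what lets practitioners rebalance math-heavy versus code-heavy mixtures without re-splitting the underlying files.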

Usage

Use this principle when:

  • Preparing multi-domain training data for RLVR pipelines
  • Converting raw JSON datasets to tokenized format with chat templates
  • Creating dataloaders with domain-stratified batching for balanced multi-domain training

Theoretical Basis

The preprocessing pipeline follows a standard NLP data pipeline with RL-specific additions:

  1. Chat Template Application: Raw prompts are wrapped in model-specific chat templates (system prompt + user message format) to match the model's expected input format
  2. Tokenization: Template-formatted text is tokenized to produce input_ids and attention_mask tensors
  3. Length Filtering: Sequences exceeding the maximum prompt length or below minimum length are filtered out
  4. Domain-Stratified Batching: A BatchStratifiedSampler ensures each batch contains samples from multiple domains according to configured interleave probabilities
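The batching step (4) can be sketched as a sampler that reserves a fixed per-domain quota in every batch. This is an assumed interface for illustration, not ROLL's actual BatchStratifiedSampler: quotas are rounded from the interleave probabilities, and iteration stops once any domain can no longer fill its quota.

```python
import random

class BatchStratifiedSampler:
    """Illustrative sampler: each batch draws a fixed quota of indices
    per domain so every batch mixes domains per the interleave probs."""

    def __init__(self, domain_indices: dict[str, list[int]],
                 domain_probs: dict[str, float], batch_size: int, seed: int = 0):
        self.domain_indices = domain_indices
        # Fixed per-domain quota per batch, rounded from the probabilities.
        self.quotas = {d: max(1, round(p * batch_size))
                       for d, p in domain_probs.items()}
        self.rng = random.Random(seed)

    def __iter__(self):
        # Shuffle each domain's index pool independently.
        pools = {d: self.rng.sample(ix, len(ix))
                 for d, ix in self.domain_indices.items()}
        # Emit batches until some domain cannot fill its quota.
        while all(len(pools[d]) >= q for d, q in self.quotas.items()):
            batch = []
            for d, q in self.quotas.items():
                batch.extend(pools[d][:q])
                pools[d] = pools[d][q:]
            self.rng.shuffle(batch)
            yield batch
```

Because every batch has the same domain composition, each distributed worker sees an identical mixture, which keeps gradient statistics consistent across ranks.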

Pseudo-code:

# Abstract data preparation flow
per_domain = []
for domain in domains:
    dataset = load_json(domain.data_path)
    dataset = dataset.map(encode_with_chat_template)
    dataset = dataset.filter(lambda x: min_len < len(x["input_ids"]) <= max_len)
    per_domain.append(dataset)
merged_dataset = concatenate(per_domain)
dataloader = DataLoader(merged_dataset, sampler=BatchStratifiedSampler(domain_probs))

Related Pages

Implemented By

Related Heuristics

