
Principle:Alibaba ROLL RLVR Dataset Preparation

From Leeroopedia


Knowledge Sources
Domains Data_Processing, NLP
Last Updated 2026-02-07 20:00 GMT

Overview

A data preprocessing principle for transforming raw instruction-response datasets into tokenized, domain-tagged batches suitable for reinforcement learning with verifiable rewards (RLVR) training.

Description

RLVR Dataset Preparation handles the conversion of raw text datasets (JSON format with prompts and optional responses) into tokenized sequences ready for policy generation and reward computation. The process involves applying chat templates (e.g., Qwen2.5, ChatML) to format prompts correctly, tokenizing with the model's tokenizer, filtering by sequence length, and creating domain-aware batched dataloaders using stratified sampling.

The key challenge this addresses is multi-domain training: datasets from different domains (math, code, general reasoning) must be sampled according to configurable interleave probabilities while maintaining batch consistency for distributed training.
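Interleave probabilities can be illustrated with a simple per-sample domain draw. The probability values below are made-up examples, and `sample_domains` is an illustrative helper rather than part of ROLL's API.

```python
import random

# Hypothetical interleave probabilities for three domains; they must sum to 1.
domain_probs = {"math": 0.5, "code": 0.3, "general": 0.2}

def sample_domains(n: int, probs: dict[str, float], seed: int = 0) -> list[str]:
    """Draw a domain label for each of n training samples, weighted by probs."""
    rng = random.Random(seed)
    domains, weights = zip(*probs.items())
    return rng.choices(domains, weights=weights, k=n)

labels = sample_domains(1000, domain_probs)
```

Over a long run, each domain's share of sampled data converges to its configured probability, which is what lets practitioners rebalance math-heavy versus code-heavy mixtures without re-splitting the underlying files.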

Usage

Use this principle when:

  • Preparing multi-domain training data for RLVR pipelines
  • Converting raw JSON datasets to tokenized format with chat templates
  • Creating dataloaders with domain-stratified batching for balanced multi-domain training

Theoretical Basis

The preprocessing pipeline follows a standard NLP data pipeline with RL-specific additions:

  1. Chat Template Application: Raw prompts are wrapped in model-specific chat templates (system prompt + user message format) to match the model's expected input format
  2. Tokenization: Template-formatted text is tokenized to produce input_ids and attention_mask tensors
  3. Length Filtering: Sequences exceeding the maximum prompt length or below minimum length are filtered out
  4. Domain-Stratified Batching: A BatchStratifiedSampler ensures each batch contains samples from multiple domains according to configured interleave probabilities
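The batching step (4) can be sketched as a sampler that reserves a fixed per-domain quota in every batch. This is an assumed interface for illustration, not ROLL's actual BatchStratifiedSampler: quotas are rounded from the interleave probabilities, and iteration stops once any domain can no longer fill its quota.

```python
import random

class BatchStratifiedSampler:
    """Illustrative sampler: each batch draws a fixed quota of indices
    per domain so every batch mixes domains per the interleave probs."""

    def __init__(self, domain_indices: dict[str, list[int]],
                 domain_probs: dict[str, float], batch_size: int, seed: int = 0):
        self.domain_indices = domain_indices
        # Fixed per-domain quota per batch, rounded from the probabilities.
        self.quotas = {d: max(1, round(p * batch_size))
                       for d, p in domain_probs.items()}
        self.rng = random.Random(seed)

    def __iter__(self):
        # Shuffle each domain's index pool independently.
        pools = {d: self.rng.sample(ix, len(ix))
                 for d, ix in self.domain_indices.items()}
        # Emit batches until some domain cannot fill its quota.
        while all(len(pools[d]) >= q for d, q in self.quotas.items()):
            batch = []
            for d, q in self.quotas.items():
                batch.extend(pools[d][:q])
                pools[d] = pools[d][q:]
            self.rng.shuffle(batch)
            yield batch
```

Because every batch has the same domain composition, each distributed worker sees an identical mixture, which keeps gradient statistics consistent across ranks.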

Pseudo-code:

# Abstract data preparation flow
per_domain = []
for domain in domains:
    dataset = load_json(domain.data_path)
    dataset = dataset.map(encode_with_chat_template)
    dataset = dataset.filter(lambda x: min_len < len(x["input_ids"]) <= max_len)
    per_domain.append(dataset)
merged_dataset = concatenate(per_domain)
dataloader = DataLoader(merged_dataset, sampler=BatchStratifiedSampler(domain_probs))

Related Pages

Implemented By

Related Heuristics

