
Implementation:Alibaba ROLL RLVR Preprocess Dataset

From Leeroopedia


Knowledge Sources
Domains Data_Processing, NLP
Last Updated 2026-02-07 20:00 GMT

Overview

Concrete dataset preprocessing functions for RLVR (Reinforcement Learning with Verifiable Rewards) training pipelines, provided by the Alibaba ROLL library.

Description

The preprocess_dataset and get_encode_function functions convert raw text datasets into tokenized sequences. get_encode_function builds a callable that applies a chat template and tokenizes the text; preprocess_dataset maps this encoding function over the dataset and filters out examples whose prompts exceed the maximum length. The pipeline also uses BatchStratifiedSampler for domain-aware batching.
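
The map-then-filter pattern described above can be sketched in plain Python; toy_encode and the list-based "dataset" here are hypothetical stand-ins for the real encode function and datasets.Dataset, used only to illustrate the flow:

```python
# Toy stand-in for the callable returned by get_encode_function:
# "tokenize" a prompt into one integer id per character.
def toy_encode(example):
    return {**example, "input_ids": [ord(c) for c in example["prompt"]]}

raw = [{"prompt": "hi"}, {"prompt": "a much longer prompt"}]
prompt_len = 8

# dataset.map(encode_function): attach input_ids to every example.
encoded = [toy_encode(ex) for ex in raw]

# dataset.filter(...): drop examples whose prompt exceeds prompt_len tokens.
filtered = [ex for ex in encoded if len(ex["input_ids"]) <= prompt_len]
```

With a real HuggingFace dataset, the same two steps would be `dataset.map(encode_fn)` followed by `dataset.filter(...)`.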

Usage

Import these functions when building the data pipeline for an RLVR training run. They are called during pipeline initialization to prepare the training dataloader.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/rlvr/rlvr_pipeline.py
  • Lines: L102-200

Signature

def preprocess_dataset(
    dataset: datasets.Dataset,
    prompt_len: int,
    encode_function: Callable,
    data_args
) -> datasets.Dataset:
    """
    Preprocess dataset by encoding and filtering.

    Args:
        dataset: HuggingFace dataset to preprocess
        prompt_len: Maximum prompt length for filtering
        encode_function: Function to encode text data (from get_encode_function)
        data_args: Data-related configuration arguments

    Returns:
        Preprocessed dataset with encoded input_ids, filtered by length
    """

def get_encode_function(
    template_name: str,
    tokenizer,
    data_args
) -> Callable:
    """
    Create encoding function for chat templates.

    Args:
        template_name: Name of chat template (native/qwen2_5/chatml)
        tokenizer: Tokenizer instance
        data_args: Data configuration with prompt_key, response_key, etc.

    Returns:
        Callable that encodes text using the chat template and tokenizer
    """

Import

from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template

I/O Contract

Inputs

Name            | Type                | Required | Description
dataset         | datasets.Dataset    | Yes      | Raw HuggingFace dataset loaded from JSON
prompt_len      | int                 | Yes      | Maximum prompt token length for filtering
encode_function | Callable            | Yes      | Encoding function from get_encode_function
template_name   | str                 | Yes      | Chat template name (native/qwen2_5/chatml)
tokenizer       | PreTrainedTokenizer | Yes      | Model tokenizer for text encoding

Outputs

Name                 | Type                        | Description
Preprocessed dataset | datasets.Dataset            | Dataset with input_ids, attention_mask columns
DataLoader           | torch.utils.data.DataLoader | Batched dataloader with BatchStratifiedSampler
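
Domain-stratified batching, as performed by BatchStratifiedSampler, might look roughly like this sketch; stratified_batches and its per-domain quota logic are illustrative assumptions, not the library's implementation:

```python
import random

def stratified_batches(domains, batch_size, domain_probs, seed=0):
    """Draw one batch of example indices with a fixed share per domain."""
    rng = random.Random(seed)
    # Group example indices by their domain label.
    by_domain = {}
    for idx, d in enumerate(domains):
        by_domain.setdefault(d, []).append(idx)
    # Fixed per-batch quota per domain (int truncation may undershoot batch_size).
    quotas = {d: int(batch_size * p) for d, p in domain_probs.items()}
    batch = []
    for d, q in quotas.items():
        batch.extend(rng.sample(by_domain[d], q))  # sample without replacement
    rng.shuffle(batch)  # mix domains within the batch
    return batch

batch = stratified_batches(
    domains=["math"] * 50 + ["code"] * 30 + ["general"] * 20,
    batch_size=10,
    domain_probs={"math": 0.5, "code": 0.3, "general": 0.2},
)
```

Each batch then contains a predictable mix of domains, which keeps per-domain gradient contributions stable across training steps.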

Usage Examples

Basic Dataset Preparation

from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template
from roll.models.model_providers import default_tokenizer_provider
import datasets

# 1. Load tokenizer
tokenizer = default_tokenizer_provider("Qwen/Qwen2.5-7B")

# 2. Create encode function with chat template
encode_fn = get_encode_function(
    template_name="qwen2_5",
    tokenizer=tokenizer,
    data_args=config  # Contains prompt_key, response_key, etc.
)

# 3. Load and preprocess dataset
dataset = datasets.load_dataset("json", data_files="math_train.json", split="train")
processed = preprocess_dataset(
    dataset=dataset,
    prompt_len=1024,
    encode_function=encode_fn,
    data_args=config
)

# 4. Create dataloader with domain-stratified batching
from torch.utils.data import DataLoader
from roll.datasets.dataset import BatchStratifiedSampler

sampler = BatchStratifiedSampler(processed, batch_size=32, domain_probs={"math": 0.5, "code": 0.3, "general": 0.2})
dataloader = DataLoader(processed, batch_sampler=sampler)
