Implementation: Alibaba ROLL RLVR Preprocess Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete dataset preprocessing functions for RLVR training pipelines provided by the Alibaba ROLL library.
Description
The preprocess_dataset and get_encode_function functions convert raw text datasets into tokenized sequences. get_encode_function returns a callable that applies a chat template and tokenizes the text; preprocess_dataset maps this encoding function over the dataset and filters out examples whose prompts exceed the length limit. The pipeline then uses BatchStratifiedSampler for domain-aware batching.
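The map-then-filter flow described above can be sketched in plain Python. This is a simplified illustration, not ROLL's implementation: plain lists of dicts stand in for a HuggingFace Dataset, and fake_encode is a hypothetical stand-in for the chat-template-plus-tokenizer step.

```python
def fake_encode(example):
    # Hypothetical stand-in for the real encode function: whitespace
    # "tokenization" producing one token id per word.
    tokens = example["prompt"].split()
    return {**example, "input_ids": list(range(len(tokens)))}

def preprocess(rows, prompt_len, encode_function):
    # Mirrors dataset.map(encode_function) ...
    encoded = [encode_function(r) for r in rows]
    # ... followed by dataset.filter(len(input_ids) <= prompt_len)
    return [r for r in encoded if len(r["input_ids"]) <= prompt_len]

rows = [{"prompt": "one two three"}, {"prompt": "a " * 50}]
processed = preprocess(rows, prompt_len=10, encode_function=fake_encode)
# The 50-token prompt is dropped; only the short one survives.
```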
Usage
Import these functions when building the data pipeline for an RLVR training run. They are called during pipeline initialization to prepare the training dataloader.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/rlvr/rlvr_pipeline.py
- Lines: L102-L200
Signature
def preprocess_dataset(
dataset: datasets.Dataset,
prompt_len: int,
encode_function: Callable,
data_args
) -> datasets.Dataset:
"""
Preprocess dataset by encoding and filtering.
Args:
dataset: HuggingFace dataset to preprocess
prompt_len: Maximum prompt length for filtering
encode_function: Function to encode text data (from get_encode_function)
data_args: Data-related configuration arguments
Returns:
Preprocessed dataset with encoded input_ids, filtered by length
"""
def get_encode_function(
template_name: str,
tokenizer,
data_args
) -> Callable:
"""
Create encoding function for chat templates.
Args:
template_name: Name of chat template (native/qwen2_5/chatml)
tokenizer: Tokenizer instance
data_args: Data configuration with prompt_key, response_key, etc.
Returns:
Callable that encodes text using the chat template and tokenizer
"""
Import
from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Raw HuggingFace dataset loaded from JSON |
| prompt_len | int | Yes | Maximum prompt token length for filtering |
| encode_function | Callable | Yes | Encoding function from get_encode_function |
| template_name | str | Yes | Chat template name (native/qwen2_5/chatml) |
| tokenizer | PreTrainedTokenizer | Yes | Model tokenizer for text encoding |
Outputs
| Name | Type | Description |
|---|---|---|
| Preprocessed dataset | datasets.Dataset | Dataset with input_ids, attention_mask columns |
| DataLoader | torch.utils.data.DataLoader | Batched dataloader with BatchStratifiedSampler |
Usage Examples
Basic Dataset Preparation
from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template
from roll.models.model_providers import default_tokenizer_provider
import datasets
# 1. Load tokenizer
tokenizer = default_tokenizer_provider("Qwen/Qwen2.5-7B")
# 2. Create encode function with chat template
encode_fn = get_encode_function(
template_name="qwen2_5",
tokenizer=tokenizer,
data_args=config # Contains prompt_key, response_key, etc.
)
# 3. Load and preprocess dataset
dataset = datasets.load_dataset("json", data_files="math_train.json", split="train")
processed = preprocess_dataset(
dataset=dataset,
prompt_len=1024,
encode_function=encode_fn,
data_args=config
)
# 4. Create dataloader with domain-stratified batching
from roll.datasets.dataset import BatchStratifiedSampler
from torch.utils.data import DataLoader
sampler = BatchStratifiedSampler(processed, batch_size=32, domain_probs={"math": 0.5, "code": 0.3, "general": 0.2})
dataloader = DataLoader(processed, batch_sampler=sampler)
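The idea behind domain-stratified batching can be illustrated with a toy sampler. This is not ROLL's BatchStratifiedSampler, just a sketch of drawing each batch slot's domain according to domain_probs and then picking an example index from that domain.

```python
import random

def stratified_batch(indices_by_domain, batch_size, domain_probs, rng):
    # For each slot in the batch, first sample a domain with the
    # configured probability, then sample an example index from it.
    domains = list(domain_probs)
    weights = [domain_probs[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        d = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(indices_by_domain[d]))
    return batch

rng = random.Random(0)
indices_by_domain = {"math": [0, 1, 2], "code": [3, 4], "general": [5]}
batch = stratified_batch(
    indices_by_domain, batch_size=8,
    domain_probs={"math": 0.5, "code": 0.3, "general": 0.2}, rng=rng,
)
```

In expectation, roughly half of each batch comes from the math domain, which keeps the per-batch domain mix stable even when the underlying dataset is imbalanced.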
Related Pages
Implements Principle
Requires Environment