Implementation: Alibaba ROLL RLVR Preprocess Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete dataset preprocessing functions for RLVR training pipelines provided by the Alibaba ROLL library.
Description
The preprocess_dataset and get_encode_function functions convert raw text datasets into tokenized sequences. get_encode_function returns a callable that applies a chat template and tokenizes the text; preprocess_dataset maps this encoding function over the dataset and filters out examples whose prompts exceed the length limit. The pipeline then uses BatchStratifiedSampler for domain-aware batching.
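The map-then-filter flow described above can be sketched in plain Python. This is a simplified illustration, not ROLL's implementation: plain lists of dicts stand in for a HuggingFace Dataset, and fake_encode is a hypothetical stand-in for the chat-template-plus-tokenizer step.

```python
def fake_encode(example):
    # Hypothetical stand-in for the real encode function: whitespace
    # "tokenization" producing one token id per word.
    tokens = example["prompt"].split()
    return {**example, "input_ids": list(range(len(tokens)))}

def preprocess(rows, prompt_len, encode_function):
    # Mirrors dataset.map(encode_function) ...
    encoded = [encode_function(r) for r in rows]
    # ... followed by dataset.filter(len(input_ids) <= prompt_len)
    return [r for r in encoded if len(r["input_ids"]) <= prompt_len]

rows = [{"prompt": "one two three"}, {"prompt": "a " * 50}]
processed = preprocess(rows, prompt_len=10, encode_function=fake_encode)
# The 50-token prompt is dropped; only the short one survives.
```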
Usage
Import these functions when building the data pipeline for an RLVR training run. They are called during pipeline initialization to prepare the training dataloader.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/rlvr/rlvr_pipeline.py
- Lines: L102-L200
Signature
def preprocess_dataset(
dataset: datasets.Dataset,
prompt_len: int,
encode_function: Callable,
data_args
) -> datasets.Dataset:
"""
Preprocess dataset by encoding and filtering.
Args:
dataset: HuggingFace dataset to preprocess
prompt_len: Maximum prompt length for filtering
encode_function: Function to encode text data (from get_encode_function)
data_args: Data-related configuration arguments
Returns:
Preprocessed dataset with encoded input_ids, filtered by length
"""
def get_encode_function(
template_name: str,
tokenizer,
data_args
) -> Callable:
"""
Create encoding function for chat templates.
Args:
template_name: Name of chat template (native/qwen2_5/chatml)
tokenizer: Tokenizer instance
data_args: Data configuration with prompt_key, response_key, etc.
Returns:
Callable that encodes text using the chat template and tokenizer
"""
Import
from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Raw HuggingFace dataset loaded from JSON |
| prompt_len | int | Yes | Maximum prompt token length for filtering |
| encode_function | Callable | Yes | Encoding function from get_encode_function |
| template_name | str | Yes | Chat template name (native/qwen2_5/chatml) |
| tokenizer | PreTrainedTokenizer | Yes | Model tokenizer for text encoding |
Outputs
| Name | Type | Description |
|---|---|---|
| Preprocessed dataset | datasets.Dataset | Dataset with input_ids, attention_mask columns |
| DataLoader | torch.utils.data.DataLoader | Batched dataloader with BatchStratifiedSampler |
Usage Examples
Basic Dataset Preparation
from roll.pipeline.rlvr.rlvr_pipeline import preprocess_dataset, get_encode_function
from roll.datasets.chat_template import get_chat_template
from roll.models.model_providers import default_tokenizer_provider
import datasets
# 1. Load tokenizer
tokenizer = default_tokenizer_provider("Qwen/Qwen2.5-7B")
# 2. Create encode function with chat template
encode_fn = get_encode_function(
template_name="qwen2_5",
tokenizer=tokenizer,
data_args=config # Contains prompt_key, response_key, etc.
)
# 3. Load and preprocess dataset
dataset = datasets.load_dataset("json", data_files="math_train.json", split="train")
processed = preprocess_dataset(
dataset=dataset,
prompt_len=1024,
encode_function=encode_fn,
data_args=config
)
# 4. Create dataloader with domain-stratified batching
from roll.datasets.dataset import BatchStratifiedSampler
from torch.utils.data import DataLoader
sampler = BatchStratifiedSampler(processed, batch_size=32, domain_probs={"math": 0.5, "code": 0.3, "general": 0.2})
dataloader = DataLoader(processed, batch_sampler=sampler)
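The idea behind domain-stratified batching can be illustrated with a toy sampler. This is not ROLL's BatchStratifiedSampler, just a sketch of drawing each batch slot's domain according to domain_probs and then picking an example index from that domain.

```python
import random

def stratified_batch(indices_by_domain, batch_size, domain_probs, rng):
    # For each slot in the batch, first sample a domain with the
    # configured probability, then sample an example index from it.
    domains = list(domain_probs)
    weights = [domain_probs[d] for d in domains]
    batch = []
    for _ in range(batch_size):
        d = rng.choices(domains, weights=weights, k=1)[0]
        batch.append(rng.choice(indices_by_domain[d]))
    return batch

rng = random.Random(0)
indices_by_domain = {"math": [0, 1, 2], "code": [3, 4], "general": [5]}
batch = stratified_batch(
    indices_by_domain, batch_size=8,
    domain_probs={"math": 0.5, "code": 0.3, "general": 0.2}, rng=rng,
)
```

In expectation, roughly half of each batch comes from the math domain, which keeps the per-batch domain mix stable even when the underlying dataset is imbalanced.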
Related Pages
Implements Principle
Requires Environment