# Implementation: AllenAI Open Instruct PreferenceDatasetProcessor
| Component Type | Class |
|---|---|
| Source | open_instruct/dataset_processor.py (Lines 229-276) |
| Repository | Open Instruct |
| Dependencies | transformers, datasets |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Concrete tool for tokenizing and filtering preference datasets (chosen/rejected response pairs) provided by the Open Instruct library.
## Description
PreferenceDatasetProcessor is a subclass of DatasetProcessor that handles the transformation of raw preference datasets into a tokenized format suitable for DPO and reward model training. It provides two primary methods:
- `tokenize(dataset)` -- Applies a chat template to each example's chosen and rejected message lists, producing `input_ids_prompt`, `input_ids_chosen`, `input_ids_rejected`, and their corresponding attention masks. The prompt is extracted from the chosen messages (all messages except the last) and tokenized with `add_generation_prompt=True`.
- `filter(dataset)` -- Removes examples whose tokenized sequences exceed the configured maximum lengths (`max_prompt_token_length` for prompts, `max_token_length` for chosen/rejected sequences). Reports the percentage of filtered examples via logging.

Both methods leverage HuggingFace `datasets.Dataset.map()` and `datasets.Dataset.filter()` with configurable multiprocessing via `num_proc`.
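The per-example transformation behind `tokenize` can be sketched as follows. This is a hedged illustration, not the library's code: the real processor uses the tokenizer's chat-template machinery (e.g. `tokenizer.apply_chat_template`), which is replaced here by a toy word-level tokenizer so the sketch is self-contained, and the `-1` generation-prompt sentinel is an invention of this sketch.

```python
def toy_tokenize(messages, add_generation_prompt=False):
    """Stand-in for a chat-template tokenizer: one deterministic 'token id'
    per word, plus a sentinel id (-1) when a generation prompt is appended."""
    ids = [sum(map(ord, w)) % 1000 for m in messages for w in m["content"].split()]
    return ids + ([-1] if add_generation_prompt else [])

def tokenize_row(row):
    """Mirror of the column layout described above for a single example."""
    chosen, rejected = row["chosen"], row["rejected"]
    prompt = chosen[:-1]  # prompt = chosen messages minus the final response
    input_ids_prompt = toy_tokenize(prompt, add_generation_prompt=True)
    input_ids_chosen = toy_tokenize(chosen)
    input_ids_rejected = toy_tokenize(rejected)
    return {
        "input_ids_prompt": input_ids_prompt,
        "attention_mask_prompt": [1] * len(input_ids_prompt),
        "input_ids_chosen": input_ids_chosen,
        "attention_mask_chosen": [1] * len(input_ids_chosen),
        "input_ids_rejected": input_ids_rejected,
        "attention_mask_rejected": [1] * len(input_ids_rejected),
    }
```

In the library this per-row function would be applied across the dataset via `datasets.Dataset.map()`.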
## Usage
Import and use PreferenceDatasetProcessor when preparing preference data for DPO training, reward model training, or any pipeline that requires tokenized chosen/rejected response pairs.
## Code Reference

### Source Location

- Repository: Open Instruct
- File: `open_instruct/dataset_processor.py` (Lines 229-276)
### Signature

```python
class PreferenceDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Apply chat templates to chosen/rejected pairs, producing token IDs and attention masks."""
        ...

    def filter(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Filter examples exceeding max_prompt_token_length or max_token_length."""
        ...
```
### Import

```python
from open_instruct.dataset_processor import PreferenceDatasetProcessor
```
## I/O Contract

### Inputs

| Parameter | Type | Description |
|---|---|---|
| `dataset` | `Dataset` or `DatasetDict` | Raw preference dataset containing chosen and rejected message columns (configured via `config.preference_chosen_key` and `config.preference_rejected_key`). Each row's chosen/rejected fields are lists of chat messages. |
### Outputs (from tokenize)

| Column Key | Type | Description |
|---|---|---|
| `input_ids_prompt` | `list[int]` | Tokenized prompt (chosen messages minus last message), with generation prompt appended. |
| `attention_mask_prompt` | `list[int]` | All-ones mask matching `input_ids_prompt` length. |
| `input_ids_chosen` | `list[int]` | Full tokenized chosen conversation (prompt + chosen response). |
| `attention_mask_chosen` | `list[int]` | All-ones mask matching `input_ids_chosen` length. |
| `input_ids_rejected` | `list[int]` | Full tokenized rejected conversation (prompt + rejected response). |
| `attention_mask_rejected` | `list[int]` | All-ones mask matching `input_ids_rejected` length. |
### Outputs (from filter)

| Output | Type | Description |
|---|---|---|
| Filtered dataset | `Dataset` or `DatasetDict` | Subset of the input in which all sequences are within the configured length limits. |
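The keep/drop predicate behind `filter` can be sketched as a simple length check. This is a hedged sketch, not the library's code; the default caps of 512 and 1024 below are illustrative assumptions, not Open Instruct's defaults.

```python
def within_limits(row, max_prompt_token_length=512, max_token_length=1024):
    """Keep the row only if every tokenized sequence fits the configured caps.
    (Caps of 512/1024 are illustrative assumptions.)"""
    return (
        len(row["input_ids_prompt"]) <= max_prompt_token_length
        and len(row["input_ids_chosen"]) <= max_token_length
        and len(row["input_ids_rejected"]) <= max_token_length
    )
```

In the library this predicate shape would be passed to `datasets.Dataset.filter()`, which drops rows for which it returns `False`.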
## Usage Examples

```python
from open_instruct.dataset_processor import PreferenceDatasetProcessor

# Assume `config` is a DatasetProcessorConfig and `tokenizer` is a PreTrainedTokenizer
processor = PreferenceDatasetProcessor(config=config, tokenizer=tokenizer)

# Tokenize the preference dataset
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter out examples exceeding max length
filtered_dataset = processor.filter(tokenized_dataset)
```
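For reference, one row of the raw preference dataset fed to `tokenize` has the following shape. The key names `"chosen"`/`"rejected"` are assumptions standing in for whatever `config.preference_chosen_key` and `config.preference_rejected_key` are set to; the conversation content is invented for illustration.

```python
# Hypothetical example of the raw row schema the processor expects:
# each preference column is a full chat-message list, and the two lists
# share the same prompt messages, differing only in the final response.
raw_row = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "5."},
    ],
}
```

The prompt that `tokenize` extracts is `raw_row["chosen"][:-1]`, i.e. everything before the final assistant turn.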