
Implementation:Allenai Open instruct PreferenceDatasetProcessor

From Leeroopedia


Component Type Class
Source open_instruct/dataset_processor.py (Lines 229-276)
Repository Open Instruct
Dependencies transformers, datasets
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete class, provided by the Open Instruct library, for tokenizing and filtering preference datasets (chosen/rejected response pairs).

Description

PreferenceDatasetProcessor is a subclass of DatasetProcessor that handles the transformation of raw preference datasets into a tokenized format suitable for DPO and reward model training. It provides two primary methods:

  • tokenize(dataset) -- Applies a chat template to each example's chosen and rejected message lists, producing input_ids_prompt, input_ids_chosen, input_ids_rejected, and their corresponding attention masks. The prompt is extracted from the chosen messages (all messages except the last) and tokenized with add_generation_prompt=True.
  • filter(dataset) -- Removes examples whose tokenized sequences exceed configured maximum lengths (max_prompt_token_length for prompts, max_token_length for chosen/rejected sequences). Reports the percentage of filtered examples via logging.
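The per-row mapping described above can be sketched in plain Python (a minimal illustration, not the library's code: the stub tokenizer and the helper name `tokenize_row` are hypothetical; Open Instruct uses a real HuggingFace `PreTrainedTokenizer` whose `apply_chat_template` returns token IDs):

```python
# Minimal sketch of the tokenize logic. The stub tokenizer is hypothetical
# and only mimics the interface of a chat-template-capable tokenizer.

class StubTokenizer:
    """Maps each word of each message to one token id (illustration only)."""
    def apply_chat_template(self, messages, add_generation_prompt=False):
        ids = []
        for m in messages:
            ids.extend(hash(w) % 1000 for w in m["content"].split())
        if add_generation_prompt:
            ids.append(999)  # stand-in for the generation-prompt tokens
        return ids

def tokenize_row(row, tok):
    chosen, rejected = row["chosen"], row["rejected"]
    # Prompt = all chosen messages except the final (assistant) response,
    # tokenized with the generation prompt appended.
    prompt_ids = tok.apply_chat_template(chosen[:-1], add_generation_prompt=True)
    chosen_ids = tok.apply_chat_template(chosen)
    rejected_ids = tok.apply_chat_template(rejected)
    return {
        "input_ids_prompt": prompt_ids,
        "attention_mask_prompt": [1] * len(prompt_ids),
        "input_ids_chosen": chosen_ids,
        "attention_mask_chosen": [1] * len(chosen_ids),
        "input_ids_rejected": rejected_ids,
        "attention_mask_rejected": [1] * len(rejected_ids),
    }
```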

Both methods use HuggingFace datasets.Dataset.map() and datasets.Dataset.filter(), with multiprocessing configurable via num_proc.

Usage

Import and use PreferenceDatasetProcessor when preparing preference data for DPO training, reward model training, or any pipeline that requires tokenized chosen/rejected response pairs.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dataset_processor.py (Lines 229-276)

Signature

class PreferenceDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Apply chat templates to chosen/rejected pairs, producing token IDs and attention masks."""
        ...

    def filter(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Filter examples exceeding max_prompt_token_length or max_token_length."""
        ...

Import

from open_instruct.dataset_processor import PreferenceDatasetProcessor

I/O Contract

Inputs

Parameter Type Description
dataset Dataset or DatasetDict Raw preference dataset containing chosen and rejected message columns (configured via config.preference_chosen_key and config.preference_rejected_key). Each row's chosen/rejected fields are lists of chat messages.

Outputs (from tokenize)

Column Key Type Description
input_ids_prompt list[int] Tokenized prompt (chosen messages minus last message), with generation prompt appended.
attention_mask_prompt list[int] All-ones mask matching input_ids_prompt length.
input_ids_chosen list[int] Full tokenized chosen conversation (prompt + chosen response).
attention_mask_chosen list[int] All-ones mask matching input_ids_chosen length.
input_ids_rejected list[int] Full tokenized rejected conversation (prompt + rejected response).
attention_mask_rejected list[int] All-ones mask matching input_ids_rejected length.

Outputs (from filter)

Output Type Description
Filtered dataset Dataset or DatasetDict Subset of input where all sequences are within configured length limits.
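The length check that produces this output can be illustrated with a plain predicate (a sketch under the assumption, stated above, that the config exposes max_prompt_token_length and max_token_length; the helper name `within_length_limits` is hypothetical, and a plain list stands in for the Dataset):

```python
# Sketch of the filter predicate: a row survives only if its prompt and
# both full sequences are within the configured limits.
def within_length_limits(row, max_prompt_token_length, max_token_length):
    return (
        len(row["input_ids_prompt"]) <= max_prompt_token_length
        and len(row["input_ids_chosen"]) <= max_token_length
        and len(row["input_ids_rejected"]) <= max_token_length
    )

# Dataset.filter(...) would apply this per row; a plain list stands in here.
rows = [
    {"input_ids_prompt": [1, 2],
     "input_ids_chosen": [1, 2, 3],
     "input_ids_rejected": [1, 2, 3, 4]},
    {"input_ids_prompt": [1] * 10,  # prompt too long: filtered out
     "input_ids_chosen": [1] * 3,
     "input_ids_rejected": [1] * 3},
]
kept = [r for r in rows
        if within_length_limits(r, max_prompt_token_length=4, max_token_length=8)]
```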

Usage Examples

from open_instruct.dataset_processor import PreferenceDatasetProcessor

# Assume `config` is a DatasetProcessorConfig and `tokenizer` is a PreTrainedTokenizer
processor = PreferenceDatasetProcessor(config=config, tokenizer=tokenizer)

# Tokenize the preference dataset (raw_dataset: a Dataset or DatasetDict
# with chosen/rejected message-list columns)
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter out examples exceeding max length
filtered_dataset = processor.filter(tokenized_dataset)
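Continuing the example, rows coming out of tokenize can be sanity-checked against the I/O contract above (the row shown here is made up for illustration; real rows come from processor.tokenize):

```python
# Illustrative sanity check against the tokenize output contract.
row = {
    "input_ids_prompt": [101, 7, 42, 999],
    "attention_mask_prompt": [1, 1, 1, 1],
    "input_ids_chosen": [101, 7, 42, 55, 102],
    "attention_mask_chosen": [1, 1, 1, 1, 1],
    "input_ids_rejected": [101, 7, 42, 88, 102],
    "attention_mask_rejected": [1, 1, 1, 1, 1],
}

# The six expected column keys, as listed in the I/O contract.
expected_columns = {
    "input_ids_prompt", "attention_mask_prompt",
    "input_ids_chosen", "attention_mask_chosen",
    "input_ids_rejected", "attention_mask_rejected",
}
assert set(row) == expected_columns

# Each attention mask is all ones and matches its input_ids length.
for key in ("prompt", "chosen", "rejected"):
    assert row[f"attention_mask_{key}"] == [1] * len(row[f"input_ids_{key}"])
```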

Related Pages

Implements Principle
