# Implementation: AllenAI Open Instruct PreferenceDatasetProcessor
| Component Type | Class |
|---|---|
| Source | open_instruct/dataset_processor.py (Lines 229-276) |
| Repository | Open Instruct |
| Dependencies | transformers, datasets |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Concrete tool for tokenizing and filtering preference datasets (chosen/rejected response pairs) provided by the Open Instruct library.
## Description
PreferenceDatasetProcessor is a subclass of DatasetProcessor that handles the transformation of raw preference datasets into a tokenized format suitable for DPO and reward model training. It provides two primary methods:
- `tokenize(dataset)` -- Applies a chat template to each example's chosen and rejected message lists, producing `input_ids_prompt`, `input_ids_chosen`, `input_ids_rejected`, and their corresponding attention masks. The prompt is extracted from the chosen messages (all messages except the last) and tokenized with `add_generation_prompt=True`.
- `filter(dataset)` -- Removes examples whose tokenized sequences exceed the configured maximum lengths (`max_prompt_token_length` for prompts, `max_token_length` for chosen/rejected sequences). Reports the percentage of filtered examples via logging.

Both methods leverage HuggingFace `datasets.Dataset.map()` and `datasets.Dataset.filter()` with configurable multiprocessing via `num_proc`.
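The per-example transformation behind `tokenize` can be sketched as follows. This is a hedged illustration, not the library's code: the real processor uses the tokenizer's chat-template machinery (e.g. `tokenizer.apply_chat_template`), which is replaced here by a toy word-level tokenizer so the sketch is self-contained, and the `-1` generation-prompt sentinel is an invention of this sketch.

```python
def toy_tokenize(messages, add_generation_prompt=False):
    """Stand-in for a chat-template tokenizer: one deterministic 'token id'
    per word, plus a sentinel id (-1) when a generation prompt is appended."""
    ids = [sum(map(ord, w)) % 1000 for m in messages for w in m["content"].split()]
    return ids + ([-1] if add_generation_prompt else [])

def tokenize_row(row):
    """Mirror of the column layout described above for a single example."""
    chosen, rejected = row["chosen"], row["rejected"]
    prompt = chosen[:-1]  # prompt = chosen messages minus the final response
    input_ids_prompt = toy_tokenize(prompt, add_generation_prompt=True)
    input_ids_chosen = toy_tokenize(chosen)
    input_ids_rejected = toy_tokenize(rejected)
    return {
        "input_ids_prompt": input_ids_prompt,
        "attention_mask_prompt": [1] * len(input_ids_prompt),
        "input_ids_chosen": input_ids_chosen,
        "attention_mask_chosen": [1] * len(input_ids_chosen),
        "input_ids_rejected": input_ids_rejected,
        "attention_mask_rejected": [1] * len(input_ids_rejected),
    }
```

In the library this per-row function would be applied across the dataset via `datasets.Dataset.map()`.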
## Usage
Import and use PreferenceDatasetProcessor when preparing preference data for DPO training, reward model training, or any pipeline that requires tokenized chosen/rejected response pairs.
## Code Reference

### Source Location

- Repository: Open Instruct
- File: `open_instruct/dataset_processor.py` (Lines 229-276)
### Signature

```python
class PreferenceDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Apply chat templates to chosen/rejected pairs, producing token IDs and attention masks."""
        ...

    def filter(self, dataset: Union[Dataset, DatasetDict]) -> Union[Dataset, DatasetDict]:
        """Filter examples exceeding max_prompt_token_length or max_token_length."""
        ...
```
### Import

```python
from open_instruct.dataset_processor import PreferenceDatasetProcessor
```
## I/O Contract

### Inputs

| Parameter | Type | Description |
|---|---|---|
| `dataset` | `Dataset` or `DatasetDict` | Raw preference dataset containing chosen and rejected message columns (configured via `config.preference_chosen_key` and `config.preference_rejected_key`). Each row's chosen/rejected fields are lists of chat messages. |
### Outputs (from tokenize)

| Column Key | Type | Description |
|---|---|---|
| `input_ids_prompt` | `list[int]` | Tokenized prompt (chosen messages minus last message), with generation prompt appended. |
| `attention_mask_prompt` | `list[int]` | All-ones mask matching `input_ids_prompt` length. |
| `input_ids_chosen` | `list[int]` | Full tokenized chosen conversation (prompt + chosen response). |
| `attention_mask_chosen` | `list[int]` | All-ones mask matching `input_ids_chosen` length. |
| `input_ids_rejected` | `list[int]` | Full tokenized rejected conversation (prompt + rejected response). |
| `attention_mask_rejected` | `list[int]` | All-ones mask matching `input_ids_rejected` length. |
### Outputs (from filter)

| Output | Type | Description |
|---|---|---|
| Filtered dataset | `Dataset` or `DatasetDict` | Subset of the input in which all sequences are within the configured length limits. |
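The keep/drop predicate behind `filter` can be sketched as a simple length check. This is a hedged sketch, not the library's code; the default caps of 512 and 1024 below are illustrative assumptions, not Open Instruct's defaults.

```python
def within_limits(row, max_prompt_token_length=512, max_token_length=1024):
    """Keep the row only if every tokenized sequence fits the configured caps.
    (Caps of 512/1024 are illustrative assumptions.)"""
    return (
        len(row["input_ids_prompt"]) <= max_prompt_token_length
        and len(row["input_ids_chosen"]) <= max_token_length
        and len(row["input_ids_rejected"]) <= max_token_length
    )
```

In the library this predicate shape would be passed to `datasets.Dataset.filter()`, which drops rows for which it returns `False`.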
## Usage Examples

```python
from open_instruct.dataset_processor import PreferenceDatasetProcessor

# Assume `config` is a DatasetProcessorConfig and `tokenizer` is a PreTrainedTokenizer
processor = PreferenceDatasetProcessor(config=config, tokenizer=tokenizer)

# Tokenize the preference dataset
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter out examples exceeding max length
filtered_dataset = processor.filter(tokenized_dataset)
```
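For reference, one row of the raw preference dataset fed to `tokenize` has the following shape. The key names `"chosen"`/`"rejected"` are assumptions standing in for whatever `config.preference_chosen_key` and `config.preference_rejected_key` are set to; the conversation content is invented for illustration.

```python
# Hypothetical example of the raw row schema the processor expects:
# each preference column is a full chat-message list, and the two lists
# share the same prompt messages, differing only in the final response.
raw_row = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 equals 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "5."},
    ],
}
```

The prompt that `tokenize` extracts is `raw_row["chosen"][:-1]`, i.e. everything before the final assistant turn.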