Implementation:Huggingface Trl Get Dataset DataCollatorForPreference DPO
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Concrete tool for loading preference datasets and collating them into padded batches for DPO training, provided by the TRL library.
Description
The DPO data pipeline consists of two main components:
Dataset loading (in trl/scripts/dpo.py) supports two paths:
- DatasetMixtureConfig path: When
dataset_args.datasetsis provided, theget_datasetfunction loads and concatenates multiple datasets according to the mixture configuration, optionally creating a train/test split. - Single dataset path: When
script_args.dataset_nameis provided,datasets.load_datasetloads a single dataset with optional config name and streaming mode.
DataCollatorForPreference (in trl/trainer/dpo_trainer.py) is a dataclass-based collator that:
- Converts lists of integer IDs to tensors for prompt, chosen, and rejected sequences
- Left-pads prompt sequences (so the actual tokens are right-aligned for causal attention)
- Right-pads chosen and rejected completion sequences to the maximum length in the batch
- Creates corresponding attention masks (1 for real tokens, 0 for padding)
- Handles optional pixel values, pixel attention masks, image sizes (for vision-language models)
- Passes through precomputed reference log probabilities when available
The DPOTrainer's _prepare_dataset method orchestrates preprocessing: it extracts prompts via maybe_extract_prompt, applies chat templates via maybe_apply_chat_template, and tokenizes rows via the static tokenize_row method.
Usage
Use these components when:
- Loading standard preference datasets from the Hugging Face Hub
- Mixing multiple preference datasets for DPO training
- Preparing batches with dynamic padding for memory-efficient training
- Working with vision-language DPO (pixel values are handled by the collator)
Code Reference
Source Location
- Repository: TRL
- File (dataset loading):
trl/scripts/dpo.py(lines 121-135) - File (DataCollatorForPreference):
trl/trainer/dpo_trainer.py(lines 108-187) - File (get_dataset):
trl/scripts/utils.py(lines 414-475)
Signature
@dataclass
class DataCollatorForPreference(DataCollatorMixin):
"""Data collator for preference data with dynamic padding."""
pad_token_id: int
return_tensors: str = "pt"
def torch_call(
self, examples: list[list[int] | Any | dict[str, Any]]
) -> dict[str, Any]:
"""
Collates examples into a padded batch.
Expected keys in each example:
- "prompt_input_ids": list[int]
- "chosen_input_ids": list[int]
- "rejected_input_ids": list[int]
- "pixel_values" (optional): for vision models
- "ref_chosen_logps" (optional): precomputed reference log probs
- "ref_rejected_logps" (optional): precomputed reference log probs
"""
def get_dataset(mixture_config: DatasetMixtureConfig) -> DatasetDict:
"""Load a mixture of datasets based on the configuration."""
@staticmethod
def tokenize_row(
features: dict[str, str],
processing_class: PreTrainedTokenizerBase,
max_prompt_length: int | None = None,
max_completion_length: int | None = None,
add_special_tokens: bool = True,
is_chat: bool = False,
) -> dict[str, list[int]]:
"""Tokenize a row into prompt_input_ids, chosen_input_ids, rejected_input_ids."""
Import
from trl.trainer.dpo_trainer import DataCollatorForPreference
from trl import get_dataset, DatasetMixtureConfig
from datasets import load_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pad_token_id | int |
Yes | Token ID used for padding sequences in the collator |
| dataset_name | str |
Conditional | Hugging Face Hub dataset ID or local path (used when DatasetMixtureConfig is not provided) |
| datasets (mixture) | list[DatasetConfig] |
Conditional | List of dataset configurations for mixing multiple sources |
| prompt column | str |
Yes | The prompt text or conversation messages |
| chosen column | str |
Yes | The preferred response text or messages |
| rejected column | str |
Yes | The dispreferred response text or messages |
Outputs
| Name | Type | Description |
|---|---|---|
| prompt_input_ids | torch.Tensor |
Left-padded prompt token IDs, shape (batch_size, max_prompt_len)
|
| prompt_attention_mask | torch.Tensor |
Attention mask for prompts, shape (batch_size, max_prompt_len)
|
| chosen_input_ids | torch.Tensor |
Right-padded chosen completion token IDs, shape (batch_size, max_chosen_len)
|
| chosen_attention_mask | torch.Tensor |
Attention mask for chosen completions, shape (batch_size, max_chosen_len)
|
| rejected_input_ids | torch.Tensor |
Right-padded rejected completion token IDs, shape (batch_size, max_rejected_len)
|
| rejected_attention_mask | torch.Tensor |
Attention mask for rejected completions, shape (batch_size, max_rejected_len)
|
| ref_chosen_logps | torch.Tensor (optional) |
Precomputed reference log probs for chosen responses |
| ref_rejected_logps | torch.Tensor (optional) |
Precomputed reference log probs for rejected responses |
Usage Examples
# Example 1: Loading a preference dataset from the Hub
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized")
# dataset["train"] contains columns: prompt, chosen, rejected
# Example 2: Using DataCollatorForPreference directly
from trl.trainer.dpo_trainer import DataCollatorForPreference
collator = DataCollatorForPreference(pad_token_id=0)
examples = [
{
"prompt_input_ids": [1, 2, 3],
"chosen_input_ids": [4, 5],
"rejected_input_ids": [6],
},
{
"prompt_input_ids": [7, 8],
"chosen_input_ids": [9, 10],
"rejected_input_ids": [11, 12, 13],
},
]
batch = collator(examples)
# batch["prompt_input_ids"].shape == (2, 3) # left-padded
# batch["chosen_input_ids"].shape == (2, 2) # right-padded
# batch["rejected_input_ids"].shape == (2, 3) # right-padded
# Example 3: Using dataset mixture configuration
from trl import DatasetMixtureConfig, get_dataset
from trl.scripts.utils import DatasetConfig
mixture_config = DatasetMixtureConfig(
datasets=[
DatasetConfig(path="trl-lib/ultrafeedback_binarized", split="train"),
],
test_split_size=0.05,
)
dataset = get_dataset(mixture_config)
# Returns DatasetDict with "train" and "test" splits
# Example 4: Tokenizing a single row
from transformers import AutoTokenizer
from trl import DPOTrainer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
features = {
"prompt": "What is the capital of France?",
"chosen": " The capital of France is Paris.",
"rejected": " France is a country in Europe.",
}
tokenized = DPOTrainer.tokenize_row(
features,
processing_class=tokenizer,
max_prompt_length=512,
max_completion_length=None,
add_special_tokens=False,
is_chat=False,
)
# Returns: {"prompt_input_ids": [...], "chosen_input_ids": [...], "rejected_input_ids": [...]}