Implementation:Huggingface Trl Get Dataset DataCollatorForPreference DPO

Knowledge Sources	TRL TRL Docs
Domains	NLP, RLHF
Last Updated	2026-02-06 17:00 GMT

Overview

Concrete tool for loading preference datasets and collating them into padded batches for DPO training, provided by the TRL library.

Description

The DPO data pipeline consists of two main components:

Dataset loading (in trl/scripts/dpo.py) supports two paths:

DatasetMixtureConfig path: When dataset_args.datasets is provided, the get_dataset function loads and concatenates multiple datasets according to the mixture configuration, optionally creating a train/test split.
Single dataset path: When script_args.dataset_name is provided, datasets.load_dataset loads a single dataset with optional config name and streaming mode.

DataCollatorForPreference (in trl/trainer/dpo_trainer.py) is a dataclass-based collator that:

Converts lists of integer IDs to tensors for prompt, chosen, and rejected sequences
Left-pads prompt sequences (so the actual tokens are right-aligned for causal attention)
Right-pads chosen and rejected completion sequences to the maximum length in the batch
Creates corresponding attention masks (1 for real tokens, 0 for padding)
Handles optional pixel values, pixel attention masks, image sizes (for vision-language models)
Passes through precomputed reference log probabilities when available

The DPOTrainer's _prepare_dataset method orchestrates preprocessing: it extracts prompts via maybe_extract_prompt, applies chat templates via maybe_apply_chat_template, and tokenizes rows via the static tokenize_row method.

Usage

Use these components when:

Loading standard preference datasets from the Hugging Face Hub
Mixing multiple preference datasets for DPO training
Preparing batches with dynamic padding for memory-efficient training
Working with vision-language DPO (pixel values are handled by the collator)

Code Reference

Source Location

Repository: TRL
File (dataset loading): trl/scripts/dpo.py (lines 121-135)
File (DataCollatorForPreference): trl/trainer/dpo_trainer.py (lines 108-187)
File (get_dataset): trl/scripts/utils.py (lines 414-475)

Signature

@dataclass
class DataCollatorForPreference(DataCollatorMixin):
    """Data collator for preference data with dynamic padding."""

    pad_token_id: int
    return_tensors: str = "pt"

    def torch_call(
        self, examples: list[list[int] | Any | dict[str, Any]]
    ) -> dict[str, Any]:
        """
        Collates examples into a padded batch.

        Expected keys in each example:
          - "prompt_input_ids": list[int]
          - "chosen_input_ids": list[int]
          - "rejected_input_ids": list[int]
          - "pixel_values" (optional): for vision models
          - "ref_chosen_logps" (optional): precomputed reference log probs
          - "ref_rejected_logps" (optional): precomputed reference log probs
        """

def get_dataset(mixture_config: DatasetMixtureConfig) -> DatasetDict:
    """Load a mixture of datasets based on the configuration."""

@staticmethod
def tokenize_row(
    features: dict[str, str],
    processing_class: PreTrainedTokenizerBase,
    max_prompt_length: int | None = None,
    max_completion_length: int | None = None,
    add_special_tokens: bool = True,
    is_chat: bool = False,
) -> dict[str, list[int]]:
    """Tokenize a row into prompt_input_ids, chosen_input_ids, rejected_input_ids."""

Import

from trl.trainer.dpo_trainer import DataCollatorForPreference
from trl import get_dataset, DatasetMixtureConfig
from datasets import load_dataset

I/O Contract

Inputs

Name	Type	Required	Description
pad_token_id	`int`	Yes	Token ID used for padding sequences in the collator
dataset_name	`str`	Conditional	Hugging Face Hub dataset ID or local path (used when DatasetMixtureConfig is not provided)
datasets (mixture)	`list[DatasetConfig]`	Conditional	List of dataset configurations for mixing multiple sources
prompt column	`str`	Yes	The prompt text or conversation messages
chosen column	`str`	Yes	The preferred response text or messages
rejected column	`str`	Yes	The dispreferred response text or messages

Outputs

Name	Type	Description
prompt_input_ids	`torch.Tensor`	Left-padded prompt token IDs, shape `(batch_size, max_prompt_len)`
prompt_attention_mask	`torch.Tensor`	Attention mask for prompts, shape `(batch_size, max_prompt_len)`
chosen_input_ids	`torch.Tensor`	Right-padded chosen completion token IDs, shape `(batch_size, max_chosen_len)`
chosen_attention_mask	`torch.Tensor`	Attention mask for chosen completions, shape `(batch_size, max_chosen_len)`
rejected_input_ids	`torch.Tensor`	Right-padded rejected completion token IDs, shape `(batch_size, max_rejected_len)`
rejected_attention_mask	`torch.Tensor`	Attention mask for rejected completions, shape `(batch_size, max_rejected_len)`
ref_chosen_logps	`torch.Tensor` (optional)	Precomputed reference log probs for chosen responses
ref_rejected_logps	`torch.Tensor` (optional)	Precomputed reference log probs for rejected responses

Usage Examples

# Example 1: Loading a preference dataset from the Hub
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized")
# dataset["train"] contains columns: prompt, chosen, rejected

# Example 2: Using DataCollatorForPreference directly
from trl.trainer.dpo_trainer import DataCollatorForPreference

collator = DataCollatorForPreference(pad_token_id=0)

examples = [
    {
        "prompt_input_ids": [1, 2, 3],
        "chosen_input_ids": [4, 5],
        "rejected_input_ids": [6],
    },
    {
        "prompt_input_ids": [7, 8],
        "chosen_input_ids": [9, 10],
        "rejected_input_ids": [11, 12, 13],
    },
]

batch = collator(examples)
# batch["prompt_input_ids"].shape == (2, 3)   # left-padded
# batch["chosen_input_ids"].shape == (2, 2)   # right-padded
# batch["rejected_input_ids"].shape == (2, 3) # right-padded

# Example 3: Using dataset mixture configuration
from trl import DatasetMixtureConfig, get_dataset
from trl.scripts.utils import DatasetConfig

mixture_config = DatasetMixtureConfig(
    datasets=[
        DatasetConfig(path="trl-lib/ultrafeedback_binarized", split="train"),
    ],
    test_split_size=0.05,
)
dataset = get_dataset(mixture_config)
# Returns DatasetDict with "train" and "test" splits

# Example 4: Tokenizing a single row
from transformers import AutoTokenizer
from trl import DPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
features = {
    "prompt": "What is the capital of France?",
    "chosen": " The capital of France is Paris.",
    "rejected": " France is a country in Europe.",
}

tokenized = DPOTrainer.tokenize_row(
    features,
    processing_class=tokenizer,
    max_prompt_length=512,
    max_completion_length=None,
    add_special_tokens=False,
    is_chat=False,
)
# Returns: {"prompt_input_ids": [...], "chosen_input_ids": [...], "rejected_input_ids": [...]}

Related Pages

Implements Principle

Principle:Huggingface_Trl_DPO_Preference_Dataset_Loading

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment