Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Trl Get Dataset DataCollatorForPreference DPO

From Leeroopedia


Knowledge Sources
Domains NLP, RLHF
Last Updated 2026-02-06 17:00 GMT

Overview

Concrete tool for loading preference datasets and collating them into padded batches for DPO training, provided by the TRL library.

Description

The DPO data pipeline consists of two main components:

Dataset loading (in trl/scripts/dpo.py) supports two paths:

  • DatasetMixtureConfig path: When dataset_args.datasets is provided, the get_dataset function loads and concatenates multiple datasets according to the mixture configuration, optionally creating a train/test split.
  • Single dataset path: When script_args.dataset_name is provided, datasets.load_dataset loads a single dataset with optional config name and streaming mode.

DataCollatorForPreference (in trl/trainer/dpo_trainer.py) is a dataclass-based collator that:

  1. Converts lists of integer IDs to tensors for prompt, chosen, and rejected sequences
  2. Left-pads prompt sequences (so the actual tokens are right-aligned for causal attention)
  3. Right-pads chosen and rejected completion sequences to the maximum length in the batch
  4. Creates corresponding attention masks (1 for real tokens, 0 for padding)
  5. Handles optional pixel values, pixel attention masks, image sizes (for vision-language models)
  6. Passes through precomputed reference log probabilities when available

The DPOTrainer's _prepare_dataset method orchestrates preprocessing: it extracts prompts via maybe_extract_prompt, applies chat templates via maybe_apply_chat_template, and tokenizes rows via the static tokenize_row method.

Usage

Use these components when:

  • Loading standard preference datasets from the Hugging Face Hub
  • Mixing multiple preference datasets for DPO training
  • Preparing batches with dynamic padding for memory-efficient training
  • Working with vision-language DPO (pixel values are handled by the collator)

Code Reference

Source Location

  • Repository: TRL
  • File (dataset loading): trl/scripts/dpo.py (lines 121-135)
  • File (DataCollatorForPreference): trl/trainer/dpo_trainer.py (lines 108-187)
  • File (get_dataset): trl/scripts/utils.py (lines 414-475)

Signature

@dataclass
class DataCollatorForPreference(DataCollatorMixin):
    """Data collator for preference data with dynamic padding."""

    pad_token_id: int
    return_tensors: str = "pt"

    def torch_call(
        self, examples: list[list[int] | Any | dict[str, Any]]
    ) -> dict[str, Any]:
        """
        Collates examples into a padded batch.

        Expected keys in each example:
          - "prompt_input_ids": list[int]
          - "chosen_input_ids": list[int]
          - "rejected_input_ids": list[int]
          - "pixel_values" (optional): for vision models
          - "ref_chosen_logps" (optional): precomputed reference log probs
          - "ref_rejected_logps" (optional): precomputed reference log probs
        """
def get_dataset(mixture_config: DatasetMixtureConfig) -> DatasetDict:
    """Load a mixture of datasets based on the configuration."""
@staticmethod
def tokenize_row(
    features: dict[str, str],
    processing_class: PreTrainedTokenizerBase,
    max_prompt_length: int | None = None,
    max_completion_length: int | None = None,
    add_special_tokens: bool = True,
    is_chat: bool = False,
) -> dict[str, list[int]]:
    """Tokenize a row into prompt_input_ids, chosen_input_ids, rejected_input_ids."""

Import

from trl.trainer.dpo_trainer import DataCollatorForPreference
from trl import get_dataset, DatasetMixtureConfig
from datasets import load_dataset

I/O Contract

Inputs

Name Type Required Description
pad_token_id int Yes Token ID used for padding sequences in the collator
dataset_name str Conditional Hugging Face Hub dataset ID or local path (used when DatasetMixtureConfig is not provided)
datasets (mixture) list[DatasetConfig] Conditional List of dataset configurations for mixing multiple sources
prompt column str Yes The prompt text or conversation messages
chosen column str Yes The preferred response text or messages
rejected column str Yes The dispreferred response text or messages

Outputs

Name Type Description
prompt_input_ids torch.Tensor Left-padded prompt token IDs, shape (batch_size, max_prompt_len)
prompt_attention_mask torch.Tensor Attention mask for prompts, shape (batch_size, max_prompt_len)
chosen_input_ids torch.Tensor Right-padded chosen completion token IDs, shape (batch_size, max_chosen_len)
chosen_attention_mask torch.Tensor Attention mask for chosen completions, shape (batch_size, max_chosen_len)
rejected_input_ids torch.Tensor Right-padded rejected completion token IDs, shape (batch_size, max_rejected_len)
rejected_attention_mask torch.Tensor Attention mask for rejected completions, shape (batch_size, max_rejected_len)
ref_chosen_logps torch.Tensor (optional) Precomputed reference log probs for chosen responses
ref_rejected_logps torch.Tensor (optional) Precomputed reference log probs for rejected responses

Usage Examples

# Example 1: Loading a preference dataset from the Hub
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized")
# dataset["train"] contains columns: prompt, chosen, rejected
# Example 2: Using DataCollatorForPreference directly
from trl.trainer.dpo_trainer import DataCollatorForPreference

collator = DataCollatorForPreference(pad_token_id=0)

examples = [
    {
        "prompt_input_ids": [1, 2, 3],
        "chosen_input_ids": [4, 5],
        "rejected_input_ids": [6],
    },
    {
        "prompt_input_ids": [7, 8],
        "chosen_input_ids": [9, 10],
        "rejected_input_ids": [11, 12, 13],
    },
]

batch = collator(examples)
# batch["prompt_input_ids"].shape == (2, 3)   # left-padded
# batch["chosen_input_ids"].shape == (2, 2)   # right-padded
# batch["rejected_input_ids"].shape == (2, 3) # right-padded
# Example 3: Using dataset mixture configuration
from trl import DatasetMixtureConfig, get_dataset
from trl.scripts.utils import DatasetConfig

mixture_config = DatasetMixtureConfig(
    datasets=[
        DatasetConfig(path="trl-lib/ultrafeedback_binarized", split="train"),
    ],
    test_split_size=0.05,
)
dataset = get_dataset(mixture_config)
# Returns DatasetDict with "train" and "test" splits
# Example 4: Tokenizing a single row
from transformers import AutoTokenizer
from trl import DPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
features = {
    "prompt": "What is the capital of France?",
    "chosen": " The capital of France is Paris.",
    "rejected": " France is a country in Europe.",
}

tokenized = DPOTrainer.tokenize_row(
    features,
    processing_class=tokenizer,
    max_prompt_length=512,
    max_completion_length=None,
    add_special_tokens=False,
    is_chat=False,
)
# Returns: {"prompt_input_ids": [...], "chosen_input_ids": [...], "rejected_input_ids": [...]}

Related Pages

Implements Principle

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment