Principle:Eric mitchell Direct preference optimization Tokenization

Knowledge Sources	HuggingFace Tokenizers, Direct Preference Optimization
Domains	Preprocessing, NLP, Text_Processing
Last Updated	2026-02-08 02:00 GMT

Overview

A text-to-tokens conversion technique that encodes prompt-response pairs into token ID sequences with proper truncation, label masking, and EOS token handling for preference-based training.

Description

Tokenization for preference training converts a (prompt, chosen_response, rejected_response) triple into tokenized sequences suitable for language model training. The key challenges are:

Prompt-response concatenation: The prompt and response are tokenized separately, then concatenated, allowing precise control over truncation boundaries.
Label masking: Prompt tokens are masked with -100 in the labels so that the loss is only computed on response tokens. This ensures the model learns to generate responses, not to memorize prompts.
Truncation strategy: If the combined sequence exceeds max_length, the prompt is truncated first (dropping tokens from its start or its end, depending on the dataset). The response is truncated only if the sequence is still too long afterwards.
EOS token: An EOS token is appended to each response to teach the model when to stop generating.
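
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the referenced implementation: `toy_tokenize`, `EOS_ID`, `tokenize_triple`, and the parameters `max_prompt_length` and `keep` are hypothetical names, a character-level stand-in replaces a real tokenizer, and the sketch assumes the longer of the two responses decides whether truncation is needed so that both sequences share the same prompt tokens.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips
EOS_ID = 0           # hypothetical end-of-sequence ID for this toy vocabulary


def toy_tokenize(text):
    """Stand-in tokenizer: one token per character (not a real tokenizer)."""
    return [ord(ch) for ch in text]


def tokenize_triple(prompt, chosen, rejected, max_length, max_prompt_length, keep="end"):
    """Tokenize a (prompt, chosen, rejected) triple into two masked sequences."""
    assert 0 < max_prompt_length < max_length

    # Prompt and responses are tokenized separately; each response gets an EOS.
    prompt_ids = toy_tokenize(prompt)
    responses = {
        "chosen": toy_tokenize(chosen) + [EOS_ID],
        "rejected": toy_tokenize(rejected) + [EOS_ID],
    }
    longest = max(len(ids) for ids in responses.values())

    # Truncate the prompt first, keeping its start or its end.
    if len(prompt_ids) + longest > max_length:
        if keep == "start":
            prompt_ids = prompt_ids[:max_prompt_length]
        else:
            prompt_ids = prompt_ids[-max_prompt_length:]

    # If the sequence is still too long, truncate the responses as well.
    if len(prompt_ids) + longest > max_length:
        budget = max_length - len(prompt_ids)
        responses = {name: ids[:budget] for name, ids in responses.items()}

    # Concatenate, and mask the prompt positions in the labels.
    out = {}
    for name, response_ids in responses.items():
        out[name] = {
            "input_ids": prompt_ids + response_ids,
            "labels": [IGNORE_INDEX] * len(prompt_ids) + response_ids,
        }
    return out
```

Note that in this sketch a response that has to be truncated loses its trailing EOS token, because the EOS is appended before the length check.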

Usage

Use this principle when preparing individual examples for language model training on preference data. Each example is tokenized independently, then batched and padded by the collation function.
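
To illustrate the batching step, the sketch below pads independently tokenized examples to a common length. It is an assumption-laden example rather than the project's collation function: `pad_batch` and `pad_id` are hypothetical names, and right-padding is assumed. Padded label positions are set to -100 so that they are ignored by the loss, just like prompt tokens.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def pad_batch(examples, pad_id=0):
    """Right-pad tokenized examples to the longest sequence in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_tokens = len(ex["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padding never contributes to the loss.
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch
```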

Theoretical Basis

For autoregressive language models, the training objective is to predict the next token given all previous tokens. Label masking ensures that only the response tokens contribute to the loss:

$\mathcal{L} = -\sum_{t \in \text{response}} \log P_{\theta}(y_{t} \mid x, y_{<t})$

where tokens in the prompt region have labels set to -100 (PyTorch's ignore index), excluding them from the cross-entropy computation.
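
The effect of the mask can be checked with a small pure-Python sketch. The function name `masked_nll` and the input format are hypothetical: `token_probs[t]` is assumed to be the probability the model assigned to the correct token at position t, already aligned with `labels`. The function sums over response positions, matching the formula above.

```python
import math

IGNORE_INDEX = -100  # positions with this label are skipped


def masked_nll(token_probs, labels):
    """Sum -log p over the positions whose label is not IGNORE_INDEX."""
    return -sum(
        math.log(p)
        for p, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    )


# Two prompt positions (masked) followed by two response positions.
labels = [IGNORE_INDEX, IGNORE_INDEX, 7, 8]
loss = masked_nll([0.9, 0.1, 0.5, 0.25], labels)

# Changing the probabilities at masked positions leaves the loss unchanged.
same_loss = masked_nll([0.2, 0.7, 0.5, 0.25], labels)
```

Here only the last two positions contribute, so the loss equals -(log 0.5 + log 0.25) regardless of how well the model predicts the prompt tokens.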

Pseudo-code:

# Abstract tokenization (NOT actual implementation)
prompt_tokens = tokenize(prompt)
response_tokens = tokenize(response) + [EOS]
prompt_tokens, response_tokens = truncate_if_too_long(prompt_tokens, response_tokens, max_length)
input_ids = prompt_tokens + response_tokens
labels = [-100] * len(prompt_tokens) + response_tokens

Related Pages

Implemented By

Implementation:Eric_mitchell_Direct_preference_optimization_Tokenize_Batch_Element
