Principle:Hiyouga LLaMA Factory Kahneman Tversky Optimization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Language Model Alignment, Preference Learning, Behavioral Economics |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A preference alignment technique inspired by Kahneman and Tversky's prospect theory that aligns language models using per-example binary feedback (desirable/undesirable) rather than pairwise preference comparisons.
Description
Kahneman-Tversky Optimization (KTO), introduced by Ethayarajh et al. (2024), is an alignment method that draws on insights from behavioral economics. Unlike DPO, which requires paired chosen/rejected responses for the same prompt, KTO operates on unpaired binary feedback: each response is independently labeled as either desirable or undesirable. This is grounded in prospect theory's observation that humans evaluate outcomes relative to a reference point, with losses being weighted more heavily than equivalent gains.
KTO is significant because:
- Lower data requirements: It does not require paired preferences -- each example needs only a single binary label (thumbs up or thumbs down).
- More natural feedback signal: Binary approval/disapproval is easier to collect at scale than pairwise comparisons.
- Asymmetric loss weighting: Desirable and undesirable examples can be weighted differently, reflecting the empirical finding that humans are more sensitive to losses than to gains.
- KL-anchored alignment: A KL divergence term computed over separate KL examples prevents the policy from deviating too far from the reference distribution.
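To make the difference between paired and unpaired feedback concrete, the following minimal sketch splits a DPO-style paired record into two independently labeled examples. The field names (`prompt`, `chosen`, `rejected`, `response`, `desirable`) are hypothetical and chosen only for illustration; they are not claimed to match any particular dataset schema.

```python
from typing import Dict, List


def unpair_preferences(pairs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Split each (chosen, rejected) pair into two independently labeled examples."""
    examples: List[Dict[str, object]] = []
    for pair in pairs:
        # The chosen response becomes a standalone desirable example.
        examples.append(
            {"prompt": pair["prompt"], "response": pair["chosen"], "desirable": True}
        )
        # The rejected response becomes a standalone undesirable example.
        examples.append(
            {"prompt": pair["prompt"], "response": pair["rejected"], "desirable": False}
        )
    return examples


# Hypothetical paired record, as a DPO-style dataset might store it.
pairs = [{"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}]
feedback = unpair_preferences(pairs)
```

The reverse direction is not required: a dataset that only ever contained single thumbs-up or thumbs-down labels can be used as-is, without any paired counterpart.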
Usage
Use KTO when you want to:
- Align a language model using binary feedback data (approve/reject per response) rather than paired comparisons.
- Leverage existing datasets where responses are independently rated without paired alternatives.
- Apply asymmetric weighting to penalize bad outputs more heavily than rewarding good ones.
- Avoid the paired data requirement of DPO while maintaining stable alignment.
KTO is particularly suitable when collecting pairwise preferences is impractical, such as in production settings where user feedback is collected as binary thumbs-up/thumbs-down signals.
Theoretical Basis
Prospect Theory Foundation
KTO is grounded in Kahneman and Tversky's prospect theory, which models human decision-making under uncertainty. The key insight is that humans evaluate outcomes as gains or losses relative to a reference point rather than in absolute terms, and that the value function is asymmetric: losses loom larger than gains.
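The implicit reward for a response $y$ given prompt $x$ is the log-ratio of the policy probability to the reference probability:
$$r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$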
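As a rough illustration, the implicit reward can be computed as the difference between the summed token log-probabilities of a response under the policy and under the reference model. The function names and the toy log-probability values below are assumptions made for this sketch; in practice the log-probabilities would come from forward passes of the two models.

```python
from typing import List


def sequence_logprob(token_logprobs: List[float]) -> float:
    """Log-probability of a whole response: the sum of its token log-probabilities."""
    return sum(token_logprobs)


def implicit_reward(
    policy_token_logprobs: List[float], ref_token_logprobs: List[float]
) -> float:
    """Log-ratio of policy to reference probability for one response."""
    return sequence_logprob(policy_token_logprobs) - sequence_logprob(ref_token_logprobs)


# Toy values: the policy assigns a higher probability to the response than the
# reference model does, so the implicit reward is positive.
reward = implicit_reward([-0.5, -1.0], [-1.0, -1.5])
```

A positive value means the policy has moved probability mass toward the response relative to the reference model; a negative value means it has moved away.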
KTO Loss Function
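The KTO loss separates the treatment of desirable and undesirable examples:
$$\mathcal{L}_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \lambda_y - v(x, y) \right]$$
where the value function $v$ is defined differently for desirable and undesirable examples:
$$v(x, y) = \begin{cases} \lambda_D \, \sigma\big(\beta \, (r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable} \\ \lambda_U \, \sigma\big(\beta \, (z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable} \end{cases}$$
Here $r_\theta(x, y)$ is the implicit reward, $z_0$ is the KL-based reference point, $\sigma$ is the sigmoid function, $\beta$ is a scaling coefficient, and $\lambda_y$ is the per-class weight:
$$\lambda_y = \begin{cases} \lambda_D & \text{if } y \text{ is desirable} \\ \lambda_U & \text{if } y \text{ is undesirable} \end{cases}$$
The weights $\lambda_D$ (desirable weight) and $\lambda_U$ (undesirable weight) allow asymmetric treatment, reflecting prospect theory's loss aversion. Choosing $\lambda_U > \lambda_D$ penalizes undesirable behavior more strongly.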
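The loss described above can be sketched in plain Python on precomputed implicit rewards. This is a minimal illustration under stated assumptions, not the library's implementation: the parameter names (`beta`, `lambda_d`, `lambda_u`, `z0`) and their default values are chosen for this example, and the reference point is treated as a fixed number.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(
    rewards: List[float],
    labels: List[bool],
    z0: float,
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    """Mean KTO loss over a batch of (implicit reward, desirable?) examples."""
    losses = []
    for reward, desirable in zip(rewards, labels):
        if desirable:
            # Desirable: value grows as the reward rises above the reference point.
            value = lambda_d * sigmoid(beta * (reward - z0))
            losses.append(lambda_d - value)
        else:
            # Undesirable: value grows as the reward falls below the reference point.
            value = lambda_u * sigmoid(beta * (z0 - reward))
            losses.append(lambda_u - value)
    return sum(losses) / len(losses)


# Hypothetical batch: one desirable and one undesirable example, with the
# undesirable class weighted twice as heavily.
batch_loss = kto_loss([1.2, 0.4], [True, False], z0=0.1, lambda_u=2.0)
```

Because each example contributes its class weight minus its value, the per-example loss is bounded between 0 and the corresponding weight, and an example whose reward sits exactly at the reference point contributes half of its weight.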
KL Reference Point
The KL reference point $z_0$ is estimated using separate KL examples. For each training example, an additional KL response is sampled to estimate the KL divergence between the policy and the reference model, ensuring the policy stays anchored to the reference model distribution.
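One way to obtain KL examples without extra generation is to pair each prompt with a response taken from a different example in the same batch, then average the policy-to-reference log-ratio over those mismatched pairs and clamp the result at zero. The sketch below assumes that in-batch strategy; the cyclic shift and the function names are illustrative choices, not a description of how any specific library builds its KL examples.

```python
from typing import List, Tuple


def mismatched_pairs(prompts: List[str], responses: List[str]) -> List[Tuple[str, str]]:
    """Pair each prompt with the response of the next example (cyclic shift)."""
    shifted = responses[1:] + responses[:1]
    return list(zip(prompts, shifted))


def estimate_reference_point(
    policy_kl_logps: List[float], ref_kl_logps: List[float]
) -> float:
    """Batch estimate of the KL reference point from KL-example log-probabilities.

    Each entry is the sequence log-probability of a KL response under the
    policy or the reference model. The mean log-ratio is clamped at zero
    because a KL divergence cannot be negative.
    """
    log_ratios = [p - r for p, r in zip(policy_kl_logps, ref_kl_logps)]
    return max(0.0, sum(log_ratios) / len(log_ratios))


kl_examples = mismatched_pairs(["p1", "p2", "p3"], ["r1", "r2", "r3"])
# Toy log-probabilities standing in for model outputs on the KL examples.
z0 = estimate_reference_point([-2.0, -3.0], [-2.5, -3.5])
```

In a training loop the estimate would normally be treated as a constant for the current batch, so that gradients flow only through the implicit rewards of the labeled examples.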
Auxiliary SFT Loss
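Similar to DPO, an optional auxiliary SFT loss on desirable examples can be added:
$$\mathcal{L} = \mathcal{L}_{\text{KTO}} + \gamma \, \mathcal{L}_{\text{SFT}}, \qquad \mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{desirable}}} \left[ \log \pi_\theta(y \mid x) \right]$$
where $\gamma \ge 0$ is a weighting coefficient for the SFT term; setting $\gamma = 0$ disables it.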
This helps maintain generation quality while aligning to preference feedback.
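A minimal sketch of the combined objective follows, assuming the SFT term is a token-averaged negative log-likelihood computed only on desirable examples and added with a coefficient named `gamma`. Both the normalization and the name are assumptions made for illustration.

```python
from typing import List


def sft_loss(token_logprobs: List[List[float]], labels: List[bool]) -> float:
    """Mean negative log-likelihood per token, over desirable examples only."""
    selected = [
        logp
        for example, desirable in zip(token_logprobs, labels)
        if desirable
        for logp in example
    ]
    if not selected:
        # No desirable example in the batch: the auxiliary term contributes nothing.
        return 0.0
    return -sum(selected) / len(selected)


def combined_loss(kto: float, sft: float, gamma: float) -> float:
    """KTO loss plus a weighted auxiliary SFT term."""
    return kto + gamma * sft


# Toy batch: the first example is desirable, the second is not and is ignored
# by the SFT term.
aux = sft_loss([[-1.0, -3.0], [-5.0]], [True, False])
total = combined_loss(0.5, aux, gamma=0.1)
```

Undesirable examples are excluded from the SFT term because maximizing their likelihood would work against the alignment signal.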