Principle:Hiyouga LLaMA Factory Kahneman Tversky Optimization

Knowledge Sources	Hiyouga_LLaMA_Factory; KTO: Model Alignment as Prospect Theoretic Optimization
Domains	Natural Language Processing, Language Model Alignment, Preference Learning, Behavioral Economics
Last Updated	2026-02-06 19:00 GMT

Overview

A preference alignment technique inspired by Kahneman and Tversky's prospect theory that aligns language models using per-example binary feedback (desirable/undesirable) rather than pairwise preference comparisons.

Description

Kahneman-Tversky Optimization (KTO), introduced by Ethayarajh et al. (2024), is an alignment method that draws on insights from behavioral economics. Unlike DPO, which requires paired chosen/rejected responses for the same prompt, KTO operates on unpaired binary feedback: each response is independently labeled as either desirable or undesirable. This is grounded in prospect theory's observation that humans evaluate outcomes relative to a reference point, with losses being weighted more heavily than equivalent gains.

KTO is significant because:

Lower data requirements: It does not require paired preferences; each example needs only a single binary label (thumbs up or thumbs down).
More natural feedback signal: Binary approval/disapproval is easier to collect at scale than pairwise comparisons.
Asymmetric loss weighting: Desirable and undesirable examples can be weighted differently, reflecting the empirical finding that humans are more sensitive to losses than to gains.
KL-anchored alignment: A KL divergence term computed over separate KL examples prevents the policy from deviating too far from the reference distribution.

Usage

Use KTO when you want to:

Align a language model using binary feedback data (approve/reject per response) rather than paired comparisons.
Leverage existing datasets where responses are independently rated without paired alternatives.
Apply asymmetric weighting to penalize bad outputs more heavily than rewarding good ones.
Avoid the paired data requirement of DPO while maintaining stable alignment.

KTO is particularly suitable when collecting pairwise preferences is impractical, such as in production settings where user feedback is collected as binary thumbs-up/thumbs-down signals.
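To make the difference between paired and unpaired feedback concrete, the following minimal sketch splits DPO-style pairs into independently labeled examples of the kind KTO consumes. The field names (`prompt`, `chosen`, `rejected`, `response`, `desirable`) are hypothetical and chosen for illustration; they are not the column names of any particular dataset format.

```python
def unpair(pairs: list[dict]) -> list[dict]:
    """Split each (chosen, rejected) pair into two independently labeled examples."""
    examples = []
    for pair in pairs:
        # The chosen response becomes a desirable example.
        examples.append(
            {"prompt": pair["prompt"], "response": pair["chosen"], "desirable": True}
        )
        # The rejected response becomes an undesirable example.
        examples.append(
            {"prompt": pair["prompt"], "response": pair["rejected"], "desirable": False}
        )
    return examples


# Hypothetical paired record, as a DPO-style dataset might store it.
pairs = [{"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}]
binary_examples = unpair(pairs)
```

The reverse conversion is not possible in general: binary feedback collected in production usually has no matching alternative response for the same prompt, which is the case KTO is designed for.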

Theoretical Basis

Prospect Theory Foundation

KTO is grounded in Kahneman and Tversky's prospect theory, which models human decision-making under uncertainty. The key insight is that humans evaluate outcomes as gains or losses relative to a reference point rather than in absolute terms, and that the value function is asymmetric: losses loom larger than gains.

The implicit reward for a response $y$ given prompt $x$ is:

$r_{θ} (x, y) = β \log \frac{π_{θ} (y ∣ x)}{π_{ref} (y ∣ x)}$
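As a rough illustration, the implicit reward can be computed in log space from sequence log-probabilities. The sketch below assumes that the log-probability of a response is the sum of its token log-probabilities; the helper names, the example values, and the default `beta=0.1` are illustrative assumptions.

```python
def sequence_logp(token_logps: list[float]) -> float:
    """Log-probability of a whole response: the sum of its token log-probs."""
    return sum(token_logps)


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed in log space."""
    return beta * (policy_logp - ref_logp)


# Hypothetical per-token log-probs for one response under each model.
policy_tokens = [-1.0, -0.5, -2.0]
ref_tokens = [-1.5, -1.0, -2.0]
reward = implicit_reward(
    sequence_logp(policy_tokens), sequence_logp(ref_tokens), beta=0.1
)
```

A positive reward means the policy assigns the response more probability than the reference model does; a negative reward means it assigns less.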

KTO Loss Function

The KTO loss separates the treatment of desirable and undesirable examples:

$ℒ_{KTO} (θ) = 𝔼_{(x, y)} [w (y) \cdot (1 - v_{θ} (x, y))]$

where the value function $v_{θ}$ is defined differently for desirable and undesirable examples:

$v_{θ} (x, y) = \begin{cases} σ (r_{θ} (x, y) - z_{ref}) & \text{if } y \text{ is desirable} \\ σ (z_{ref} - r_{θ} (x, y)) & \text{if } y \text{ is undesirable} \end{cases}$

Here $z_{ref} = 𝔼_{x^{'} \sim 𝒟} [β KL (π_{θ} (\cdot ∣ x^{'}) ‖ π_{ref} (\cdot ∣ x^{'}))]$ is the KL-based reference point, $σ$ is the sigmoid function, and $w (y)$ is the per-class weight:

$w (y) = \begin{cases} λ_{D} & \text{if } y \text{ is desirable} \\ λ_{U} & \text{if } y \text{ is undesirable} \end{cases}$

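The weights $λ_{D}$ (desirable weight) and $λ_{U}$ (undesirable weight) allow asymmetric treatment, reflecting prospect theory's loss aversion. Setting $λ_{U} > λ_{D}$ penalizes undesirable behavior more strongly; the weights can also be adjusted to offset an imbalance between the number of desirable and undesirable examples.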
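The loss defined above can be sketched in plain Python as follows. This is a minimal illustration, not a training implementation: it takes precomputed implicit rewards and a fixed `z_ref` as inputs, and the function and parameter names are assumptions made for this example.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_loss(
    rewards: list[float],
    desirable: list[bool],
    z_ref: float,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    """Mean of w(y) * (1 - v(x, y)) over a batch of labeled examples."""
    total = 0.0
    for r, is_desirable in zip(rewards, desirable):
        if is_desirable:
            # Desirable: value grows as the reward rises above the reference point.
            total += lambda_d * (1.0 - sigmoid(r - z_ref))
        else:
            # Undesirable: value grows as the reward falls below the reference point.
            total += lambda_u * (1.0 - sigmoid(z_ref - r))
    return total / len(rewards)


# Hypothetical batch: one desirable and one undesirable example.
loss = kto_loss([0.0, 0.0], [True, False], z_ref=0.0, lambda_d=1.0, lambda_u=2.0)
```

When every reward equals the reference point, each value is 0.5, so the loss reduces to half the average weight. Raising the reward of a desirable example, or lowering that of an undesirable one, reduces the loss.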

KL Reference Point

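The KL reference point $z_{ref}$ is estimated from separate KL examples rather than computed exactly. Each training prompt is paired with an additional KL response (in the KTO paper, a response taken from a different example in the same batch), and the log-ratio between the policy and the reference model on these pairs gives an estimate of the KL divergence. This keeps the policy anchored to the reference model distribution.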
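One way to picture this estimate is the sketch below. It builds mismatched prompt/response pairs by rotating the responses within a batch, then averages the policy-to-reference log-ratio on those pairs and clamps the result at zero, following the batch-level estimate described in the KTO paper as understood here. The rotation scheme and all names are assumptions for illustration. In a real training loop this quantity would also be treated as a constant with no gradient flowing through it.

```python
def build_kl_pairs(prompts: list[str], responses: list[str]) -> list[tuple[str, str]]:
    """Pair each prompt with the response of the next example (rotation by one).

    With a batch of size one this degenerates to the original pair.
    """
    rotated = responses[1:] + responses[:1]
    return list(zip(prompts, rotated))


def estimate_z_ref(
    kl_policy_logps: list[float], kl_ref_logps: list[float], beta: float = 0.1
) -> float:
    """beta * max(0, mean log-ratio) over the mismatched KL pairs."""
    log_ratios = [p - q for p, q in zip(kl_policy_logps, kl_ref_logps)]
    mean_log_ratio = sum(log_ratios) / len(log_ratios)
    # A KL divergence is non-negative, so a negative estimate is clamped to zero.
    return beta * max(0.0, mean_log_ratio)


kl_pairs = build_kl_pairs(["p1", "p2", "p3"], ["r1", "r2", "r3"])
# Hypothetical sequence log-probs of the KL responses under each model.
z_ref = estimate_z_ref([-10.0, -12.0], [-11.0, -14.0], beta=0.5)
```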

Auxiliary SFT Loss

Similar to DPO, an optional auxiliary SFT loss on desirable examples can be added:

$ℒ_{total} = ℒ_{KTO} + γ_{ftx} \cdot ℒ_{SFT} (y_{desirable})$

This helps maintain generation quality while aligning to preference feedback.
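A minimal sketch of the combined objective, assuming the SFT term is the mean negative log-likelihood per token over the desirable responses in the batch. The length normalization and the names `sft_loss`, `total_loss`, and `gamma_ftx` are choices made for this example rather than a statement of how any particular library computes it.

```python
def sft_loss(desirable_token_logps: list[list[float]]) -> float:
    """Mean negative log-likelihood per token over desirable responses."""
    if not desirable_token_logps:
        return 0.0  # No desirable examples in the batch: no SFT contribution.
    per_response = [-sum(logps) / len(logps) for logps in desirable_token_logps]
    return sum(per_response) / len(per_response)


def total_loss(
    kto: float, desirable_token_logps: list[list[float]], gamma_ftx: float = 0.0
) -> float:
    """L_total = L_KTO + gamma_ftx * L_SFT(desirable examples)."""
    return kto + gamma_ftx * sft_loss(desirable_token_logps)


# Hypothetical values: a KTO loss of 0.5 and one desirable response of two tokens.
combined = total_loss(0.5, [[-1.0, -3.0]], gamma_ftx=0.1)
```

With `gamma_ftx` set to zero the objective is the plain KTO loss, which matches the description of the SFT term as optional.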
