Principle:NVIDIA NeMo Aligner KTO Training

Knowledge Sources	NVIDIA_NeMo_Aligner
Domains	NLP, Alignment
Last Updated	2026-02-08 00:00 GMT

Overview

KTO (Kahneman-Tversky Optimization) is an alignment algorithm that uses binary feedback (desirable/undesirable) rather than paired preferences, drawing on prospect theory to asymmetrically weight gains and losses relative to a KL-based reference point.

Description

Unlike DPO, which requires paired preference data (a chosen and a rejected response for the same prompt), KTO operates on unpaired binary-labeled data. Each sample is independently labeled as either desirable (good) or undesirable (bad). This makes KTO applicable to a broader range of feedback signals, including thumbs-up/thumbs-down ratings.

The KTO training process works as follows:

Data preparation: Each training sample consists of a prompt, a response, and a binary preference label (1 for desirable, 0 for undesirable). The kto_custom_collate function constructs KL estimation samples by pairing each prompt with the response from the next sample in the batch, creating mismatched prompt-response pairs.
Reference policy log-probabilities: Before each training step, the reference policy log-probabilities are computed for both the original samples and the KL estimation samples. This is done by temporarily swapping in reference policy weights (or disabling adapters when using PEFT).
Loss computation: The loss uses the KL divergence estimated from the mismatched samples as a reference point. For desirable samples, the loss encourages the reward (log-probability ratio vs. reference) to exceed the KL baseline. For undesirable samples, the loss encourages the reward to fall below the KL baseline.
Asymmetric weighting: Desirable and undesirable losses can be weighted differently via desirable_loss_weight and undesirable_loss_weight, reflecting the asymmetric value function from prospect theory.
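The mismatched pairing described in the data preparation step can be illustrated with a minimal sketch. This is not the actual kto_custom_collate implementation: the function name build_kl_samples and the dictionary keys prompt, response, and preference are hypothetical, and wrapping the last prompt around to the first response is an assumption made so that every prompt gets a partner.

```python
from typing import Dict, List


def build_kl_samples(batch: List[Dict]) -> List[Dict]:
    """Pair each prompt with the response of the NEXT sample in the batch.

    Hypothetical helper for illustration only. The last prompt wraps
    around to the first response (an assumption, not confirmed by the source).
    """
    n = len(batch)
    return [
        {
            "prompt": batch[i]["prompt"],
            # Deliberately mismatched: response comes from a different sample.
            "response": batch[(i + 1) % n]["response"],
        }
        for i in range(n)
    ]


# Toy batch: 1 = desirable, 0 = undesirable.
batch = [
    {"prompt": "P0", "response": "R0", "preference": 1},
    {"prompt": "P1", "response": "R1", "preference": 0},
    {"prompt": "P2", "response": "R2", "preference": 1},
]
kl_samples = build_kl_samples(batch)
for sample in kl_samples:
    print(sample["prompt"], "->", sample["response"])
```

Because the responses are shifted by one position, no prompt is paired with its own response (for batches of more than one sample), which is what makes these pairs usable for estimating the KL reference point rather than the per-sample reward.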

Usage

KTO training is appropriate when:

You have binary feedback data (good/bad labels) rather than paired comparisons.
You want to align a model using thumbs-up/thumbs-down style ratings.
Your feedback data contains a mix of desirable and undesirable examples that are not necessarily paired.
You prefer a method grounded in behavioral economics (prospect theory) that lets you weight desirable and undesirable examples differently.

Theoretical Basis

KTO is inspired by Kahneman and Tversky's Prospect Theory, which models human decision-making under uncertainty. The key insight is that humans evaluate outcomes relative to a reference point, and they are typically more sensitive to losses than to equivalent gains.

In KTO, the reference point is an estimate of the KL divergence between the policy and the reference model, computed from the mismatched prompt-response pairs in the batch:

KL_ref = E_{x,y' ~ mismatch} [ max(0, beta * (log pi(y'|x) - log pi_ref(y'|x))) ]
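As a rough sketch, the KL_ref estimate can be computed from sequence-level log-probabilities of the mismatched pairs. The code follows the formula exactly as written on this page (clamp each term at zero, then average); the function name estimate_kl_ref and the toy numbers are hypothetical, and the inputs are assumed to be already summed over response tokens.

```python
from typing import Sequence


def estimate_kl_ref(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    beta: float,
) -> float:
    """Average of max(0, beta * (log pi(y'|x) - log pi_ref(y'|x))).

    policy_logps / ref_logps are sequence-level log-probabilities of the
    mismatched (x, y') pairs under the policy and the reference model.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy_logps and ref_logps must have equal length")
    terms = [max(0.0, beta * (p - r)) for p, r in zip(policy_logps, ref_logps)]
    return sum(terms) / len(terms)


# Toy values: the second pair has a negative log-ratio and is clamped to 0.
kl_ref = estimate_kl_ref(
    policy_logps=[-10.0, -12.0, -8.0],
    ref_logps=[-11.0, -11.0, -9.0],
    beta=0.1,
)
print(kl_ref)
```

The clamp keeps the reference point non-negative, consistent with KL divergence itself never being negative even though a finite-sample estimate can be.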

For a sample (x, y) with reward r(x,y) = beta * sum_t (log pi(y_t|x,y_{<t}) - log pi_ref(y_t|x,y_{<t})) * mask_t:

Desirable samples (preference = 1): L_des = 1 - sigmoid(r(x,y) - KL_ref)
Undesirable samples (preference = 0): L_und = 1 - sigmoid(KL_ref - r(x,y))

The total loss is:

L = w_des * mean(L_des) + w_und * mean(L_und)

where w_des and w_und are the desirable and undesirable loss weights, respectively, and beta is the ref_policy_kl_penalty parameter.
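The reward and loss formulas above can be sketched in plain Python. This is an illustrative sketch rather than the NeMo Aligner code: the function names sequence_reward and kto_loss are hypothetical, and treating an empty group (no desirable or no undesirable samples in the batch) as contributing zero loss is an assumption.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sequence_reward(
    policy_token_logps: Sequence[float],
    ref_token_logps: Sequence[float],
    mask: Sequence[int],
    beta: float,
) -> float:
    """r(x, y) = beta * sum_t (log pi_t - log pi_ref_t) * mask_t."""
    return beta * sum(
        (p - r) * m for p, r, m in zip(policy_token_logps, ref_token_logps, mask)
    )


def kto_loss(
    rewards: Sequence[float],
    preferences: Sequence[int],
    kl_ref: float,
    w_des: float = 1.0,
    w_und: float = 1.0,
) -> float:
    """L = w_des * mean(L_des) + w_und * mean(L_und)."""
    l_des = [1.0 - sigmoid(r - kl_ref) for r, p in zip(rewards, preferences) if p == 1]
    l_und = [1.0 - sigmoid(kl_ref - r) for r, p in zip(rewards, preferences) if p == 0]

    def mean(xs):
        # Assumption: an empty group contributes zero loss.
        return sum(xs) / len(xs) if xs else 0.0

    return w_des * mean(l_des) + w_und * mean(l_und)


# Mask out the first (prompt) token; only response tokens count.
reward = sequence_reward([-1.0, -2.0, -3.0], [-1.5, -2.5, -3.0], [0, 1, 1], beta=0.5)

# One desirable and one undesirable sample, both sitting exactly on the baseline.
balanced = kto_loss([0.1, 0.1], [1, 0], kl_ref=0.1)
# Doubling w_und makes the undesirable sample count twice as much.
und_heavy = kto_loss([0.1, 0.1], [1, 0], kl_ref=0.1, w_des=1.0, w_und=2.0)
print(reward, balanced, und_heavy)
```

When a reward equals KL_ref, each per-sample loss is 1 - sigmoid(0) = 0.5, so the weights alone determine how much each group contributes; moving a desirable reward above the baseline, or an undesirable reward below it, drives the corresponding term toward zero.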

This formulation pushes the reward of desirable responses above the KL baseline and the reward of undesirable responses below it. In effect, the policy is encouraged to raise the likelihood of desirable responses relative to the reference model and to lower it for undesirable ones. The asymmetric weights let training emphasize either avoiding bad outputs or pursuing good outputs, depending on the application.
