Principle:OpenRLHF OpenRLHF KTO Training
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 10:40 GMT |
Overview
Alignment technique that trains language models using unpaired preference data based on Kahneman-Tversky prospect theory.
Description
Kahneman-Tversky Optimization (KTO) is a human alignment method that operates on unpaired preference data, where each sample is independently labeled as desirable or undesirable. Unlike DPO, which requires paired (chosen, rejected) examples, KTO builds on the insight from prospect theory that humans weigh losses more heavily than equivalent gains (loss aversion). The loss function uses separate weights for desirable and undesirable samples, so undesirable outputs can be penalized more heavily than desirable outputs are rewarded. A KL divergence term, estimated from unmatched prompt-response pairs, serves as the reference point for the reward and regularizes the policy to stay close to the reference model.
Usage
Use KTO training when you have preference data that is not naturally paired. This is common when feedback is collected independently (e.g., thumbs up/down on individual responses) rather than as side-by-side comparisons. KTO is a simpler alternative to DPO when paired data is unavailable, while still achieving competitive alignment quality.
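To make the data shapes concrete, the following minimal sketch shows unpaired samples next to a paired record, and how a paired record could be split into two independently labeled samples. The field names (`prompt`, `response`, `label`, `chosen`, `rejected`) and the helper `unpair_preferences` are hypothetical choices for illustration, not a schema required by any particular library.

```python
from typing import Dict, List


def unpair_preferences(paired: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Split each (prompt, chosen, rejected) record into two labeled samples."""
    unpaired: List[Dict[str, object]] = []
    for record in paired:
        # The chosen response becomes a desirable sample.
        unpaired.append(
            {"prompt": record["prompt"], "response": record["chosen"], "label": True}
        )
        # The rejected response becomes an undesirable sample.
        unpaired.append(
            {"prompt": record["prompt"], "response": record["rejected"], "label": False}
        )
    return unpaired


# Feedback collected independently (e.g., thumbs up/down) already has this shape.
thumbs_data = [
    {"prompt": "Name a prime number.", "response": "7", "label": True},
]

# Paired data, as used for DPO, can be flattened into the same shape.
paired_data = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]

kto_data = thumbs_data + unpair_preferences(paired_data)
```

After this step every sample carries its own label, so desirable and undesirable examples no longer need to share a prompt or appear in equal numbers.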
Theoretical Basis
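The KTO loss decomposes into desirable and undesirable terms:

$$
\mathcal{L}_{\text{KTO}} = \sum_{(x, y) \in \text{desirable}} w_d \left( 1 - \sigma\big( \beta \, ( r_\theta(x, y) - z_0 ) \big) \right) + \sum_{(x, y) \in \text{undesirable}} w_u \left( 1 - \sigma\big( \beta \, ( z_0 - r_\theta(x, y) ) \big) \right)
$$

Where:
- $r_\theta(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$ is the implicit reward
- $z_0$ is the KL divergence estimated from unmatched pairs
- $\beta$ controls regularization strength
- $w_d$ and $w_u$ are loss weights for desirable and undesirable samples
- $\sigma$ is the logistic sigmoid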
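The KL term is estimated from unmatched prompt-response pairs, meaning prompts paired with responses that were written for a different prompt. One simple way to build such pairs, assumed here for illustration, is to rotate the responses within a batch by one position. The helper name `make_unmatched_pairs` is hypothetical, and an actual implementation may construct the pairs differently.

```python
from typing import List, Tuple


def make_unmatched_pairs(
    prompts: List[str], responses: List[str]
) -> List[Tuple[str, str]]:
    """Pair each prompt with the response of the next sample, wrapping around."""
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    n = len(prompts)
    # Index (i + 1) % n shifts responses by one so prompt i gets response i + 1.
    return [(prompts[i], responses[(i + 1) % n]) for i in range(n)]


prompts = ["Q1", "Q2", "Q3"]
responses = ["A1", "A2", "A3"]
unmatched = make_unmatched_pairs(prompts, responses)
# unmatched == [("Q1", "A2"), ("Q2", "A3"), ("Q3", "A1")]
```

With a batch of size one the rotation returns the original pair, so this scheme only produces truly unmatched pairs when the batch holds at least two samples.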
Pseudo-code Logic:
# Abstract algorithm (NOT actual implementation)
policy_logps = compute_logprobs(policy_model, inputs)
ref_logps = compute_logprobs(ref_model, inputs)
kl_estimate = compute_kl_from_unmatched_pairs(policy_logps, ref_logps)
loss = 0
for sample in batch:
    reward = policy_logps[sample] - ref_logps[sample]
    if sample.label == desirable:
        loss += w_d * (1 - sigmoid(beta * (reward - kl_estimate)))
    else:
        loss += w_u * (1 - sigmoid(beta * (kl_estimate - reward)))
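The abstract algorithm above can be turned into a small runnable sketch in plain Python. It takes precomputed sequence log-probabilities as lists of floats rather than running models. Several details are assumptions made for illustration and are not confirmed by the pseudo-code: the KL estimate is the mean log-ratio over the unmatched pairs clamped at zero, the loss is averaged over the batch, and the default `beta=0.1` and unit weights are placeholder values.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def kto_loss(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    labels: Sequence[bool],
    kl_policy_logps: Sequence[float],
    kl_ref_logps: Sequence[float],
    beta: float = 0.1,  # assumed default, for illustration only
    w_d: float = 1.0,   # weight for desirable samples
    w_u: float = 1.0,   # weight for undesirable samples
) -> float:
    """Average KTO loss over a batch of independently labeled samples."""
    # KL estimate: mean policy/reference log-ratio on unmatched pairs,
    # clamped at zero (assumption) so the reference point is non-negative.
    log_ratios = [p - r for p, r in zip(kl_policy_logps, kl_ref_logps)]
    kl_estimate = max(0.0, sum(log_ratios) / len(log_ratios))

    total = 0.0
    for p, r, desirable in zip(policy_logps, ref_logps, labels):
        reward = p - r  # implicit reward: policy vs. reference log-ratio
        if desirable:
            total += w_d * (1.0 - sigmoid(beta * (reward - kl_estimate)))
        else:
            total += w_u * (1.0 - sigmoid(beta * (kl_estimate - reward)))
    return total / len(policy_logps)


# When policy and reference agree everywhere, every term is 1 - sigmoid(0) = 0.5.
baseline = kto_loss([-1.0, -2.0], [-1.0, -2.0], [True, False], [-3.0], [-3.0])
```

Raising the policy log-probability of a desirable sample above the reference lowers its term below 0.5, while doing the same for an undesirable sample raises its term above 0.5. In a training setting the KL estimate would typically be treated as a constant with no gradient flowing through it, which this scalar sketch does not model.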