Principle:NVIDIA NeMo Aligner KTO Data Preparation

Knowledge Sources	NVIDIA_NeMo_Aligner
Domains	KTO, Data Preprocessing, Preference Learning
Last Updated	2026-02-08 00:00 GMT

Overview

KTO (Kahneman-Tversky Optimization) data preparation converts paired preference data into the binary feedback format required by KTO, where each sample is independently labeled as desirable or undesirable.

Description

KTO is an alignment method that, unlike DPO (Direct Preference Optimization), does not require paired comparisons (chosen vs. rejected for the same prompt). Instead, KTO operates on independently labeled samples, where each (prompt, response) pair is annotated with a binary signal indicating whether the response is "chosen" (desirable) or "rejected" (undesirable).

The KTO data preparation pipeline in NeMo Aligner converts the Anthropic Helpful-Harmless (HH-RLHF) dataset from its native paired preference format into the binary feedback format. Specifically:

Dataset loading: The Anthropic HH-RLHF dataset is downloaded from HuggingFace, containing paired chosen/rejected conversations.
Conversation parsing: Each raw conversation string (with \n\nHuman: and \n\nAssistant: delimiters) is parsed into a structured prompt-response format using Human:\n{body}\nAssistant:\n{response} templates.
Unpacking pairs: Each paired comparison is unpacked into two independent samples, each retaining the shared prompt and its own response:
- The chosen response gets "preference": "chosen"
- The rejected response gets "preference": "rejected"
Output: The samples are saved as JSONL files with train and test splits.
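
The conversation parsing step can be sketched as follows. This is a minimal illustration, not the NeMo Aligner script itself: the helper names split_turns and to_prompt_and_response are hypothetical, and the sketch assumes that the final Assistant turn is the response while all earlier turns (plus a trailing "Assistant:\n" header) form the prompt.

```python
import re

# Matches the HH-RLHF speaker delimiters and captures the speaker name.
TURN_RE = re.compile(r"\n\n(Human|Assistant):")


def split_turns(conversation: str) -> list[tuple[str, str]]:
    """Split a raw conversation string into (speaker, body) turns."""
    # re.split with one capture group yields [prefix, speaker, body, speaker, body, ...].
    parts = TURN_RE.split(conversation)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def to_prompt_and_response(conversation: str) -> tuple[str, str]:
    """Return (prompt, response); the last Assistant turn is the response.

    Assumption: the prompt ends with an open "Assistant:\n" header so that
    prompt + response follows the Human:\n{body}\nAssistant:\n{response} template.
    """
    turns = split_turns(conversation)
    if not turns or turns[-1][0] != "Assistant":
        raise ValueError("conversation must end with an Assistant turn")
    *context, (_, response) = turns
    prompt = "".join(f"{speaker}:\n{body}\n" for speaker, body in context)
    return prompt + "Assistant:\n", response
```

Because the chosen and rejected conversations of a pair share every turn except the last, applying to_prompt_and_response to both should produce the same prompt and two different responses.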

This transformation is the key distinction from DPO data preparation: while DPO needs (prompt, chosen, rejected) tuples, KTO needs individual (prompt, response, preference_label) samples.
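
The difference between the two formats can be illustrated with a short sketch. It assumes the record keys "prompt" and "response", which are chosen here for illustration; only the "preference" key and its "chosen"/"rejected" values come from the description above. The function names unpack_pair and to_jsonl are hypothetical.

```python
import json


def unpack_pair(prompt: str, chosen: str, rejected: str) -> list[dict]:
    """Turn one DPO-style (prompt, chosen, rejected) tuple into two KTO samples."""
    return [
        {"prompt": prompt, "response": chosen, "preference": "chosen"},
        {"prompt": prompt, "response": rejected, "preference": "rejected"},
    ]


def to_jsonl(samples: list[dict]) -> str:
    """Serialize samples as JSONL: one JSON object per line."""
    return "".join(json.dumps(sample, ensure_ascii=False) + "\n" for sample in samples)
```

Writing the returned string to one file per split (for example, a train file and a test file) would yield output in the shape described above; the actual file names used by the pipeline are not specified here.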

Usage

Use KTO data preparation when:

You are training a model using the KTO alignment algorithm
You need to convert paired preference data into binary feedback format
You want to use the Anthropic HH dataset with KTO training

Theoretical Basis

KTO is based on Kahneman-Tversky prospect theory from behavioral economics. The key insights are:

Loss aversion: Humans weigh losses more heavily than equivalent gains. KTO incorporates this asymmetry by treating desirable and undesirable examples differently in the loss function.
Binary feedback sufficiency: Unlike DPO which requires explicit pairwise comparisons, KTO can learn from independent binary signals (good/bad) about individual responses. This is practically advantageous because binary feedback is often easier and cheaper to collect than pairwise comparisons.
Unpaired evaluation: Each sample is evaluated independently of any counterpart response, so the training data does not need to maintain the pairing structure between chosen and rejected responses for the same prompt.
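
To make the asymmetry concrete, here is a rough per-sample sketch modeled on the loss as formulated in the KTO paper, not on NeMo Aligner's training code. All names are illustrative assumptions: log_ratio stands for log π_θ(y|x) − log π_ref(y|x), reference_point for the batch-level KL estimate, and lambda_d / lambda_u for the weights on desirable and undesirable samples.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_sample_loss(
    log_ratio: float,
    reference_point: float,
    desirable: bool,
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    """Per-sample loss sketch: lambda_y minus the value of the sample.

    Desirable samples are rewarded for a log-ratio above the reference point,
    undesirable samples for a log-ratio below it. Setting lambda_u > lambda_d
    weighs undesirable samples more heavily (one way to express loss aversion).
    """
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - reference_point)))
    return lambda_u * (1.0 - sigmoid(beta * (reference_point - log_ratio)))
```

Note that the function takes a single labeled sample, which is why the unpacked (prompt, response, preference) format is sufficient: no counterpart response is needed to compute the loss term.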

The data preparation step is critical because it transforms the commonly available paired preference format (as used in RLHF and DPO) into the unpacked binary format that KTO expects. Each original comparison pair yields two training samples, so N pairs become 2N samples, and the supervision signal changes from comparative (which response is better) to absolute (whether a response is desirable).

Related Pages

Implementation:NVIDIA_NeMo_Aligner_Preprocess_AnthropicHH_Data
