Principle: NVIDIA NeMo Aligner SFT Data Preparation
| Principle: SFT Data Preparation | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Data_Engineering |
| Related Implementations | Implementation:NVIDIA_NeMo_Aligner_Build_SFT_Dataset |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Process of constructing supervised fine-tuning datasets from instruction-response pairs with appropriate tokenization and formatting.
Description
Supervised fine-tuning requires converting raw instruction-response data (in JSONL format) into tokenized, padded sequences suitable for autoregressive training. This includes:
- Dataset class selection -- choosing the appropriate class based on data format (chat, plain, or packed sequences)
- Answer-only loss masking -- ensuring the model only learns to generate responses, not repeat prompts
- Special token handling -- inserting BOS, EOS, and role-specific tokens as required by the tokenizer
- Sequence padding and truncation -- normalizing all examples to a fixed sequence length
Packed sequences concatenate multiple short examples into a single training sample for improved GPU utilization. This is particularly important when training on datasets with high variance in example length, where naive padding would waste significant compute.
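The packing idea can be sketched as a simple greedy first-fit grouping of examples by token length. This is an illustrative helper, not the NeMo Aligner implementation; the name `pack_examples` and the first-fit strategy are assumptions for demonstration.

```python
def pack_examples(lengths, max_seq_len):
    """Greedy first-fit packing: group example indices so that each
    pack's total token count stays within max_seq_len.

    Illustrative sketch only -- real packing pipelines typically also
    sort by length or use better bin-packing heuristics.
    """
    packs, sizes = [], []
    for idx, n in enumerate(lengths):
        for p, used in enumerate(sizes):
            if used + n <= max_seq_len:
                packs[p].append(idx)  # example fits in an existing pack
                sizes[p] += n
                break
        else:
            packs.append([idx])       # open a new pack
            sizes.append(n)
    return packs
```

With `max_seq_len=512` and example lengths `[100, 300, 150, 50]`, naive padding would spend 4 x 512 token slots; first-fit packing needs only two packs, which is where the GPU-utilization win comes from.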
Usage
Use when preparing training data for supervised fine-tuning of language models. The format choice depends on the data structure:
- Chat format -- for multi-turn conversational data with system/user/assistant roles
- Packed sequences -- for throughput optimization when working with short examples
- Plain format -- for simple prompt-completion pairs without role structure
Each format requires its own dataset class with specific tokenization and masking logic.
Theoretical Basis
The core theoretical concept is answer-only loss masking: the cross-entropy loss is computed only on response tokens, treating prompt tokens as context that the model conditions on but is not penalized for.
For a sequence: [PROMPT_TOKENS] [RESPONSE_TOKENS] [PAD_TOKENS]
Labels become: [-100 ... -100] [RESPONSE_IDS] [-100 ... -100]
Where -100 is the ignore_index for cross-entropy loss.
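The label layout above can be made concrete with a short sketch. The function name `build_labels` and the padding convention are illustrative assumptions; `-100` is the standard `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy

def build_labels(prompt_ids, response_ids, max_len, pad_id=0):
    """Answer-only loss masking: labels carry response token IDs only;
    prompt and pad positions are set to IGNORE_INDEX.

    Illustrative sketch, not the NeMo Aligner dataset code.
    """
    input_ids = list(prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    pad = max_len - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    return input_ids, labels
```

For prompt `[5, 6]` and response `[7, 8]` padded to length 6, this yields `input_ids = [5, 6, 7, 8, 0, 0]` and `labels = [-100, -100, 7, 8, -100, -100]`, matching the diagram above.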
For packed sequences, additional mechanisms prevent cross-contamination between concatenated examples:
Packed sequence: [Example_A tokens] [Example_B tokens] [Example_C tokens]
Attention mask: cu_seqlens = [0, len_A, len_A + len_B, len_A + len_B + len_C]
Each example attends only to its own tokens via custom attention masking.
The cu_seqlens (cumulative sequence lengths) define attention boundaries.
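Computing `cu_seqlens` and the block-diagonal causal mask it implies can be sketched as follows; the helper names are hypothetical, and production kernels (e.g. FlashAttention's varlen path) consume `cu_seqlens` directly rather than materializing a dense mask.

```python
from itertools import accumulate

def cu_seqlens(lengths):
    """Cumulative sequence lengths: attention boundaries for a packed
    sample, e.g. [3, 4, 2] -> [0, 3, 7, 9]."""
    return [0] + list(accumulate(lengths))

def block_diagonal_causal_mask(cs):
    """Dense boolean mask implied by cu_seqlens: position i may attend
    to position j only if both lie in the same packed example and
    j <= i (causal). Illustrative only -- fused kernels avoid this."""
    n = cs[-1]
    mask = [[False] * n for _ in range(n)]
    for start, end in zip(cs, cs[1:]):
        for i in range(start, end):
            for j in range(start, i + 1):
                mask[i][j] = True
    return mask
```

The mask is block-diagonal and lower-triangular within each block, so Example_B never attends to Example_A's tokens even though they share one training sample.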
The dataset returns tokenized tensors with the following structure:
- input_ids -- the full tokenized sequence
- labels -- token IDs for response positions, -100 elsewhere
- attention_mask -- standard causal mask (or cu_seqlens for packed)
- loss_mask -- binary mask indicating which positions contribute to loss
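Putting the pieces together, a minimal `__getitem__`-style function producing all four fields might look like the sketch below (plain lists stand in for tensors; the function name and padding defaults are assumptions, not the NeMo Aligner API):

```python
IGNORE_INDEX = -100  # ignore_index for cross-entropy loss

def make_sft_example(prompt_ids, response_ids, max_len, pad_id=0):
    """Build one SFT training example with the four fields described
    above: input_ids, labels, attention_mask, loss_mask.

    Minimal illustrative sketch of the unpacked (non-packed) case.
    """
    seq = list(prompt_ids + response_ids)[:max_len]
    n_real = len(seq)
    n_prompt = min(len(prompt_ids), max_len)

    input_ids = seq + [pad_id] * (max_len - n_real)
    labels = [IGNORE_INDEX] * n_prompt + list(response_ids)[: max_len - n_prompt]
    labels += [IGNORE_INDEX] * (max_len - len(labels))
    attention_mask = [1] * n_real + [0] * (max_len - n_real)  # 1 = real token
    loss_mask = [0 if t == IGNORE_INDEX else 1 for t in labels]

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "loss_mask": loss_mask,
    }
```

Note that `loss_mask` is fully determined by `labels` here; some codebases keep both so the loss computation never needs to compare against the ignore index.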