Principle: NVIDIA NeMo Aligner SFT Data Preparation
| Principle: SFT Data Preparation | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Data_Engineering |
| Related Implementations | Implementation:NVIDIA_NeMo_Aligner_Build_SFT_Dataset |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Process of constructing supervised fine-tuning datasets from instruction-response pairs with appropriate tokenization and formatting.
Description
Supervised fine-tuning requires converting raw instruction-response data (in JSONL format) into tokenized, padded sequences suitable for autoregressive training. This includes:
- Dataset class selection -- choosing the appropriate class based on data format (chat, plain, or packed sequences)
- Answer-only loss masking -- ensuring the model only learns to generate responses, not repeat prompts
- Special token handling -- inserting BOS, EOS, and role-specific tokens as required by the tokenizer
- Sequence padding and truncation -- normalizing all examples to a fixed sequence length
Packed sequences concatenate multiple short examples into a single training sample for improved GPU utilization. This is particularly important when training on datasets with high variance in example length, where naive padding would waste significant compute.
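The packing idea can be sketched as a simple greedy first-fit grouping of examples by token length. This is an illustrative helper, not the NeMo Aligner implementation; the name `pack_examples` and the first-fit strategy are assumptions for demonstration.

```python
def pack_examples(lengths, max_seq_len):
    """Greedy first-fit packing: group example indices so that each
    pack's total token count stays within max_seq_len.

    Illustrative sketch only -- real packing pipelines typically also
    sort by length or use better bin-packing heuristics.
    """
    packs, sizes = [], []
    for idx, n in enumerate(lengths):
        for p, used in enumerate(sizes):
            if used + n <= max_seq_len:
                packs[p].append(idx)  # example fits in an existing pack
                sizes[p] += n
                break
        else:
            packs.append([idx])       # open a new pack
            sizes.append(n)
    return packs
```

With `max_seq_len=512` and example lengths `[100, 300, 150, 50]`, naive padding would spend 4 x 512 token slots; first-fit packing needs only two packs, which is where the GPU-utilization win comes from.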
Usage
Use when preparing training data for supervised fine-tuning of language models. The format choice depends on the data structure:
- Chat format -- for multi-turn conversational data with system/user/assistant roles
- Packed sequences -- for throughput optimization when working with short examples
- Plain format -- for simple prompt-completion pairs without role structure
Each format requires its own dataset class with specific tokenization and masking logic.
Theoretical Basis
The core theoretical concept is answer-only loss masking: the cross-entropy loss is computed only on response tokens, treating prompt tokens as context that the model conditions on but is not penalized for.
For a sequence: [PROMPT_TOKENS] [RESPONSE_TOKENS] [PAD_TOKENS]
Labels become: [-100 ... -100] [RESPONSE_IDS] [-100 ... -100]
Where -100 is the ignore_index for cross-entropy loss.
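The label layout above can be made concrete with a short sketch. The function name `build_labels` and the padding convention are illustrative assumptions; `-100` is the standard `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy

def build_labels(prompt_ids, response_ids, max_len, pad_id=0):
    """Answer-only loss masking: labels carry response token IDs only;
    prompt and pad positions are set to IGNORE_INDEX.

    Illustrative sketch, not the NeMo Aligner dataset code.
    """
    input_ids = list(prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    pad = max_len - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    return input_ids, labels
```

For prompt `[5, 6]` and response `[7, 8]` padded to length 6, this yields `input_ids = [5, 6, 7, 8, 0, 0]` and `labels = [-100, -100, 7, 8, -100, -100]`, matching the diagram above.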
For packed sequences, additional mechanisms prevent cross-contamination between concatenated examples:
Packed sequence: [Example_A tokens] [Example_B tokens] [Example_C tokens]
Attention mask: cu_seqlens = [0, len_A, len_A + len_B, len_A + len_B + len_C]
Each example attends only to its own tokens via custom attention masking.
The cu_seqlens (cumulative sequence lengths) define attention boundaries.
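Computing `cu_seqlens` and the block-diagonal causal mask it implies can be sketched as follows; the helper names are hypothetical, and production kernels (e.g. FlashAttention's varlen path) consume `cu_seqlens` directly rather than materializing a dense mask.

```python
from itertools import accumulate

def cu_seqlens(lengths):
    """Cumulative sequence lengths: attention boundaries for a packed
    sample, e.g. [3, 4, 2] -> [0, 3, 7, 9]."""
    return [0] + list(accumulate(lengths))

def block_diagonal_causal_mask(cs):
    """Dense boolean mask implied by cu_seqlens: position i may attend
    to position j only if both lie in the same packed example and
    j <= i (causal). Illustrative only -- fused kernels avoid this."""
    n = cs[-1]
    mask = [[False] * n for _ in range(n)]
    for start, end in zip(cs, cs[1:]):
        for i in range(start, end):
            for j in range(start, i + 1):
                mask[i][j] = True
    return mask
```

The mask is block-diagonal and lower-triangular within each block, so Example_B never attends to Example_A's tokens even though they share one training sample.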
The dataset returns tokenized tensors with the following structure:
- input_ids -- the full tokenized sequence
- labels -- token IDs for response positions, -100 elsewhere
- attention_mask -- standard causal mask (or cu_seqlens for packed)
- loss_mask -- binary mask indicating which positions contribute to loss
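Putting the pieces together, a minimal `__getitem__`-style function producing all four fields might look like the sketch below (plain lists stand in for tensors; the function name and padding defaults are assumptions, not the NeMo Aligner API):

```python
IGNORE_INDEX = -100  # ignore_index for cross-entropy loss

def make_sft_example(prompt_ids, response_ids, max_len, pad_id=0):
    """Build one SFT training example with the four fields described
    above: input_ids, labels, attention_mask, loss_mask.

    Minimal illustrative sketch of the unpacked (non-packed) case.
    """
    seq = list(prompt_ids + response_ids)[:max_len]
    n_real = len(seq)
    n_prompt = min(len(prompt_ids), max_len)

    input_ids = seq + [pad_id] * (max_len - n_real)
    labels = [IGNORE_INDEX] * n_prompt + list(response_ids)[: max_len - n_prompt]
    labels += [IGNORE_INDEX] * (max_len - len(labels))
    attention_mask = [1] * n_real + [0] * (max_len - n_real)  # 1 = real token
    loss_mask = [0 if t == IGNORE_INDEX else 1 for t in labels]

    return {
        "input_ids": input_ids,
        "labels": labels,
        "attention_mask": attention_mask,
        "loss_mask": loss_mask,
    }
```

Note that `loss_mask` is fully determined by `labels` here; some codebases keep both so the loss computation never needs to compare against the ignore index.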