
Principle:NVIDIA NeMo Aligner SFT Data Preparation

From Leeroopedia


Principle: SFT Data Preparation
Type Principle
Project NVIDIA NeMo Aligner
Domains NLP, Data_Engineering
Related Implementations Implementation:NVIDIA_NeMo_Aligner_Build_SFT_Dataset
Last Updated 2026-02-07 00:00 GMT

Overview

The process of constructing supervised fine-tuning (SFT) datasets from instruction-response pairs, with appropriate tokenization and formatting.

Description

Supervised fine-tuning requires converting raw instruction-response data (in JSONL format) into tokenized, padded sequences suitable for autoregressive training. This includes:

  • Dataset class selection -- choosing the appropriate class based on data format (chat, plain, or packed sequences)
  • Answer-only loss masking -- ensuring the model only learns to generate responses, not repeat prompts
  • Special token handling -- inserting BOS, EOS, and role-specific tokens as required by the tokenizer
  • Sequence padding and truncation -- normalizing all examples to a fixed sequence length
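The tokenization, special-token, and padding steps above can be sketched in a few lines. This is a minimal illustration with made-up token IDs and constants, not NeMo-Aligner's actual dataset code; a real pipeline would obtain BOS/EOS/PAD IDs from the tokenizer.

```python
# Minimal sketch of building one SFT example: add special tokens,
# truncate to the maximum length, then right-pad. Token IDs and the
# BOS/EOS/PAD constants are illustrative placeholders.
BOS, EOS, PAD = 1, 2, 0

def build_example(prompt_ids, response_ids, max_len):
    ids = [BOS] + prompt_ids + response_ids + [EOS]
    ids = ids[:max_len]               # truncate overlong sequences
    pad = max_len - len(ids)
    return ids + [PAD] * pad          # right-pad to the fixed length

print(build_example([5, 6], [7, 8, 9], max_len=10))
```

Every example comes out at exactly `max_len` tokens, which is what allows batching into rectangular tensors.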

Packed sequences concatenate multiple short examples into a single training sample for improved GPU utilization. This is particularly important when training on datasets with high variance in example length, where naive padding would waste significant compute.
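A simple way to see the compute saving is a first-fit packing pass over example lengths. The function below is a generic bin-packing sketch for illustration, not NeMo-Aligner's actual packing implementation.

```python
# First-fit packing: place each example into the first bin with enough
# remaining room, opening a new bin of capacity max_len otherwise.
# Short examples share a bin instead of each being padded to max_len.
def pack_examples(lengths, max_len):
    bins = []       # each bin holds a list of example indices
    bin_free = []   # remaining capacity of each bin
    for i, n in enumerate(lengths):
        for b, free in enumerate(bin_free):
            if n <= free:
                bins[b].append(i)
                bin_free[b] -= n
                break
        else:
            bins.append([i])
            bin_free.append(max_len - n)
    return bins

# Five short examples (22 tokens total) fit in 2 bins of 16 tokens,
# instead of occupying 5 padded rows of 16 tokens each.
print(pack_examples([10, 3, 6, 2, 1], max_len=16))
```

With naive padding these five examples would consume 80 token slots; packed, they consume 32, most of them real tokens.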

Usage

Use when preparing training data for supervised fine-tuning of language models. The format choice depends on the data structure:

  • Chat format -- for multi-turn conversational data with system/user/assistant roles
  • Packed sequences -- for throughput optimization when working with short examples
  • Plain format -- for simple prompt-completion pairs without role structure

Each format requires its own dataset class with specific tokenization and masking logic.

Theoretical Basis

The core theoretical concept is answer-only loss masking: the cross-entropy loss is computed only on response tokens, treating prompt tokens as context that the model conditions on but is not penalized for.

For a sequence: [PROMPT_TOKENS] [RESPONSE_TOKENS] [PAD_TOKENS]
Labels become:  [-100 ... -100] [RESPONSE_IDS]  [-100 ... -100]

Here -100 is the ignore_index value that the cross-entropy loss skips, so prompt and padding positions contribute no gradient.
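The label diagram above translates directly into code. The function below is a hedged sketch of answer-only label construction; the name and signature are illustrative, not NeMo-Aligner's API.

```python
# Answer-only label masking: positions covered by the prompt or by
# padding receive ignore_index (-100); response positions keep their
# token IDs so only they contribute to the cross-entropy loss.
IGNORE_INDEX = -100

def make_labels(input_ids, prompt_len, real_len):
    labels = list(input_ids)
    for i in range(len(labels)):
        if i < prompt_len or i >= real_len:   # prompt or padding region
            labels[i] = IGNORE_INDEX
    return labels

# 3 prompt tokens, 2 response tokens, padded to length 8
print(make_labels([11, 12, 13, 21, 22, 0, 0, 0], prompt_len=3, real_len=5))
```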

For packed sequences, additional mechanisms prevent cross-contamination between concatenated examples:

Packed sequence: [Example_A tokens] [Example_B tokens] [Example_C tokens]
Attention mask:  cu_seqlens = [0, len_A, len_A + len_B, len_A + len_B + len_C]

Each example attends only to its own tokens via custom attention masking.
The cu_seqlens (cumulative sequence lengths) define attention boundaries.
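The boundary logic can be made concrete by materializing the mask that cu_seqlens implies. This is a pedagogical sketch: efficient kernels such as FlashAttention's varlen path consume cu_seqlens directly and never build this matrix.

```python
# Derive a block-diagonal causal mask from cu_seqlens: query position q
# may attend to key position k only when both fall in the same packed
# example and k <= q (causality within the block).
def block_causal_mask(cu_seqlens):
    total = cu_seqlens[-1]
    mask = [[False] * total for _ in range(total)]
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        for q in range(start, end):
            for k in range(start, q + 1):
                mask[q][k] = True
    return mask

# Two packed examples of length 2 each: cu_seqlens = [0, 2, 4]
for row in block_causal_mask([0, 2, 4]):
    print([int(x) for x in row])
```

Positions of example B never attend to example A's tokens, which is exactly the cross-contamination the packed format must prevent.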

The dataset returns tokenized tensors with the following structure:

  • input_ids -- the full tokenized sequence
  • labels -- token IDs for response positions, -100 elsewhere
  • attention_mask -- standard causal mask (or cu_seqlens for packed)
  • loss_mask -- binary mask indicating which positions contribute to loss
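The four fields above can be assembled in a `__getitem__`-style helper. The sketch below uses plain Python lists where a real dataset class would return tensors, and its helper logic is an assumption for illustration, not the actual NeMo-Aligner implementation.

```python
# Illustrative assembly of one SFT dataset item with the four fields
# listed above: input_ids, labels, attention_mask, loss_mask.
def sft_item(prompt_ids, response_ids, max_len, pad_id=0):
    ids = (prompt_ids + response_ids)[:max_len]
    real_len = len(ids)
    ids = ids + [pad_id] * (max_len - real_len)   # right-pad
    labels = [
        tok if len(prompt_ids) <= i < real_len else -100
        for i, tok in enumerate(ids)
    ]
    return {
        "input_ids": ids,
        "labels": labels,
        # 1 for real tokens, 0 for padding (unpacked case)
        "attention_mask": [1] * real_len + [0] * (max_len - real_len),
        # 1 exactly where labels are not ignored
        "loss_mask": [1 if l != -100 else 0 for l in labels],
    }

item = sft_item([5, 6, 7], [8, 9], max_len=8)
print(item["loss_mask"])
```

Note that `loss_mask` and `labels` encode the same information two ways; some training loops consume the mask, others the ignore_index.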

Related Pages

Knowledge Sources

NLP | Data_Engineering
