Implementation:NVIDIA NeMo Aligner Build SFT Dataset


Implementation Metadata
Name: Build_SFT_Dataset
Type: API Doc
Implements Principle: SFT_Data_Preparation
Repository: NeMo Aligner
File: nemo_aligner/data/nlp/builders.py
Lines: L402-421
Domains: NLP, Data_Engineering
Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the NeMo Aligner data builders module, for constructing supervised fine-tuning (SFT) datasets from instruction-response JSONL files.
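To make the input format concrete, here is a minimal sketch of writing and reading an instruction-response JSONL file using only the standard library. The "input" and "output" key names, the sample records, and the file name train.jsonl are assumptions for illustration; this page does not specify the keys the dataset classes expect, so check them against your data configuration.

```python
import json
import os
import tempfile

# Hypothetical records. The "input"/"output" key names are an assumption,
# not something this page specifies.
records = [
    {"input": "Translate to French: Hello", "output": "Bonjour"},
    {"input": "What is 2 + 2?", "output": "4"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip the records through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
```

A file written this way is the kind of path that data_cfg.file_path would point to, assuming the keys match what the selected dataset class expects.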

Description

The build_sft_dataset function creates a dataset object for SFT training. It selects the appropriate dataset class based on configuration: GPTSFTChatDataset for conversational data, GPTSFTPackedDataset for packed sequences, or GPTSFTDataset for plain prompt-completion pairs. The function configures tokenization, sequence length limits, answer-only loss masking, and special token handling.
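The class selection described above can be sketched as follows. This is an illustration rather than the library's code: the three classes are empty stand-ins so the snippet runs without NeMo installed, the helper name select_dataset_cls is hypothetical, and both the "packed_sequence" configuration key and the order of the checks (chat first, then packed) are assumptions not stated on this page.

```python
# Stand-in classes so the sketch runs without NeMo installed; the real
# classes come from NeMo and accept many more arguments.
class GPTSFTDataset:
    pass


class GPTSFTChatDataset(GPTSFTDataset):
    pass


class GPTSFTPackedDataset(GPTSFTDataset):
    pass


def select_dataset_cls(data_cfg: dict, is_chat: bool = True):
    """Pick a dataset class from the config (hypothetical helper)."""
    # Assumption: packing is toggled by a "packed_sequence" key in data_cfg.
    packed = data_cfg.get("packed_sequence", False)
    # Assumption: the chat flag is checked before the packing flag.
    if is_chat:
        return GPTSFTChatDataset
    if packed:
        return GPTSFTPackedDataset
    # Fallback: plain prompt-completion pairs.
    return GPTSFTDataset


chat_cls = select_dataset_cls({}, is_chat=True)
packed_cls = select_dataset_cls({"packed_sequence": True}, is_chat=False)
plain_cls = select_dataset_cls({}, is_chat=False)
```

Note that is_chat defaults to True in the signature, so callers who want the packed or plain class need to pass is_chat=False explicitly.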

Usage

Import this function when setting up data for SFT training. Call it after the model has been loaded, since the tokenizer comes from the pretrained model, to create the training dataset that is then wrapped in a DataLoader.

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/data/nlp/builders.py
  • Lines: L402-421

Signature

def build_sft_dataset(
    data_cfg: DictConfig,
    tokenizer,
    num_samples: int,
    answer_only_loss: bool = True,
    is_chat: bool = True,
    special_tokens: dict = None,
) -> Dataset:

Import

from nemo_aligner.data.nlp.builders import build_sft_dataset

I/O Contract

Inputs

Name | Type | Required | Description
data_cfg | DictConfig | Yes | Data configuration with file_path, max_seq_length, min_seq_length, etc.
tokenizer | TokenizerSpec | Yes | Tokenizer from the pretrained model
num_samples | int | Yes | Number of training samples
answer_only_loss | bool | No | Compute loss only on answer tokens (default True)
is_chat | bool | No | Use chat-formatted dataset class (default True)
special_tokens | dict | No | Special token overrides (default None)
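The effect of answer_only_loss can be illustrated with a small, framework-free sketch. The helper names build_loss_mask and masked_mean are hypothetical, and the real dataset classes build their masks internally, so details such as token shifting or special-token handling may differ from this simplification.

```python
def build_loss_mask(prompt_len: int, total_len: int, answer_only_loss: bool = True):
    """Return a per-token mask: 1.0 where the token contributes to the loss."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    if answer_only_loss:
        # Prompt tokens are masked out; only answer tokens are trained on.
        return [0.0] * prompt_len + [1.0] * (total_len - prompt_len)
    # Otherwise every token, prompt included, contributes to the loss.
    return [1.0] * total_len


def masked_mean(losses, mask):
    """Average per-token losses over the unmasked positions only."""
    total = sum(mask)
    if total == 0:
        return 0.0
    return sum(l * m for l, m in zip(losses, mask)) / total


# Three prompt tokens followed by two answer tokens.
mask = build_loss_mask(prompt_len=3, total_len=5)
loss = masked_mean([1.0, 1.0, 1.0, 2.0, 4.0], mask)
```

With answer-only masking, the prompt's per-token losses are ignored, so the model is optimized to produce the response rather than to reproduce the instruction.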

Outputs

Name | Type | Description
dataset | Dataset | GPTSFTChatDataset, GPTSFTPackedDataset, or GPTSFTDataset instance

Usage Examples

Building a Chat-Format SFT Dataset

from nemo_aligner.data.nlp.builders import build_sft_dataset

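# Assumes `cfg` (the loaded training config) and `model` (the loaded
# pretrained model) already exist in the calling script.
# Build a chat-format SFT dataset.
train_ds = build_sft_dataset(
    data_cfg=cfg.model.data.train_ds,  # file_path, max_seq_length, etc.
    tokenizer=model.tokenizer,  # tokenizer from the pretrained model
    num_samples=cfg.model.data.train_ds.num_samples,
    answer_only_loss=True,  # compute loss on answer tokens only
    is_chat=True,  # use the chat-formatted dataset class
)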
