Implementation:NVIDIA NeMo Aligner Build SFT Dataset

Implementation Metadata
Name	Build_SFT_Dataset
Type	API Doc
Implements Principle	SFT_Data_Preparation
Repository	NeMo Aligner
File	nemo_aligner/data/nlp/builders.py
Lines	L402-421
Domains	NLP, Data_Engineering
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the NeMo Aligner data builders module, for constructing supervised fine-tuning (SFT) datasets from instruction-response JSONL files.

Description

The build_sft_dataset function creates a dataset object for SFT training. It selects the appropriate dataset class based on configuration: GPTSFTChatDataset for conversational data, GPTSFTPackedDataset for packed sequences, or GPTSFTDataset for plain prompt-completion pairs. The function configures tokenization, sequence length limits, answer-only loss masking, and special token handling.
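
The class selection described above can be sketched as follows. This is a minimal illustration, not the code from builders.py: the three classes are empty stand-ins for the real NeMo dataset classes, select_dataset_cls is a hypothetical helper, and both the packed_sequence config key and the precedence of is_chat over packing are assumptions.

```python
# Empty stand-ins for the real NeMo dataset classes (illustration only).
class GPTSFTDataset:
    pass


class GPTSFTChatDataset:
    pass


class GPTSFTPackedDataset:
    pass


def select_dataset_cls(data_cfg: dict, is_chat: bool = True) -> type:
    """Pick a dataset class from the config and the is_chat flag.

    Hypothetical helper: the `packed_sequence` key and the order of the
    checks are assumptions, not confirmed by the source.
    """
    packed = bool(data_cfg.get("packed_sequence", False))
    if is_chat:
        return GPTSFTChatDataset      # conversational data
    if packed:
        return GPTSFTPackedDataset    # packed sequences
    return GPTSFTDataset              # plain prompt-completion pairs


print(select_dataset_cls({}, is_chat=True).__name__)
print(select_dataset_cls({"packed_sequence": True}, is_chat=False).__name__)
print(select_dataset_cls({}, is_chat=False).__name__)
```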

Usage

Import this function when setting up data for SFT training. Call it after the model has been loaded, since it takes the pretrained model's tokenizer, to create the training dataset that is then wrapped in a DataLoader.

Code Reference

Source Location

Repository: NeMo Aligner
File: nemo_aligner/data/nlp/builders.py
Lines: L402-421

Signature

def build_sft_dataset(
    data_cfg: DictConfig,
    tokenizer,
    num_samples: int,
    answer_only_loss: bool = True,
    is_chat: bool = True,
    special_tokens: dict = None,
) -> Dataset:

Import

from nemo_aligner.data.nlp.builders import build_sft_dataset

I/O Contract

Inputs

Name	Type	Required	Description
data_cfg	DictConfig	Yes	Data configuration with file_path, max_seq_length, min_seq_length, etc.
tokenizer	TokenizerSpec	Yes	Tokenizer from the pretrained model
num_samples	int	Yes	Number of training samples
answer_only_loss	bool	No	Compute loss only on answer tokens (default True)
is_chat	bool	No	Use chat-formatted dataset class (default True)
special_tokens	dict	No	Special token overrides
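
To make the data_cfg input more concrete, the sketch below writes a tiny instruction-response JSONL file and builds a configuration pointing at it. Several things here are assumptions for illustration: the "input" and "output" field names, the sequence length values, the helper names, and the use of a plain dict in place of a DictConfig. Only the three key names come from the table above.

```python
import json
import os
import tempfile

# Keys named in the data_cfg description above; whether each one is strictly
# required by build_sft_dataset is an assumption.
EXPECTED_KEYS = ("file_path", "max_seq_length", "min_seq_length")


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def missing_keys(data_cfg):
    """Return the expected keys that are absent from the config."""
    return [key for key in EXPECTED_KEYS if key not in data_cfg]


# Hypothetical records; the "input"/"output" field names are assumed.
records = [
    {"input": "Translate to French: Hello", "output": "Bonjour"},
    {"input": "What is 2 + 2?", "output": "4"},
]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(records, path)

# Plain dict standing in for a DictConfig; the values are illustrative.
data_cfg = {"file_path": path, "max_seq_length": 2048, "min_seq_length": 1}
print(missing_keys(data_cfg))
```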

Outputs

Name	Type	Description
dataset	Dataset	GPTSFTChatDataset, GPTSFTPackedDataset, or GPTSFTDataset instance

Usage Examples

Building a Chat-Format SFT Dataset

from nemo_aligner.data.nlp.builders import build_sft_dataset

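# Build a chat-format SFT dataset. `cfg` (the training config) and `model`
# (the loaded pretrained model) are assumed to be defined by the caller.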
train_ds = build_sft_dataset(
    data_cfg=cfg.model.data.train_ds,
    tokenizer=model.tokenizer,
    num_samples=cfg.model.data.train_ds.num_samples,
    answer_only_loss=True,
    is_chat=True,
)
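
The answer_only_loss flag restricts the training loss to answer tokens. The sketch below illustrates the idea for a single prompt-completion pair. It is not the NeMo implementation, which also has to handle multi-turn chats, padding, and truncation; answer_only_loss_mask is a hypothetical helper name.

```python
from typing import List


def answer_only_loss_mask(
    prompt_len: int, total_len: int, answer_only_loss: bool = True
) -> List[int]:
    """Return a per-token mask: 1 where the loss is computed, 0 elsewhere.

    Hypothetical helper that assumes the sequence is the prompt tokens
    followed directly by the answer tokens.
    """
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    if not answer_only_loss:
        # Loss on every token, prompt included.
        return [1] * total_len
    # Mask out the prompt, keep the answer.
    return [0] * prompt_len + [1] * (total_len - prompt_len)


# 3 prompt tokens followed by 2 answer tokens.
print(answer_only_loss_mask(prompt_len=3, total_len=5))
```

With answer_only_loss=False the same helper returns a mask of all ones, so prompt tokens also contribute to the loss.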

