Implementation:NVIDIA NeMo Aligner Build SFT Dataset
| Implementation Metadata | |
|---|---|
| Name | Build_SFT_Dataset |
| Type | API Doc |
| Implements Principle | SFT_Data_Preparation |
| Repository | NeMo Aligner |
| File | nemo_aligner/data/nlp/builders.py |
| Lines | L402-421 |
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by the NeMo Aligner data builders module, for constructing supervised fine-tuning (SFT) datasets from instruction-response JSONL files.
Description
The build_sft_dataset function creates a dataset object for SFT training. It selects the appropriate dataset class based on configuration: GPTSFTChatDataset for conversational data, GPTSFTPackedDataset for packed sequences, or GPTSFTDataset for plain prompt-completion pairs. The function configures tokenization, sequence length limits, answer-only loss masking, and special token handling.
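The class selection described above can be sketched as a small branching helper. This is a minimal illustration, not the NeMo Aligner implementation: the real classes are imported from NeMo, the `packed_sequence` flag is assumed to come from the data config, and the priority order (chat first, then packed) is inferred from the description.

```python
# Hedged sketch of build_sft_dataset's class selection. The class names
# stand in as strings here; in NeMo Aligner they are real dataset classes.
# Assumption: chat formatting takes priority over sequence packing.

def select_dataset_cls(is_chat: bool, packed_sequence: bool) -> str:
    """Pick the dataset class name based on configuration flags."""
    if is_chat:
        return "GPTSFTChatDataset"      # conversational data
    if packed_sequence:
        return "GPTSFTPackedDataset"    # multiple samples packed per sequence
    return "GPTSFTDataset"              # plain prompt-completion pairs
```

For example, `select_dataset_cls(is_chat=False, packed_sequence=True)` resolves to the packed variant.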
Usage
Import this function when setting up data for SFT training. It is called after model loading to create the training dataset, which is then wrapped in a DataLoader.
Code Reference
Source Location
- Repository: NeMo Aligner
- File: nemo_aligner/data/nlp/builders.py
- Lines: L402-421
Signature
def build_sft_dataset(
data_cfg: DictConfig,
tokenizer,
num_samples: int,
answer_only_loss: bool = True,
is_chat: bool = True,
special_tokens: dict = None,
) -> Dataset:
Import
from nemo_aligner.data.nlp.builders import build_sft_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_cfg | DictConfig | Yes | Data configuration with file_path, max_seq_length, min_seq_length, etc. |
| tokenizer | TokenizerSpec | Yes | Tokenizer from the pretrained model |
| num_samples | int | Yes | Number of training samples |
| answer_only_loss | bool | No | Compute loss only on answer tokens (default True) |
| is_chat | bool | No | Use chat-formatted dataset class (default True) |
| special_tokens | dict | No | Special token overrides |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | GPTSFTChatDataset, GPTSFTPackedDataset, or GPTSFTDataset instance |
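The shape of the `data_cfg` input can be illustrated with a plain dictionary. Note that the real function expects an OmegaConf `DictConfig` (typically loaded from a YAML training config); the keys below come from the inputs table, while the values and the exact key set are illustrative assumptions.

```python
# Illustrative shape of data_cfg (keys from the I/O contract above).
# In practice this is an OmegaConf DictConfig, not a plain dict, and the
# values shown here are hypothetical.
train_ds_cfg = {
    "file_path": "/data/sft_train.jsonl",  # instruction-response JSONL file
    "max_seq_length": 4096,                # truncate examples longer than this
    "min_seq_length": 1,                   # drop examples shorter than this
}
```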
Usage Examples
Building a Chat-Format SFT Dataset
from nemo_aligner.data.nlp.builders import build_sft_dataset
# Build chat-format SFT dataset
train_ds = build_sft_dataset(
data_cfg=cfg.model.data.train_ds,
tokenizer=model.tokenizer,
num_samples=cfg.model.data.train_ds.num_samples,
answer_only_loss=True,
is_chat=True,
)
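As noted under Usage, the returned dataset is subsequently wrapped in a DataLoader. The pure-Python sketch below shows only the core idea of that wrapping step, grouping dataset items into fixed-size batches; in practice `torch.utils.data.DataLoader` handles this, along with shuffling, collation, and worker processes.

```python
# Minimal sketch of what DataLoader batching accomplishes: grouping
# dataset items into fixed-size batches (the last batch may be smaller).

def batched(dataset, batch_size):
    """Yield successive batches of up to batch_size items."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

samples = list(range(10))
batches = list(batched(samples, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```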
Related Pages
- Principle:NVIDIA_NeMo_Aligner_SFT_Data_Preparation
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment