Implementation:NVIDIA NeMo Aligner Preprocess HelpSteer Data

Knowledge Sources	NVIDIA_NeMo_Aligner
Domains	SteerLM, Data Preprocessing, HelpSteer
Last Updated	2026-02-08 00:00 GMT

Overview

A script that downloads the NVIDIA HelpSteer dataset from HuggingFace and converts it into the attribute-conditioned SFT training format used by SteerLM.

Description

preprocess_helpsteer_data.py automates the download and conversion of the nvidia/HelpSteer dataset from HuggingFace into the NeMo Aligner training format. The pipeline has three stages:

Dataset download: Fetches the nvidia/HelpSteer dataset from HuggingFace, including both train and validation splits.
Format conversion: Each data point is converted into a conversation object with:
- A system prompt ("A chat between a curious user and an artificial intelligence assistant...")
- A conversations array with a User turn (prompt) and an Assistant turn (response)
- The Assistant turn's label field containing the 5 HelpSteer attribute scores: helpfulness, correctness, coherence, complexity, verbosity (e.g., helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1)
- A mask field set to "User" (indicating which role to mask during training)
- A type field set to "VALUE_TO_TEXT" (indicating the training task type)
Output writing: Saves train.jsonl and val.jsonl to the specified output directory.
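
The format conversion step can be sketched as follows. This is a minimal illustration rather than the script's actual code: the constant names ATTRIBUTES and SYSTEM_PROMPT and the helper to_conversation are assumptions made for this example, the system prompt is left truncated exactly as it appears above, and the sample row is made up.

```python
# Assumed stand-ins for constants the real script defines or imports elsewhere.
ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]
# Truncated on purpose: the full prompt text is not reproduced on this page.
SYSTEM_PROMPT = "A chat between a curious user and an artificial intelligence assistant..."


def format_label(dp):
    """Join the attribute scores present in `dp` as comma-separated 'name:score' pairs."""
    return ",".join(f"{attr}:{dp[attr]}" for attr in ATTRIBUTES if attr in dp)


def to_conversation(dp):
    """Convert one HelpSteer-style row into the conversation object described above.

    `to_conversation` is a hypothetical helper name used only in this sketch.
    """
    return {
        "system": SYSTEM_PROMPT,
        "conversations": [
            {"value": dp["prompt"], "from": "User", "label": None},
            {"value": dp["response"], "from": "Assistant", "label": format_label(dp)},
        ],
        "mask": "User",
        "type": "VALUE_TO_TEXT",
    }


# Made-up row with the same field names as a HelpSteer data point.
sample = {
    "prompt": "What is 2 + 2?",
    "response": "4",
    "helpfulness": 3,
    "correctness": 4,
    "coherence": 3,
    "complexity": 2,
    "verbosity": 1,
}
record = to_conversation(sample)
print(record["conversations"][1]["label"])
# helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1
```

Only the Assistant turn carries a label; the User turn's label stays None, which serializes to null in the JSONL output.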

Usage

Use this script when:

- You need to prepare the HelpSteer dataset for SteerLM reward model training.
- You want to create attribute-conditioned SFT training data from HelpSteer.
- You are setting up the SteerLM training pipeline.

Code Reference

Source Location

Repository: NVIDIA_NeMo_Aligner
File: examples/nlp/data/steerlm/preprocess_helpsteer_data.py
Lines: 1-82

Signature

download_helpsteer:

def download_helpsteer():

format_label:

def format_label(dp):

process_dataset:

def process_dataset(data):

main:

def main(output_dir):

Import

from preprocess_helpsteer_data import download_helpsteer, format_label, process_dataset

I/O Contract

Inputs

Name	Type	Required	Description
-dir / --output_directory	`str`	Yes	Output folder for train.jsonl and val.jsonl; created if it does not exist

Outputs

Name	Type	Description
train.jsonl	JSONL file	Training split in attribute-conditioned SFT format with HelpSteer attribute labels
val.jsonl	JSONL file	Validation split in attribute-conditioned SFT format with HelpSteer attribute labels
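
How the output files might be written can be sketched as below. The helper name write_jsonl, the temporary directory, and the placeholder records are assumptions for illustration; the script's own writing code may differ, but the behavior shown (create the directory if missing, one JSON object per line) matches the contract described in the tables above.

```python
import json
import os
import tempfile


def write_jsonl(records, output_dir, filename):
    """Write one JSON object per line, creating `output_dir` if it does not exist.

    `write_jsonl` is a hypothetical helper used only in this sketch.
    """
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Placeholder records; real records have the fields documented on this page.
train_records = [{"mask": "User", "type": "VALUE_TO_TEXT"}]
val_records = [{"mask": "User", "type": "VALUE_TO_TEXT"}]

# A temporary directory stands in for the value passed via --output_directory.
output_dir = os.path.join(tempfile.mkdtemp(), "helpsteer_processed")
train_path = write_jsonl(train_records, output_dir, "train.jsonl")
val_path = write_jsonl(val_records, output_dir, "val.jsonl")
print(sorted(os.listdir(output_dir)))
# ['train.jsonl', 'val.jsonl']
```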

Each output line is a JSON object with the following structure:

{
  "system": "A chat between a curious user and an artificial intelligence assistant...",
  "conversations": [
    {"value": "<prompt>", "from": "User", "label": null},
    {"value": "<response>", "from": "Assistant", "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"}
  ],
  "mask": "User",
  "type": "VALUE_TO_TEXT"
}
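
A downstream consumer can recover the attribute scores from a record by splitting the label string. The sketch below assumes integer scores, as in the example label above; parse_label is a hypothetical helper and is not part of the script.

```python
import json


def parse_label(label):
    """Turn 'helpfulness:3,correctness:4' into {'helpfulness': 3, 'correctness': 4}."""
    scores = {}
    for pair in label.split(","):
        name, value = pair.split(":")
        scores[name] = int(value)  # assumes integer scores, as in the example above
    return scores


# One output line shaped like the structure above, with placeholder text values.
line = json.dumps({
    "system": "A chat between a curious user and an artificial intelligence assistant...",
    "conversations": [
        {"value": "<prompt>", "from": "User", "label": None},
        {"value": "<response>", "from": "Assistant",
         "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"},
    ],
    "mask": "User",
    "type": "VALUE_TO_TEXT",
})

record = json.loads(line)
# Pick the first turn that carries a label (the Assistant turn in this format).
assistant_turn = next(t for t in record["conversations"] if t["label"] is not None)
scores = parse_label(assistant_turn["label"])
print(scores)
# {'helpfulness': 3, 'correctness': 4, 'coherence': 3, 'complexity': 2, 'verbosity': 1}
```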

Usage Examples

# Command-line usage:
python preprocess_helpsteer_data.py --output_directory /data/helpsteer_processed
