Implementation:NVIDIA NeMo Aligner Preprocess HelpSteer Data

From Leeroopedia


Knowledge Sources
Domains: SteerLM, Data Preprocessing, HelpSteer
Last Updated: 2026-02-08 00:00 GMT

Overview

A script that downloads the NVIDIA HelpSteer dataset from HuggingFace and converts it into the attribute-conditioned SFT training format used by SteerLM.

Description

preprocess_helpsteer_data.py downloads the nvidia/HelpSteer dataset from HuggingFace and converts it into the NeMo Aligner training format. The pipeline has three steps:

  1. Dataset download: Fetches the nvidia/HelpSteer dataset from HuggingFace, including both train and validation splits.
  2. Format conversion: Each data point is converted into a conversation object with:
    • A system prompt ("A chat between a curious user and an artificial intelligence assistant...")
    • A conversations array with a User turn (prompt) and an Assistant turn (response)
    • The Assistant turn's label field, a comma-separated string of attribute:score pairs for the five HelpSteer attributes: helpfulness, correctness, coherence, complexity, verbosity (e.g., helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1)
    • A mask field set to "User" (indicating which role to mask during training)
    • A type field set to "VALUE_TO_TEXT" (indicating the training task type)
  3. Output writing: Saves train.jsonl and val.jsonl to the specified output directory.
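
The format conversion in step 2 can be sketched as follows. This is a minimal illustration written from the description above, not the script's actual source: the helper names format_label_sketch and to_conversation_sketch are hypothetical, the sample data point is invented, and the system prompt is kept truncated exactly as it appears on this page.

```python
# Attribute order taken from the label example on this page.
ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# The page elides the full system prompt; the ellipsis is kept as-is.
SYSTEM_PROMPT = "A chat between a curious user and an artificial intelligence assistant..."


def format_label_sketch(dp):
    """Join the five attribute scores into comma-separated 'name:score' pairs."""
    return ",".join(f"{name}:{dp[name]}" for name in ATTRIBUTES)


def to_conversation_sketch(dp):
    """Build one record with the fields listed in step 2."""
    return {
        "system": SYSTEM_PROMPT,
        "conversations": [
            {"value": dp["prompt"], "from": "User", "label": None},
            {"value": dp["response"], "from": "Assistant", "label": format_label_sketch(dp)},
        ],
        "mask": "User",
        "type": "VALUE_TO_TEXT",
    }


# Invented data point for illustration only.
sample = {
    "prompt": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
    "helpfulness": 3,
    "correctness": 4,
    "coherence": 3,
    "complexity": 2,
    "verbosity": 1,
}

record = to_conversation_sketch(sample)
print(record["conversations"][1]["label"])
# helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1
```

Only the Assistant turn carries a label; the User turn's label stays None, which serializes to null in JSON.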

Usage

Use this script when:

  • You need to prepare the HelpSteer dataset for SteerLM reward model training
  • You want to create attribute-conditioned SFT training data from HelpSteer
  • You are setting up the SteerLM training pipeline

Code Reference

Source Location

  • Repository: NVIDIA_NeMo_Aligner
  • File: examples/nlp/data/steerlm/preprocess_helpsteer_data.py
  • Lines: 1-82

Signature

download_helpsteer:

def download_helpsteer():

format_label:

def format_label(dp):

process_dataset:

def process_dataset(data):

main:

def main(output_dir):

Import

from preprocess_helpsteer_data import download_helpsteer, format_label, process_dataset

I/O Contract

Inputs

Name | Type | Required | Description
-dir / --output_directory | str | Yes | Output folder for train.jsonl and val.jsonl; created if it does not exist
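
A command-line interface matching this input can be sketched with argparse. The flag names come from the entry above; the function name build_parser and the description string are assumptions made for illustration, and the real script may define its parser differently.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the single required output-directory flag."""
    parser = argparse.ArgumentParser(description="Preprocess HelpSteer data (illustrative)")
    parser.add_argument(
        "-dir",
        "--output_directory",
        required=True,
        help="Output folder for train.jsonl and val.jsonl; created if it does not exist",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--output_directory", "/data/helpsteer_processed"])
print(args.output_directory)
# /data/helpsteer_processed
```

argparse derives the attribute name output_directory from the long option, so both spellings of the flag populate the same value.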

Outputs

Name | Type | Description
train.jsonl | JSONL file | Training split in attribute-conditioned SFT format with HelpSteer attribute labels
val.jsonl | JSONL file | Validation split in attribute-conditioned SFT format with HelpSteer attribute labels
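
Writing the two output files can be sketched as below. The helper names write_jsonl and save_splits are hypothetical and the toy records are placeholders; the sketch only illustrates creating the output directory when it is missing and writing one JSON object per line.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def save_splits(train, val, output_dir):
    """Create output_dir if needed, then write train.jsonl and val.jsonl."""
    os.makedirs(output_dir, exist_ok=True)
    write_jsonl(train, os.path.join(output_dir, "train.jsonl"))
    write_jsonl(val, os.path.join(output_dir, "val.jsonl"))


# Toy records standing in for the converted splits.
toy_train = [{"mask": "User", "type": "VALUE_TO_TEXT"}]
toy_val = [{"mask": "User", "type": "VALUE_TO_TEXT"}]

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "helpsteer_processed")  # does not exist yet
    save_splits(toy_train, toy_val, out_dir)
    written = sorted(os.listdir(out_dir))

print(written)
# ['train.jsonl', 'val.jsonl']
```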

Each output line is a JSON object with the following structure:

{
  "system": "A chat between a curious user and an artificial intelligence assistant...",
  "conversations": [
    {"value": "<prompt>", "from": "User", "label": null},
    {"value": "<response>", "from": "Assistant", "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"}
  ],
  "mask": "User",
  "type": "VALUE_TO_TEXT"
}
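
To check a generated file, a consumer might parse each line and split the label string back into per-attribute scores. The sketch below assumes scores are integers, as in the example above; parse_label is a hypothetical helper and the prompt and response values are placeholders.

```python
import json


def parse_label(label):
    """Turn 'helpfulness:3,correctness:4,...' into {'helpfulness': 3, ...}."""
    scores = {}
    for pair in label.split(","):
        name, value = pair.split(":")
        scores[name] = int(value)  # assumes integer scores
    return scores


# One line as it might appear in train.jsonl, built here so the sketch is self-contained.
line = json.dumps(
    {
        "system": "A chat between a curious user and an artificial intelligence assistant...",
        "conversations": [
            {"value": "<prompt>", "from": "User", "label": None},
            {
                "value": "<response>",
                "from": "Assistant",
                "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1",
            },
        ],
        "mask": "User",
        "type": "VALUE_TO_TEXT",
    }
)

record = json.loads(line)
assistant_turn = next(t for t in record["conversations"] if t["from"] == "Assistant")
scores = parse_label(assistant_turn["label"])
print(scores)
# {'helpfulness': 3, 'correctness': 4, 'coherence': 3, 'complexity': 2, 'verbosity': 1}
```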

Usage Examples

# Command-line usage:
python preprocess_helpsteer_data.py --output_directory /data/helpsteer_processed
