Implementation:NVIDIA NeMo Aligner Preprocess HelpSteer Data
| Knowledge Sources | |
|---|---|
| Domains | SteerLM, Data Preprocessing, HelpSteer |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A script that downloads the NVIDIA HelpSteer dataset from HuggingFace and converts it into the attribute-conditioned SFT training format used by SteerLM.
Description
preprocess_helpsteer_data.py automates the download and conversion of the nvidia/HelpSteer dataset from HuggingFace into the NeMo Aligner training format. The processing pipeline:
- Dataset download: Fetches the nvidia/HelpSteer dataset from HuggingFace, including both train and validation splits.
- Format conversion: Each data point is converted into a conversation object with:
  - A system prompt ("A chat between a curious user and an artificial intelligence assistant...")
  - A conversations array with a User turn (prompt) and an Assistant turn (response)
  - The Assistant turn's label field containing the 5 HelpSteer attribute scores: helpfulness, correctness, coherence, complexity, verbosity (e.g., helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1)
  - A mask field set to "User" (indicating which role to mask during training)
  - A type field set to "VALUE_TO_TEXT" (indicating the training task type)
- Output writing: Saves train.jsonl and val.jsonl to the specified output directory.
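The format-conversion step can be sketched in a few lines. This is a minimal illustration, not the script's actual source: the helper names `to_label` and `to_conversation`, the `ATTRIBUTES` constant, and the sample record are assumptions made for the example, and the system prompt is left truncated exactly as it is quoted on this page.

```python
import json

# Assumed attribute order, taken from the attribute list in the description.
ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

# Truncated as quoted on this page; the full prompt text is defined by the script.
SYSTEM_PROMPT = "A chat between a curious user and an artificial intelligence assistant..."


def to_label(record):
    """Join the attribute scores into a 'name:score,...' string."""
    return ",".join(f"{name}:{record[name]}" for name in ATTRIBUTES)


def to_conversation(record):
    """Wrap one HelpSteer-style record in the conversation object."""
    return {
        "system": SYSTEM_PROMPT,
        "conversations": [
            {"value": record["prompt"], "from": "User", "label": None},
            {"value": record["response"], "from": "Assistant", "label": to_label(record)},
        ],
        "mask": "User",
        "type": "VALUE_TO_TEXT",
    }


# Hypothetical record, for illustration only.
sample = {
    "prompt": "What is JSONL?",
    "response": "A text format with one JSON object per line.",
    "helpfulness": 3,
    "correctness": 4,
    "coherence": 3,
    "complexity": 2,
    "verbosity": 1,
}

# One JSONL line: a single JSON object serialized without embedded newlines.
line = json.dumps(to_conversation(sample), ensure_ascii=False)
print(line)
```

Keeping the label as a plain string on the Assistant turn means the attribute scores travel with the response they describe, while the User turn carries a null label.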
Usage
Use this script when:
- You need to prepare the HelpSteer dataset for SteerLM reward model training
- You want to create attribute-conditioned SFT training data from HelpSteer
- You are setting up the SteerLM training pipeline
Code Reference
Source Location
- Repository: NVIDIA_NeMo_Aligner
- File: examples/nlp/data/steerlm/preprocess_helpsteer_data.py
- Lines: 1-82
Signature
download_helpsteer:
def download_helpsteer():
format_label:
def format_label(dp):
process_dataset:
def process_dataset(data):
main:
def main(output_dir):
Import
from preprocess_helpsteer_data import download_helpsteer, format_label, process_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -dir / --output_directory | str | Yes | Output folder for train.jsonl and val.jsonl; created if it does not exist |
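A command-line interface matching the input table could be declared with the standard-library `argparse` module as shown here. This is a sketch under the assumption that the script uses `argparse`; the function name `parse_args` and the description string are illustrative and not taken from the source.

```python
import argparse


def parse_args(argv=None):
    """Parse the single required option described in the Inputs table."""
    parser = argparse.ArgumentParser(description="Preprocess HelpSteer data (illustrative sketch)")
    parser.add_argument(
        "-dir",
        "--output_directory",
        required=True,
        help="Output folder for train.jsonl and val.jsonl; created if it does not exist",
    )
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["--output_directory", "/data/helpsteer_processed"])
print(args.output_directory)  # /data/helpsteer_processed
```

`argparse` derives the attribute name `output_directory` from the first long option, so both `-dir` and `--output_directory` populate the same field.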
Outputs
| Name | Type | Description |
|---|---|---|
| train.jsonl | JSONL file | Training split in attribute-conditioned SFT format with HelpSteer attribute labels |
| val.jsonl | JSONL file | Validation split in attribute-conditioned SFT format with HelpSteer attribute labels |
Each output line is a JSON object with the following structure:
{
"system": "A chat between a curious user and an artificial intelligence assistant...",
"conversations": [
{"value": "<prompt>", "from": "User", "label": null},
{"value": "<response>", "from": "Assistant", "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"}
],
"mask": "User",
"type": "VALUE_TO_TEXT"
}
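To sanity-check a generated file, each line can be parsed and compared against the structure above. The checker below is a hypothetical helper: `parse_label` and `check_record` are not part of the script, the integer-score assumption follows the example label on this page, and the code runs on an in-memory sample line rather than a real train.jsonl.

```python
import json

EXPECTED_ATTRIBUTES = ("helpfulness", "correctness", "coherence", "complexity", "verbosity")


def parse_label(label):
    """Turn 'helpfulness:3,correctness:4,...' into {'helpfulness': 3, ...}."""
    scores = {}
    for pair in label.split(","):
        name, value = pair.split(":")
        scores[name] = int(value)  # assumes integer scores, as in the example label
    return scores


def check_record(line):
    """Parse one JSONL line and return the Assistant scores; raise ValueError on mismatch."""
    obj = json.loads(line)
    user, assistant = obj["conversations"]
    if (user["from"], assistant["from"]) != ("User", "Assistant"):
        raise ValueError("expected a User turn followed by an Assistant turn")
    if obj["mask"] != "User" or obj["type"] != "VALUE_TO_TEXT":
        raise ValueError("unexpected mask or type field")
    scores = parse_label(assistant["label"])
    if tuple(scores) != EXPECTED_ATTRIBUTES:
        raise ValueError("unexpected attribute names or order")
    return scores


# Sample line built from the placeholder structure shown above.
sample_line = json.dumps(
    {
        "system": "A chat between a curious user and an artificial intelligence assistant...",
        "conversations": [
            {"value": "<prompt>", "from": "User", "label": None},
            {
                "value": "<response>",
                "from": "Assistant",
                "label": "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1",
            },
        ],
        "mask": "User",
        "type": "VALUE_TO_TEXT",
    }
)
print(check_record(sample_line))
```

To check a real file, the same function could be applied to every line read from train.jsonl or val.jsonl.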
Usage Examples
# Command-line usage:
python preprocess_helpsteer_data.py --output_directory /data/helpsteer_processed