Principle:NVIDIA NeMo Aligner SteerLM Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | SteerLM, Data Preprocessing, Reward Modeling |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
SteerLM data preparation is the process of annotating conversational data with multi-dimensional attribute scores and converting it into formats suitable for training SteerLM reward models and attribute-conditioned SFT models.
Description
SteerLM is NVIDIA's approach to controllable language model alignment that steers model behavior using multiple attribute dimensions rather than a single scalar reward. The data preparation pipeline involves several stages:
- Attribute annotation: Conversational datasets are annotated with scores across multiple attribute dimensions. The full set of SteerLM attributes includes:
  - Open Assistant attributes: quality, toxicity, humor, creativity
  - HelpSteer attributes: helpfulness, correctness, coherence, complexity, verbosity

  Annotation is performed by sending conversations to a regression reward model server, which predicts scores (clamped to the 0-4 range) for each attribute at each assistant turn.
- Dataset preprocessing: The HelpSteer and HelpSteer2 datasets from NVIDIA on HuggingFace are converted from their native format into the attribute-conditioned SFT training format. Each conversation is structured with a system prompt, User/Assistant turns, and label strings containing comma-separated attribute:score pairs (e.g., helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1).
- Regression format conversion: The attribute-conditioned SFT format is further converted into regression reward model training format, where each labeled turn produces a training sample consisting of the conversation text up to that turn (with a <extra_id_2> label prefix) and a numeric label vector with one float per attribute.
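As a rough illustration of the annotation and label-string steps, the sketch below clamps raw predicted scores to the 0-4 range and joins them into a comma-separated attribute:score string. The function names, the fixed attribute order, and the rounding to whole numbers are assumptions made for this example; they are not taken from the NeMo Aligner scripts.

```python
# Hypothetical helper names; the attribute list comes from the text above.
HELPSTEER_ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]


def clamp_score(value, low=0.0, high=4.0):
    """Clamp a raw predicted score into the [low, high] range."""
    return max(low, min(high, value))


def format_label(scores):
    """Build an attribute:score label string from a dict of raw scores.

    Assumption: scores are rounded to whole numbers after clamping.
    Attributes missing from `scores` are simply left out of the string.
    """
    parts = []
    for name in HELPSTEER_ATTRIBUTES:
        if name in scores:
            parts.append(f"{name}:{round(clamp_score(scores[name]))}")
    return ",".join(parts)


raw = {"helpfulness": 3.2, "correctness": 4.7, "coherence": 2.9,
       "complexity": 1.6, "verbosity": -0.3}
label = format_label(raw)
print(label)
```

Out-of-range predictions such as 4.7 or -0.3 are pulled back to the nearest bound before formatting, so the resulting string always stays within the 0-4 scale.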
The NeMo chat template uses special tokens: <extra_id_0> for System, <extra_id_1> for User/Assistant turns, and <extra_id_2> as the label prefix.
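To make the regression format conversion more concrete, here is a minimal sketch of how one conversation could expand into one sample per labeled turn. Only the three special tokens come from the description above; the newline layout, the dictionary keys (from, value, label, text), and the function name are assumptions for illustration and may differ from the actual NeMo Aligner scripts. The label is kept as the raw string here; encoding it as a numeric vector is a separate step.

```python
SYSTEM_TOKEN = "<extra_id_0>"  # System
TURN_TOKEN = "<extra_id_1>"    # User/Assistant turns
LABEL_PREFIX = "<extra_id_2>"  # label prefix


def to_regression_samples(system, turns):
    """Emit one sample per labeled turn (hypothetical layout).

    `turns` is a list of dicts with keys "from", "value" and an
    optional "label" (attribute:score string).
    """
    text = f"{SYSTEM_TOKEN}System\n{system}\n"
    samples = []
    for turn in turns:
        # Append this turn to the running conversation text.
        text += f"{TURN_TOKEN}{turn['from']}\n{turn['value']}\n"
        label = turn.get("label")
        if label:
            # Conversation up to this turn, followed by the label prefix.
            samples.append({"text": text + LABEL_PREFIX, "label": label})
    return samples


turns = [
    {"from": "User", "value": "Hi"},
    {"from": "Assistant", "value": "Hello", "label": "helpfulness:3"},
    {"from": "User", "value": "More"},
    {"from": "Assistant", "value": "Sure", "label": "helpfulness:4"},
]
samples = to_regression_samples("sys", turns)
for sample in samples:
    print(repr(sample["text"]), sample["label"])
```

Each later sample contains the full text of the earlier ones, so a two-turn labeled conversation yields two samples of increasing length.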
Usage
Use SteerLM data preparation when:
- You are training a SteerLM regression reward model that predicts multi-dimensional attribute scores
- You are preparing data for attribute-conditioned supervised fine-tuning
- You need to annotate conversations with quality, helpfulness, correctness, and other attributes
- You are working with the NVIDIA HelpSteer or HelpSteer2 datasets
Theoretical Basis
SteerLM is based on the principle that alignment is multidimensional. Rather than collapsing human preferences into a single scalar reward (as in standard RLHF), SteerLM decomposes alignment into multiple interpretable attributes:
- Helpfulness: How well the response addresses the user's request
- Correctness: Factual accuracy of the response
- Coherence: Logical consistency and readability
- Complexity: Technical depth and sophistication
- Verbosity: Length and detail level of the response
By training a regression reward model that predicts scores along each dimension independently, SteerLM enables controllable generation at inference time -- users or system designers can specify desired attribute values to steer the model's output. This approach provides more fine-grained control than binary preference optimization and allows trading off between different quality dimensions.
The regression format uses a label vector where each position corresponds to a specific attribute, with -100 used as a sentinel value for attributes not present in a given sample's annotations.
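The label vector can be sketched as follows, assuming a fixed attribute order (Open Assistant attributes first, then HelpSteer attributes). That order and the function name are assumptions for this example; only the -100 sentinel and the attribute names come from the text.

```python
# Assumed position of each attribute in the label vector.
ATTRIBUTES = [
    "quality", "toxicity", "humor", "creativity",
    "helpfulness", "correctness", "coherence", "complexity", "verbosity",
]
MISSING = -100.0  # sentinel for attributes absent from a sample's annotations


def label_to_vector(label):
    """Convert an attribute:score string into a fixed-length float vector."""
    scores = {}
    for pair in label.split(","):
        if not pair.strip():
            continue  # tolerate empty strings and trailing commas
        name, _, value = pair.partition(":")
        scores[name.strip()] = float(value)
    return [scores.get(name, MISSING) for name in ATTRIBUTES]


vector = label_to_vector(
    "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"
)
print(vector)
```

A HelpSteer sample carries no Open Assistant annotations, so under the assumed ordering its first four positions hold the sentinel and the last five hold the annotated scores.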