Principle:NVIDIA NeMo Aligner SteerLM Data Preparation

From Leeroopedia


Knowledge Sources
Domains: SteerLM, Data Preprocessing, Reward Modeling
Last Updated: 2026-02-08 00:00 GMT

Overview

SteerLM data preparation is the process of annotating conversational data with multi-dimensional attribute scores and converting it into formats suitable for training SteerLM reward models and attribute-conditioned SFT models.

Description

SteerLM is NVIDIA's approach to controllable language model alignment that steers model behavior using multiple attribute dimensions rather than a single scalar reward. The data preparation pipeline involves several stages:

  1. Attribute annotation: Conversational datasets are annotated with scores across multiple attribute dimensions. The full set of SteerLM attributes includes:
    • Open Assistant attributes: quality, toxicity, humor, creativity
    • HelpSteer attributes: helpfulness, correctness, coherence, complexity, verbosity
    Annotation is performed by sending conversations to a regression reward model server, which predicts a score (clamped to the 0-4 range) for each attribute at each assistant turn.
  2. Dataset preprocessing: NVIDIA's HelpSteer and HelpSteer2 datasets, hosted on HuggingFace, are converted from their native format into the attribute-conditioned SFT training format. Each conversation is structured with a system prompt, User/Assistant turns, and label strings containing comma-separated attribute:score pairs (e.g., helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1).
  3. Regression format conversion: The attribute-conditioned SFT format is further converted into the regression reward model training format, where each labeled turn produces one training sample consisting of the conversation text up to that turn (with a <extra_id_2> label prefix) and a numeric label vector with one float per attribute.
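
The label strings used in the steps above can be sketched as follows. This is a minimal illustration, not NeMo Aligner's own code: the helper names clamp_score, format_label, and parse_label are hypothetical, the attribute order simply follows the listing above, and rounding scores to integers is an assumption made to match the example label string.

```python
# Attribute order follows the listing in the text (an assumption, not a confirmed constant).
ATTRIBUTES = [
    "quality", "toxicity", "humor", "creativity",
    "helpfulness", "correctness", "coherence", "complexity", "verbosity",
]


def clamp_score(score, low=0.0, high=4.0):
    """Clamp a predicted score to the 0-4 range described above."""
    return max(low, min(high, score))


def format_label(scores):
    """Build a comma-separated attribute:score string from a dict of raw scores.

    Rounding to integers is an assumption for illustration.
    """
    parts = []
    for name in ATTRIBUTES:
        if name in scores:
            value = int(round(clamp_score(scores[name])))
            parts.append(f"{name}:{value}")
    return ",".join(parts)


def parse_label(label):
    """Parse an attribute:score string back into a dict of floats."""
    result = {}
    for pair in label.split(","):
        name, value = pair.split(":")
        result[name.strip()] = float(value)
    return result


raw_scores = {
    "helpfulness": 3.2, "correctness": 4.7, "coherence": 3.0,
    "complexity": 2.1, "verbosity": 0.9,
}
label_string = format_label(raw_scores)
```

Here the out-of-range correctness score of 4.7 is clamped to 4 before the string is built, and parse_label recovers the per-attribute values from the resulting string.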

The NeMo chat template uses special tokens: <extra_id_0> for System, <extra_id_1> for User/Assistant turns, and <extra_id_2> as the label prefix.
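
As a rough sketch of how these tokens could be used when producing regression samples, the function below walks a conversation and emits one sample per labeled turn. The function name to_regression_samples, the dict keys (from, value, label), the exact newline layout, and the choice to include the labeled response in the text are all assumptions for illustration; only the three special tokens come from the description above.

```python
SYSTEM_TOKEN = "<extra_id_0>"  # marks the system prompt
TURN_TOKEN = "<extra_id_1>"    # marks each User/Assistant turn
LABEL_TOKEN = "<extra_id_2>"   # label prefix


def to_regression_samples(system, turns):
    """Emit one {"text", "label"} sample for every turn that carries a label.

    The newline layout and the input schema are assumptions for this sketch.
    """
    text = f"{SYSTEM_TOKEN}System\n{system}\n"
    samples = []
    for turn in turns:
        text += f"{TURN_TOKEN}{turn['from']}\n{turn['value']}\n"
        label = turn.get("label")
        if label is not None:
            # Conversation so far, followed by the label prefix.
            samples.append({"text": text + LABEL_TOKEN, "label": label})
    return samples


conversation = [
    {"from": "User", "value": "What is 2 + 2?"},
    {"from": "Assistant", "value": "2 + 2 equals 4.",
     "label": "helpfulness:4,correctness:4"},
]
samples = to_regression_samples("A helpful assistant.", conversation)
```

Because the running text is accumulated across turns, a multi-turn conversation with several labeled assistant turns yields several samples, each containing all earlier turns as context.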

Usage

Use SteerLM data preparation when:

  • You are training a SteerLM regression reward model that predicts multi-dimensional attribute scores
  • You are preparing data for attribute-conditioned supervised fine-tuning
  • You need to annotate conversations with quality, helpfulness, correctness, and other attributes
  • You are working with the NVIDIA HelpSteer or HelpSteer2 datasets

Theoretical Basis

SteerLM is based on the principle that alignment is multidimensional. Rather than collapsing human preferences into a single scalar reward (as in standard RLHF), SteerLM decomposes alignment into multiple interpretable attributes:

  • Helpfulness: How well the response addresses the user's request
  • Correctness: Factual accuracy of the response
  • Coherence: Logical consistency and readability
  • Complexity: Technical depth and sophistication
  • Verbosity: Length and detail level of the response

By training a regression reward model that predicts a score along each dimension independently, SteerLM enables controllable generation at inference time: users or system designers specify the desired attribute values to steer the model's output. This gives finer-grained control than binary preference optimization and allows trade-offs between different quality dimensions.
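
To make the steering step concrete, the sketch below builds an inference prompt that ends with a desired attribute string, leaving the model to generate a response conditioned on it. The function name steering_prompt and the exact placement of the label line after the Assistant header are assumptions; the special tokens and the attribute:score syntax follow the description earlier on this page.

```python
def steering_prompt(system, user_message, desired):
    """Build a prompt whose final line states the desired attribute values.

    `desired` maps attribute names to target scores; the layout is an
    assumption for illustration.
    """
    label = ",".join(f"{name}:{value}" for name, value in desired.items())
    return (
        f"<extra_id_0>System\n{system}\n"
        f"<extra_id_1>User\n{user_message}\n"
        f"<extra_id_1>Assistant\n"
        f"<extra_id_2>{label}\n"
    )


# Ask for a highly helpful and correct answer that is also brief.
prompt = steering_prompt(
    "A helpful assistant.",
    "Explain what a reward model is.",
    {"helpfulness": 4, "correctness": 4, "verbosity": 1},
)
```

Changing only the verbosity value in the desired dict would request a longer or shorter answer while holding the other targets fixed, which is the kind of trade-off described above.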

The regression format uses a label vector where each position corresponds to a specific attribute, with -100 used as a sentinel value for attributes not present in a given sample's annotations.
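
The label vector can be illustrated with a short sketch. The attribute order, the helper names label_vector and masked_mse, and the idea that a loss would skip sentinel positions are assumptions for illustration; the -100 sentinel itself is taken from the description above.

```python
# Fixed attribute order; each index is one position in the label vector.
# The order follows the listing on this page and is an assumption.
ATTRIBUTES = [
    "quality", "toxicity", "humor", "creativity",
    "helpfulness", "correctness", "coherence", "complexity", "verbosity",
]
MISSING = -100.0  # sentinel for attributes absent from a sample's annotations


def label_vector(label):
    """Convert an attribute:score string into a fixed-length float vector."""
    vector = [MISSING] * len(ATTRIBUTES)
    for pair in label.split(","):
        name, value = pair.split(":")
        vector[ATTRIBUTES.index(name.strip())] = float(value)
    return vector


def masked_mse(predictions, targets):
    """Mean squared error over positions whose target is not the sentinel.

    This is one plausible way to consume the sentinel, not a confirmed
    description of the training loss.
    """
    pairs = [(p, t) for p, t in zip(predictions, targets) if t != MISSING]
    if not pairs:
        return 0.0
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


targets = label_vector(
    "helpfulness:3,correctness:4,coherence:3,complexity:2,verbosity:1"
)
predictions = [0.0, 0.0, 0.0, 0.0, 3.0, 4.0, 3.0, 2.0, 2.0]
loss = masked_mse(predictions, targets)
```

A HelpSteer sample fills only the last five positions, so the four Open Assistant positions keep the sentinel and contribute nothing to the masked error.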
