Principle:NVIDIA NeMo Aligner SteerLM Training

Knowledge Sources	NVIDIA_NeMo_Aligner
Domains	Natural Language Processing, Controllable Generation, Alignment
Last Updated	2026-02-08 00:00 GMT

Overview

SteerLM is an attribute-conditioned training approach that enables controllable text generation by conditioning on attribute labels (e.g., helpfulness, creativity, verbosity). SteerLM v2 extends this with iterative training that uses importance-weighted baseline corrections.

Description

SteerLM is an alignment method that steers language model outputs by conditioning generation on explicit attribute labels. Unlike RLHF (Reinforcement Learning from Human Feedback), which learns a single scalar reward, SteerLM trains models to be responsive to multi-dimensional quality attributes such as helpfulness, correctness, creativity, and verbosity.
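To make attribute conditioning concrete, the sketch below builds a prompt that carries explicit attribute values. The helper name `build_conditioned_prompt`, the `<attributes>` tag, and the 0-4 score scale are hypothetical and chosen only for illustration; the template actually used by NeMo Aligner is not shown on this page and may differ.

```python
def build_conditioned_prompt(prompt: str, attributes: dict) -> str:
    """Append an attribute string to a prompt (hypothetical format)."""
    # Serialize attributes as "name:value" pairs in insertion order.
    attr_str = ",".join(f"{name}:{value}" for name, value in attributes.items())
    # The <attributes> tag is an illustrative placeholder, not a NeMo token.
    return f"{prompt}\n<attributes>{attr_str}</attributes>\n"


conditioned = build_conditioned_prompt(
    "Explain what a hash map is.",
    {"helpfulness": 4, "correctness": 4, "verbosity": 1},
)
print(conditioned)
```

Changing only the attribute values (for example, raising `verbosity`) while keeping the prompt fixed is how a model trained this way would be steered at inference time.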

SteerLM v2 (implemented in the codebase) builds upon this foundation with an iterative training approach:

Baseline weight computation -- For each group of generated responses, the model first computes a forward pass to obtain the negative log-likelihood (NLL) for each response. These NLLs, combined with the pre-stored log-probabilities $\log Q (y | a, x)$ , form the baseline distribution.

Importance-weighted training -- The weight for each response is computed as the difference between the target importance sampling weights (ws) and the baseline softmax probabilities. This ensures that training updates push the model toward the target distribution while accounting for the current model's distribution.

Per-response weighting -- The loss for each token is weighted by the per-response importance weight and normalized by the average number of valid tokens in the micro-batch.
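The three steps above can be sketched in plain Python for a single group of responses. This is a minimal illustration and not the NeMo Aligner implementation: the function names are hypothetical, the inputs are toy numbers, and "average number of valid tokens" is assumed to mean the total count of unmasked tokens divided by the number of responses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def response_weights(nll, log_q, ws):
    """Steps 1-2: baseline b_i and weight w_i = ws_i - b_i for one group."""
    baseline = softmax([-n - q for n, q in zip(nll, log_q)])
    weights = [w - b for w, b in zip(ws, baseline)]
    return weights, baseline


def weighted_loss(token_nll, mask, weights):
    """Step 3: per-response weighted token loss, normalized by the
    average number of valid (unmasked) tokens per response (assumption)."""
    per_response = [
        sum(loss * m for loss, m in zip(row, mask_row))
        for row, mask_row in zip(token_nll, mask)
    ]
    avg_valid = sum(sum(mask_row) for mask_row in mask) / len(mask)
    return sum(w * r for w, r in zip(weights, per_response)) / avg_valid


# Toy group of two responses; the first has one padded position.
token_nll = [[0.5, 0.5, 0.0], [1.0, 1.0, 1.0]]
mask = [[1, 1, 0], [1, 1, 1]]
nll = [sum(l * m for l, m in zip(r, mr)) for r, mr in zip(token_nll, mask)]
log_q = [-1.0, -3.0]   # pre-stored log Q(y | a, x), toy values
ws = [0.8, 0.2]        # target importance sampling weights, toy values

weights, baseline = response_weights(nll, log_q, ws)
loss = weighted_loss(token_nll, mask, weights)
print(baseline, weights, loss)
```

In this toy group the baseline is uniform, so the first response receives a positive weight (about 0.3) and the second a negative one (about -0.3). Because both the target weights and the baseline sum to one, the per-response weights sum to zero, and the resulting loss can be negative.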

The model tracks a distance metric that measures the KL divergence between the target importance sampling distribution and the current model's baseline distribution, providing a measure of training progress.
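A minimal sketch of this distance metric, assuming both inputs are probability vectors over the same group of responses. The function name `kl_distance` and the `eps` guard are illustrative choices and are not taken from the codebase.

```python
import math


def kl_distance(ws, baseline, eps=1e-12):
    """KL(ws || baseline) for two discrete distributions over one group."""
    total = 0.0
    for w, b in zip(ws, baseline):
        if w > 0.0:  # terms with zero target weight contribute nothing
            total += w * math.log(w / max(b, eps))
    return total


far = kl_distance([0.8, 0.2], [0.5, 0.5])
near = kl_distance([0.8, 0.2], [0.7, 0.3])
same = kl_distance([0.8, 0.2], [0.8, 0.2])
print(far, near, same)
```

The value shrinks as the baseline moves toward the target weights and is exactly zero when the two distributions coincide.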

Usage

SteerLM training is used when:

You want to build a language model whose generation style can be controlled at inference time by specifying desired attribute values.
You have a dataset annotated with multi-dimensional quality labels (e.g., helpfulness scores, creativity scores).
You want an alternative to RLHF that provides finer-grained control over generation characteristics.
You want to iteratively improve model alignment using importance-weighted corrections (SteerLM v2).

Theoretical Basis

SteerLM v2 Objective:

For a prompt $x$ with attribute vector $a$ and a set of $N$ candidate responses $\{y_{1}, \dots, y_{N}\}$, the training objective uses an importance-weighted policy gradient:

Baseline computation: $b_{i} = \mathrm{softmax}_{i} (- \mathrm{NLL} (y_{i} | x) - \log Q (y_{i} | a, x))$

Weight computation: $w_{i} = w_{i}^{target} - b_{i}$

where $w_{i}^{target}$ is the target importance sampling weight derived from the attribute-conditioned reward.

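Weighted loss: $\mathcal{L} = - \sum_{i} w_{i} \cdot \sum_{t} \log P_{\theta} (y_{i, t} | y_{i, < t}, x)$, i.e., the token-level negative log-likelihood of each response weighted by $w_{i}$.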

Distance metric (KL divergence): $D_{KL} = \sum_{i} w_{i}^{target} \log \frac{w_{i}^{target}}{b_{i}}$

This distance decreases as the model's distribution approaches the target, providing a convergence signal.
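A short derivation suggests why the difference between the target weights and the baseline appears as the per-response weight. It assumes that the target weights sum to one over the group and do not depend on $\theta$, and it writes $s_{i}$ as shorthand for the response log-likelihood; neither assumption is stated explicitly on this page.

```latex
\begin{aligned}
s_i &= \log P_{\theta}(y_i \mid x) = -\mathrm{NLL}(y_i \mid x), \qquad
b_i = \frac{\exp\bigl(s_i - \log Q(y_i \mid a, x)\bigr)}
           {\sum_{j} \exp\bigl(s_j - \log Q(y_j \mid a, x)\bigr)} \\
\nabla_{\theta} \log b_i &= \nabla_{\theta} s_i - \sum_{j} b_j \, \nabla_{\theta} s_j \\
\nabla_{\theta} D_{KL}
  &= -\sum_{i} w_i^{target} \, \nabla_{\theta} \log b_i
   = -\sum_{i} \bigl(w_i^{target} - b_i\bigr) \nabla_{\theta} s_i
   = -\sum_{i} w_i \, \nabla_{\theta} s_i
\end{aligned}
```

Under these assumptions, a descent step on $D_{KL}$ raises the log-likelihood of responses with $w_{i} > 0$ (under-represented relative to the target) and lowers it for responses with $w_{i} < 0$.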
