Principle:NVIDIA NeMo Aligner SteerLM Training
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Controllable Generation, Alignment |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
SteerLM is an attribute-conditioned training approach that enables controllable text generation by conditioning on attribute labels (e.g., helpfulness, creativity, verbosity). SteerLM v2 extends it with iterative training that uses importance-weighted baseline corrections.
Description
SteerLM is an alignment method that steers language model outputs by conditioning generation on explicit attribute labels. Unlike RLHF (Reinforcement Learning from Human Feedback), which optimizes a policy against a single scalar reward, SteerLM trains the model to respond to multi-dimensional quality attributes such as helpfulness, correctness, creativity, and verbosity.
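To make attribute conditioning concrete, the sketch below serializes a set of attribute labels into a string and attaches it to a prompt. The template, the `[attributes]` and `[response]` markers, and the function names are hypothetical and chosen only for illustration; they are not the format used by NeMo Aligner.

```python
from typing import Dict


def format_attributes(attributes: Dict[str, int]) -> str:
    """Serialize attribute labels as 'name:value' pairs in a stable (sorted) order."""
    return ",".join(f"{name}:{value}" for name, value in sorted(attributes.items()))


def build_conditioned_prompt(prompt: str, attributes: Dict[str, int]) -> str:
    # Hypothetical template: the attribute string follows the prompt, so the
    # response is generated conditioned on both the prompt and the attributes.
    return f"{prompt}\n[attributes] {format_attributes(attributes)}\n[response] "


example = build_conditioned_prompt(
    "Explain gradient clipping.",
    {"helpfulness": 4, "verbosity": 1, "creativity": 0},
)
print(example)
```

At inference time, changing only the attribute values (for example raising `verbosity`) would request a different style of response for the same prompt.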
SteerLM v2 (implemented in the codebase) builds upon this foundation with an iterative training approach:
- Baseline weight computation -- For each group of generated responses, the model first runs a forward pass to obtain the negative log-likelihood (NLL) of each response. These NLLs, combined with the pre-stored log-probabilities of the responses, form the baseline distribution.
- Importance-weighted training -- The weight for each response is computed as the difference between the target importance sampling weights (ws) and the baseline softmax probabilities. This ensures that training updates push the model toward the target distribution while accounting for the current model's distribution.
- Per-response weighting -- The loss for each token is weighted by the per-response importance weight and normalized by the average number of valid tokens in the micro-batch.
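The three steps above can be sketched in plain Python for one group of responses. This is a minimal illustration under stated assumptions, not the NeMo Aligner implementation: it assumes the baseline logit of a response is its model log-likelihood (the negated NLL, summed over valid tokens) minus its pre-stored log-probability, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one group of responses."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_probs(nlls: Sequence[float], stored_log_probs: Sequence[float]) -> List[float]:
    # Assumption: logit_i = (-NLL_i) - stored_log_prob_i, softmaxed over the group.
    return softmax([-nll - lp for nll, lp in zip(nlls, stored_log_probs)])


def response_weights(ws: Sequence[float], baseline: Sequence[float]) -> List[float]:
    # Per-response weight: target importance weight minus baseline probability.
    return [w - b for w, b in zip(ws, baseline)]


def weighted_loss(token_nlls, loss_masks, weights) -> float:
    # Weight each response's masked token NLLs, then normalize by the
    # average number of valid tokens per response in the micro-batch.
    valid_counts = [sum(mask) for mask in loss_masks]
    avg_valid = sum(valid_counts) / len(valid_counts)
    total = 0.0
    for row, mask, weight in zip(token_nlls, loss_masks, weights):
        total += weight * sum(n * m for n, m in zip(row, mask))
    return total / avg_valid


# Toy micro-batch: two responses, the first padded to length three.
token_nlls = [[0.5, 0.5, 0.0], [1.0, 1.0, 1.0]]
loss_masks = [[1, 1, 0], [1, 1, 1]]
nlls = [sum(n * m for n, m in zip(row, mask)) for row, mask in zip(token_nlls, loss_masks)]
stored_log_probs = [-1.0, -3.0]  # made-up pre-stored log-probabilities
ws = [0.8, 0.2]                  # made-up target importance sampling weights

baseline = baseline_probs(nlls, stored_log_probs)
weights = response_weights(ws, baseline)
loss = weighted_loss(token_nlls, loss_masks, weights)
print(baseline, weights, loss)
```

Because both `ws` and the baseline sum to one, the per-response weights sum to zero: responses that the target favors more than the current model does get positive weight, and the rest get negative weight.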
The model tracks a distance metric that measures the KL divergence between the target importance sampling distribution and the current model's baseline distribution, providing a measure of training progress.
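A minimal sketch of such a distance, assuming it is the KL divergence from the target weights to the baseline probabilities within one group of responses (the function name is hypothetical):

```python
import math
from typing import Sequence


def kl_distance(ws: Sequence[float], baseline: Sequence[float]) -> float:
    """KL(ws || baseline) for two discrete distributions over the same responses.

    Terms with a zero target weight contribute nothing and are skipped.
    """
    return sum(
        w * (math.log(w) - math.log(b))
        for w, b in zip(ws, baseline)
        if w > 0.0
    )


matched = kl_distance([0.5, 0.5], [0.5, 0.5])      # distributions agree
mismatched = kl_distance([0.8, 0.2], [0.5, 0.5])   # distributions differ
print(matched, mismatched)
```

The value is zero when the baseline matches the target exactly and positive otherwise.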
Usage
SteerLM training is used when:
- You want to build a language model whose generation style can be controlled at inference time by specifying desired attribute values.
- You have a dataset annotated with multi-dimensional quality labels (e.g., helpfulness scores, creativity scores).
- You want an alternative to RLHF that provides more fine-grained control over generation characteristics.
- You want to iteratively improve model alignment using importance-weighted corrections (SteerLM v2).
Theoretical Basis
SteerLM v2 Objective:
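For a prompt $x$ with attribute vector $a$ and a set of candidate responses $\{y_1, \dots, y_N\}$, the training objective uses an importance-weighted policy gradient. In the notation used here, $Q_\theta$ is the current model and $\log Q'(y_i \mid x, a)$ is the pre-stored log-probability of response $y_i$.
Baseline computation (taking the baseline to be the group-wise softmax of the model log-likelihood minus the pre-stored log-probability):
$$b_i = \frac{\exp\bigl(-\mathrm{NLL}_i - \log Q'(y_i \mid x, a)\bigr)}{\sum_{j=1}^{N} \exp\bigl(-\mathrm{NLL}_j - \log Q'(y_j \mid x, a)\bigr)}, \qquad \mathrm{NLL}_i = -\log Q_\theta(y_i \mid x, a)$$
Weight computation:
$$w_i = w^{s}_i - b_i$$
where $w^{s}_i$ (`ws`) is the target importance sampling weight derived from the attribute-conditioned reward.
Weighted loss:
$$\mathcal{L} = \frac{1}{\bar{T}} \sum_{i=1}^{N} w_i \sum_{t} m_{i,t} \bigl(-\log Q_\theta(y_{i,t} \mid x, a, y_{i,<t})\bigr)$$
where $m_{i,t}$ is the loss mask marking valid tokens and $\bar{T}$ is the average number of valid tokens per response in the micro-batch.
Distance metric (KL divergence):
$$D = \mathrm{KL}\bigl(w^{s} \,\|\, b\bigr) = \sum_{i=1}^{N} w^{s}_i \bigl(\log w^{s}_i - \log b_i\bigr)$$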
This distance decreases as the model's distribution approaches the target, providing a convergence signal.
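The convergence behavior can be illustrated with a toy simulation. The sketch below is not a language model: it assumes a tabular "model" that is a categorical distribution over a fixed set of three candidate responses, parameterized by logits. For that case, a gradient step on the weighted loss reduces to adding `lr * (target weight - baseline)` to each logit. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)


def simulate(ws, stored_log_probs, steps: int = 200, lr: float = 0.5) -> List[float]:
    """Iterate the weighted update on a tabular model and record the distance."""
    logits = list(stored_log_probs)  # start with the model equal to the sampler
    distances = []
    for _ in range(steps):
        log_z = math.log(sum(math.exp(l) for l in logits))
        model_log_probs = [l - log_z for l in logits]
        # Baseline: softmax of model log-prob minus pre-stored log-prob.
        baseline = softmax([lp - q for lp, q in zip(model_log_probs, stored_log_probs)])
        distances.append(kl_divergence(ws, baseline))
        # Tabular gradient step: each logit moves by lr * (target - baseline).
        logits = [l + lr * (w - b) for l, w, b in zip(logits, ws, baseline)]
    return distances


ws = [0.7, 0.2, 0.1]                                          # made-up target weights
stored_log_probs = [math.log(p) for p in (0.2, 0.3, 0.5)]     # made-up sampler log-probs
distances = simulate(ws, stored_log_probs)
print(f"first={distances[0]:.4f} last={distances[-1]:.2e}")
```

In this toy setting the update is plain gradient descent on the KL distance itself, so the recorded values shrink toward zero as the baseline approaches the target weights. With a real language model the trend is noisier, since parameters are shared across responses and batches.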