
Heuristic: Reward Model LM Regularization (LLMBook-zh.github.io)

From Leeroopedia



Knowledge Sources
Domains LLMs, Alignment, RLHF
Last Updated 2026-02-08 04:30 GMT

Overview

Combine a contrastive reward loss with a language modeling (LM) loss that acts as a regularizer, preventing reward model collapse.

Description

The reward model uses a dual-loss architecture: a contrastive reward loss (binary cross-entropy on the difference between chosen and rejected reward scores) plus a language modeling loss (cross-entropy on next-token prediction). The LM loss serves as a regularizer, ensuring the reward model retains its language understanding capabilities rather than collapsing to trivially distinguish chosen/rejected pairs.

Usage

Use this heuristic when training a reward model for RLHF. The LM auxiliary loss prevents the reward head from overfitting to superficial patterns in preference data, maintaining the backbone's representational quality.

The Insight (Rule of Thumb)

  • Action: Compute `loss = rm_loss + lm_loss` where rm_loss is binary cross-entropy on reward differences and lm_loss is cross-entropy on next-token prediction.
  • Value: Equal weighting (1:1 ratio) of both losses.
  • Trade-off: The LM loss adds computation but significantly improves reward model robustness and prevents degenerate solutions.
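The rule of thumb above can be sketched end-to-end as follows. This is a minimal, self-contained illustration with random tensors standing in for the model's outputs; in real training, `reward_chosen`/`reward_rejected` come from the reward head and `lm_logits` from the backbone (all names here are illustrative, not from the source):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch = 4

# Stand-in reward scores for a batch of preference pairs
# (in practice: scalar outputs of the reward head).
reward_chosen = torch.randn(batch)    # scores for preferred responses
reward_rejected = torch.randn(batch)  # scores for rejected responses

# Contrastive reward loss: BCE on the score difference,
# with label 1 meaning "chosen should outrank rejected".
logits = reward_chosen - reward_rejected
labels = torch.ones_like(logits)
rm_loss = F.binary_cross_entropy_with_logits(logits, labels)

# LM regularization loss: standard next-token cross-entropy.
# Stand-in logits/targets; in training these come from the backbone.
vocab_size, seq_len = 100, 8
lm_logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))
lm_loss = F.cross_entropy(lm_logits.view(-1, vocab_size), targets.view(-1))

# Combined loss with the heuristic's equal (1:1) weighting.
loss = rm_loss + lm_loss
```

A tunable coefficient on `lm_loss` is the natural generalization if the 1:1 ratio proves too strong or too weak for a given dataset.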

Reasoning

Without LM regularization, the reward model can overfit to spurious correlations in the preference data (e.g., response length, specific keywords). The LM loss forces the model to maintain its language modeling capability, which serves as a form of multi-task regularization. The 1:1 ratio is a simple starting point; in practice, a coefficient can be tuned. The contrastive approach (reward_chosen - reward_rejected) with BCE loss ensures the model learns relative preferences rather than absolute reward values.

Code Evidence:

Contrastive reward loss from `code/8.1 奖励模型训练.py:70-72`:

# Compute the loss for the contrastive training objective
logits = reward0 - reward1
rm_loss = F.binary_cross_entropy_with_logits(
    logits, labels.to(logits.dtype), reduction="mean"
)

LM regularization loss from `code/8.1 奖励模型训练.py:74-75`:

# Compute the imitation-learning regularization loss
lm_loss = self._forward_lmloss(prompt_ids, lm_attn_mask, response_ids)

Combined loss from `code/8.1 奖励模型训练.py:77-78`:

# Compute the final loss
loss = rm_loss + lm_loss

Reward head architecture from `code/8.1 奖励模型训练.py:12`:

self.reward_head = nn.Linear(config.hidden_size, 1, bias=False)
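A bias-free linear head like the one above maps each hidden state to a scalar. The source does not show how a single reward per sequence is pooled from it; a common convention, assumed here, is to read the score from the final token's hidden state:

```python
import torch
import torch.nn as nn

hidden_size = 16  # illustrative; matches config.hidden_size in the source

# Scalar reward head, as in the source architecture.
reward_head = nn.Linear(hidden_size, 1, bias=False)

# Stand-in backbone output: (batch, seq_len, hidden_size).
hidden_states = torch.randn(2, 5, hidden_size)

# Assumed pooling: take the last token's hidden state per sequence,
# then project to one scalar reward each.
reward = reward_head(hidden_states[:, -1, :]).squeeze(-1)  # shape: (batch,)
```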
