Principle: CarperAI trlx Q-Value Guided Generation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Offline_RL, Text_Generation |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A generation principle that uses learned Q-value and value function estimates to guide autoregressive token sampling toward higher-reward sequences.
Description
After ILQL training, the model has learned Q-value heads (estimating action values) and value heads (estimating state values) alongside the standard language model logits. During generation, these heads are used to modify token probabilities: tokens with higher advantage (Q - V) are upweighted, biasing generation toward sequences that the Q-function predicts will receive higher rewards.
This is distinct from standard language model sampling, which uses only the base model logits. Q-value guided generation adds an "advantage bonus" to the log-probabilities before sampling, with a scaling parameter (beta) controlling the strength of guidance. Higher beta means stronger reward optimization but potentially less fluent text.
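The advantage-bonus modification described above can be sketched as a single sampling step. This is a minimal NumPy illustration, not the trlx implementation; the function name `q_guided_logits` and its arguments are hypothetical:

```python
import numpy as np

def q_guided_logits(lm_logits, q_values, v_value, beta=1.0, top_k=20):
    """Sketch of one Q-guided sampling step (hypothetical helper):
    add beta * (Q - V) to the LM log-probabilities, then apply
    top-k filtering over the modified scores."""
    # log-probabilities of the base language model (log-softmax)
    log_probs = lm_logits - np.log(np.sum(np.exp(lm_logits)))
    # advantage bonus: tokens the Q-function prefers get upweighted
    advantage = q_values - v_value
    scores = log_probs + beta * advantage
    # top-k filtering AFTER the advantage modification
    if top_k < scores.size:
        cutoff = np.sort(scores)[-top_k]
        scores = np.where(scores >= cutoff, scores, -np.inf)
    # renormalize into a sampling distribution
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()
```

With `beta=0` this reduces to ordinary top-k sampling from the base model; increasing `beta` shifts probability mass toward high-advantage tokens.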
Usage
Use Q-value guided generation at inference time after ILQL training to generate text that is biased toward higher rewards. The generation method replaces HuggingFace's standard .generate() with a custom loop that incorporates Q-value information at each token step. Control the reward-fluency tradeoff via the beta parameter.
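A custom loop of the kind described can be sketched as follows. This is an illustrative stand-in, assuming a caller-supplied `step_fn` that returns the LM log-probabilities, Q-values, and value estimate for the next token; it is not trlx's actual generation code:

```python
import numpy as np

def generate_q_guided(step_fn, prompt_ids, max_new_tokens, beta=1.0, rng=None):
    """Hypothetical custom generation loop replacing .generate().

    step_fn maps the current token sequence to
    (lm_log_probs, q_values, v_value) for the next-token position;
    beta trades reward optimization against fluency."""
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        log_probs, q_values, v_value = step_fn(ids)
        # bias the LM distribution by the advantage bonus beta * (Q - V)
        scores = log_probs + beta * (q_values - v_value)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids
```

Because `beta` is applied at every token step, it can be tuned at inference time without retraining.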
Theoretical Basis
At each generation step, the modified sampling distribution is:

$$\tilde{\pi}(a_t \mid s_t) \propto \exp\big(\log \pi_{\text{LM}}(a_t \mid s_t) + \beta \, A(s_t, a_t)\big)$$

Where:
- $\log \pi_{\text{LM}}(a_t \mid s_t)$ is the base language model log-probability
- $Q(s_t, a_t)$ is the learned Q-value (from ILQL Q-heads)
- $V(s_t)$ is the learned value estimate (from ILQL V-head)
- $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is the advantage
- $\beta$ controls guidance strength
- Top-k filtering removes low-probability tokens after advantage modification

With double Q-learning (two_qs=True), the Q-value is the minimum of two Q-heads to reduce overestimation:

$$Q(s_t, a_t) = \min\big(Q_1(s_t, a_t), Q_2(s_t, a_t)\big)$$
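The conservative Q estimate under double Q-learning amounts to an elementwise minimum over the two heads' outputs; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def conservative_q(q1, q2):
    """With two Q-heads (two_qs=True), take the elementwise minimum
    of their estimates to reduce overestimation bias."""
    return np.minimum(q1, q2)
```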