Principle: CarperAI trlx Q-Value Guided Generation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Offline_RL, Text_Generation |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A generation principle that uses learned Q-value and value function estimates to guide autoregressive token sampling toward higher-reward sequences.
Description
After ILQL training, the model has learned Q-value heads (estimating action values) and value heads (estimating state values) alongside the standard language model logits. During generation, these heads are used to modify token probabilities: tokens with higher advantage (Q - V) are upweighted, biasing generation toward sequences that the Q-function predicts will receive higher rewards.
This is distinct from standard language model sampling, which uses only the base model logits. Q-value guided generation adds an "advantage bonus" to the log-probabilities before sampling, with a scaling parameter (beta) controlling the strength of guidance. Higher beta means stronger reward optimization but potentially less fluent text.
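The advantage-bonus modification described above can be sketched as a single sampling step. This is a minimal NumPy illustration, not the trlx implementation; the function name `q_guided_logits` and its arguments are hypothetical:

```python
import numpy as np

def q_guided_logits(lm_logits, q_values, v_value, beta=1.0, top_k=20):
    """Sketch of one Q-guided sampling step (hypothetical helper):
    add beta * (Q - V) to the LM log-probabilities, then apply
    top-k filtering over the modified scores."""
    # log-probabilities of the base language model (log-softmax)
    log_probs = lm_logits - np.log(np.sum(np.exp(lm_logits)))
    # advantage bonus: tokens the Q-function prefers get upweighted
    advantage = q_values - v_value
    scores = log_probs + beta * advantage
    # top-k filtering AFTER the advantage modification
    if top_k < scores.size:
        cutoff = np.sort(scores)[-top_k]
        scores = np.where(scores >= cutoff, scores, -np.inf)
    # renormalize into a sampling distribution
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()
```

With `beta=0` this reduces to ordinary top-k sampling from the base model; increasing `beta` shifts probability mass toward high-advantage tokens.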
Usage
Use Q-value guided generation at inference time after ILQL training to generate text that is biased toward higher rewards. The generation method replaces HuggingFace's standard .generate() with a custom loop that incorporates Q-value information at each token step. Control the reward-fluency tradeoff via the beta parameter.
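A custom loop of the kind described can be sketched as follows. This is an illustrative stand-in, assuming a caller-supplied `step_fn` that returns the LM log-probabilities, Q-values, and value estimate for the next token; it is not trlx's actual generation code:

```python
import numpy as np

def generate_q_guided(step_fn, prompt_ids, max_new_tokens, beta=1.0, rng=None):
    """Hypothetical custom generation loop replacing .generate().

    step_fn maps the current token sequence to
    (lm_log_probs, q_values, v_value) for the next-token position;
    beta trades reward optimization against fluency."""
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        log_probs, q_values, v_value = step_fn(ids)
        # bias the LM distribution by the advantage bonus beta * (Q - V)
        scores = log_probs + beta * (q_values - v_value)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids
```

Because `beta` is applied at every token step, it can be tuned at inference time without retraining.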
Theoretical Basis
At each generation step, the modified sampling distribution is:

$$\tilde{\pi}(a_t \mid s_t) \propto \exp\big(\log \pi_{\text{LM}}(a_t \mid s_t) + \beta \, A(s_t, a_t)\big)$$

Where:
- $\log \pi_{\text{LM}}(a_t \mid s_t)$ is the base language model log-probability
- $Q(s_t, a_t)$ is the learned Q-value (from ILQL Q-heads)
- $V(s_t)$ is the learned value estimate (from ILQL V-head)
- $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$ is the advantage
- $\beta$ controls guidance strength
- Top-k filtering removes low-probability tokens after advantage modification

With double Q-learning (two_qs=True), the Q-value is the minimum of two Q-heads to reduce overestimation:

$$Q(s_t, a_t) = \min\big(Q_1(s_t, a_t), Q_2(s_t, a_t)\big)$$
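The conservative Q estimate under double Q-learning amounts to an elementwise minimum over the two heads' outputs; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def conservative_q(q1, q2):
    """With two Q-heads (two_qs=True), take the elementwise minimum
    of their estimates to reduce overestimation bias."""
    return np.minimum(q1, q2)
```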