
Principle:CarperAI Trlx Q Value Guided Generation

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Offline_RL, Text_Generation
Last Updated 2026-02-07 16:00 GMT

Overview

A generation principle that uses learned Q-value and value function estimates to guide autoregressive token sampling toward higher-reward sequences.

Description

After ILQL training, the model has learned Q-value heads (estimating action values) and value heads (estimating state values) alongside the standard language model logits. During generation, these heads are used to modify token probabilities: tokens with higher advantage (Q - V) are upweighted, biasing generation toward sequences that the Q-function predicts will receive higher rewards.

This is distinct from standard language model sampling, which uses only the base model logits. Q-value guided generation adds an "advantage bonus" to the log-probabilities before sampling, with a scalar coefficient (beta) controlling the strength of the guidance. Higher beta means stronger reward optimization but potentially less fluent text.
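The advantage bonus described above can be illustrated with a minimal sketch. The function name `guided_logits` and the array-based interface are assumptions for illustration, not the trlx API; the arithmetic follows the description in this section.

```python
import numpy as np

def guided_logits(lm_logits, q_values, v_value, beta=1.0):
    """Add an advantage bonus beta * (Q - V) to the base LM logits.

    lm_logits: (vocab,) base language model logits for the next token
    q_values:  (vocab,) learned Q(s, a) for every candidate token a
    v_value:   scalar learned V(s) for the current prefix s
    beta:      guidance strength; beta = 0 recovers plain LM sampling
    """
    advantage = q_values - v_value          # A(s, a) = Q(s, a) - V(s)
    return lm_logits + beta * advantage     # higher-advantage tokens are upweighted
```

With beta set to zero the adjusted logits equal the base logits, so the guidance strength interpolates smoothly between ordinary sampling and reward-driven sampling.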

Usage

Use Q-value guided generation at inference time after ILQL training to generate text that is biased toward higher rewards. The generation method replaces HuggingFace's standard .generate() with a custom loop that incorporates Q-value information at each token step. Control the reward-fluency tradeoff via the beta parameter.
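A custom decoding loop of the kind described above can be sketched as follows. This is a simplified stand-in, not trlx's actual generation code: `model` is a hypothetical callable assumed to return the base logits, per-token Q-values, and the state value for the last position of the sequence.

```python
import numpy as np

def sample_guided(model, prompt_ids, max_new_tokens, beta=1.0, top_k=20, rng=None):
    """Autoregressive sampling loop that folds Q-value guidance into each step.

    model(ids) is assumed to return (lm_logits, q_values, v_value) for the
    last position; this interface is hypothetical.
    """
    rng = rng or np.random.default_rng(0)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        lm_logits, q, v = model(ids)
        scores = lm_logits + beta * (q - v)     # advantage-adjusted log-probs
        # Top-k filtering is applied AFTER the advantage modification
        if top_k < len(scores):
            kth = np.sort(scores)[-top_k]
            scores = np.where(scores >= kth, scores, -np.inf)
        probs = np.exp(scores - scores.max())   # numerically stable softmax
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids
```

Raising `beta` concentrates probability mass on tokens the Q-function favors; lowering it keeps generation closer to the base model's fluency.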

Theoretical Basis

At each generation step, the modified sampling distribution is:

$\pi_{\text{guided}}(a \mid s) \propto \operatorname{top-}k\!\left( \log \pi_\theta(a \mid s) + \beta \left( Q(s,a) - V(s) \right) \right)$

Where:

  • logπθ(a|s) is the base language model log-probability
  • Q(s,a) is the learned Q-value (from ILQL Q-heads)
  • V(s) is the learned value estimate (from ILQL V-head)
  • A(s,a) = Q(s,a) − V(s) is the advantage
  • β controls guidance strength
  • Top-k filtering removes low-probability tokens after advantage modification

With double Q-learning (two_qs=True), the Q-value is the minimum of two Q-heads to reduce overestimation:

$Q(s,a) = \min\left( Q_1(s,a),\, Q_2(s,a) \right)$
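The clipped double-Q rule above can be folded directly into the advantage computation. A minimal sketch, assuming the two Q-heads are available as arrays (`double_q_advantage` is a hypothetical helper name):

```python
import numpy as np

def double_q_advantage(q1, q2, v):
    """With two_qs=True, take the elementwise minimum of the two Q-heads
    (clipped double Q-learning) before computing the advantage."""
    q = np.minimum(q1, q2)   # pessimistic estimate curbs overestimation bias
    return q - v             # A(s, a) = min(Q1, Q2) - V(s)
```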

Related Pages

Implemented By
