
Principle:CarperAI Trlx ILQL Configuration

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Offline_RL, Configuration
Last Updated 2026-02-07 16:00 GMT

Overview

A configuration principle that defines the hyperparameters for Implicit Language Q-Learning, an offline reinforcement learning method for language model alignment.

Description

ILQL (Implicit Language Q-Learning) is an offline RL algorithm adapted from Implicit Q-Learning (IQL) for language model fine-tuning. Unlike PPO, which requires a live reward function and on-policy generation, ILQL learns from pre-collected datasets of text samples with associated reward labels. It fits Q-value and value function heads on top of the language model and uses expectile regression, Conservative Q-Learning (CQL), and AWAC-style advantage-weighted policy extraction.
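The head structure described above can be sketched in a few lines. This is a minimal illustration with NumPy, not trlx's actual implementation; the shapes and weight initializations are hypothetical, and the real heads are trained torch modules on the LM's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: per-token hidden states produced by a language model.
seq_len, d_model, vocab = 8, 16, 50
hidden = rng.normal(size=(seq_len, d_model))

# ILQL attaches small linear heads to the LM trunk: a Q-head scoring every
# candidate next token (action), and a scalar V-head per state.
W_q = rng.normal(size=(d_model, vocab)) * 0.02  # Q(s, a) for each token a
W_v = rng.normal(size=(d_model, 1)) * 0.02      # V(s)

q_values = hidden @ W_q       # shape (seq_len, vocab)
state_values = hidden @ W_v   # shape (seq_len, 1)
```

The base LM keeps producing logits as usual; the extra heads only read its hidden states, which is what lets ILQL train offline without generating new text.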

Configuring ILQL requires setting parameters that control the offline RL components: the expectile parameter (tau), discount factor (gamma), CQL and AWAC loss scales, Polyak averaging for target network synchronization (alpha), and advantage weighting strength (beta). These parameters directly influence how the model balances staying close to the data distribution versus optimizing for higher rewards.

Usage

Use ILQL configuration when you have a static dataset of text samples with scalar reward labels and want to fine-tune a language model to generate higher-reward text without needing a live reward function. ILQL is preferred over PPO when: (1) evaluating a reward function is expensive, (2) you have pre-collected preference data, or (3) you want a simpler training loop without on-policy generation.
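The input data for this setting is just paired texts and scalar rewards. A minimal sketch of such an offline dataset (the texts and reward values here are invented for illustration; the commented trlx call assumes trlx is installed):

```python
# A minimal offline dataset for ILQL: text samples paired with scalar rewards.
samples = [
    "The movie was fantastic and I loved every minute.",
    "The movie was dull and far too long.",
]
rewards = [1.0, -1.0]

# Sketch of handing this to trlx's training entry point (not run here):
# import trlx
# trainer = trlx.train(samples=samples, rewards=rewards)

assert len(samples) == len(rewards)
```

No reward model or environment is queried during training; the rewards are fixed labels attached to the dataset.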

Theoretical Basis

ILQL extends Implicit Q-Learning to autoregressive language models. The core components are:

Value function estimation via expectile regression:

L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[ L_2^\tau\big(Q_\theta(s,a) - V_\psi(s)\big) \right]

where L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2 is the asymmetric squared loss.
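The asymmetric loss is small enough to verify directly. A NumPy sketch of it:

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 from the expectile objective."""
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * u ** 2

# With tau = 0.7, positive errors (Q above V) are weighted 0.7 and negative
# errors 0.3, so minimizing this pushes V toward an upper expectile of Q.
losses = expectile_loss(np.array([-1.0, 1.0]), 0.7)  # -> [0.3, 0.7]
```

At tau = 0.5 this reduces to ordinary (scaled) squared error; tau closer to 1 makes V approximate a soft maximum over the Q-values seen in the data.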

Conservative Q-Learning regularization penalizes overestimation:

L_{\mathrm{CQL}} = \alpha_{\mathrm{CQL}}\, \mathbb{E}_{s \sim \mathcal{D}}\left[ \log \sum_a \exp\big(Q(s,a)\big) - \mathbb{E}_{a \sim \mathcal{D}}\big[Q(s,a)\big] \right]
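The CQL term can be sketched per batch with NumPy. This is an illustrative implementation of the penalty, not trlx's code; a numerically stable version would use a shifted log-sum-exp.

```python
import numpy as np

def cql_penalty(q_values, action_ids, cql_scale):
    """Conservative penalty: log-sum-exp of Q over all actions minus Q of the
    dataset action, averaged over the batch and scaled by cql_scale."""
    logsumexp = np.log(np.exp(q_values).sum(axis=-1))
    q_data = np.take_along_axis(q_values, action_ids[:, None], axis=-1)[:, 0]
    return cql_scale * float((logsumexp - q_data).mean())

# With uniform Q-values the penalty reduces to cql_scale * log(num_actions).
q = np.zeros((2, 4))          # batch of 2 states, 4 actions each
acts = np.array([0, 1])       # dataset actions
penalty = cql_penalty(q, acts, cql_scale=0.1)
```

The penalty pushes down Q-values of actions not observed in the data, which is what keeps the learned policy close to the dataset distribution.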

AWAC-style policy extraction uses advantages to weight the policy:

\pi(a \mid s) \propto \pi_\beta(a \mid s) \exp\left( \tfrac{1}{\beta}\big(Q(s,a) - V(s)\big) \right)
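In logit space this amounts to shifting the base model's next-token logits by the scaled advantage before sampling. A NumPy sketch of that reweighting (the token count and Q-values below are invented for illustration):

```python
import numpy as np

def advantage_weighted_probs(base_logits, q_values, v_value, beta):
    """Next-token distribution proportional to pi_beta(a|s) * exp((Q - V) / beta)."""
    shifted = base_logits + (q_values - v_value) / beta
    exp = np.exp(shifted - shifted.max())  # stable softmax
    return exp / exp.sum()

base = np.zeros(3)               # uniform base policy over 3 tokens
q = np.array([1.0, 0.0, -1.0])   # hypothetical Q-values
probs = advantage_weighted_probs(base, q, v_value=0.0, beta=1.0)
```

Smaller beta sharpens the distribution toward high-advantage tokens; very large beta recovers the base policy unchanged.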

Key configuration parameters:

  • tau → Expectile parameter (0 to 1, higher = more optimistic)
  • gamma → Discount factor for future rewards
  • cql_scale → Weight of the CQL regularization loss
  • awac_scale → Weight of the AWAC policy loss
  • alpha → Polyak averaging rate for target Q-network updates
  • beta → Advantage weighting temperature for generation
  • two_qs → Whether to use double Q-heads (reduces overestimation)
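Putting the parameters above together, an ILQL method block might look like the following. The values are illustrative examples, not verified trlx defaults; in trlx this dictionary corresponds to the `method` section of its YAML/`TRLConfig` configuration.

```python
# Illustrative ILQL hyperparameter block (example values, not verified defaults).
ilql_method_config = {
    "name": "ilqlconfig",
    "tau": 0.7,          # expectile; > 0.5 biases V toward high Q (optimistic)
    "gamma": 0.99,       # discount factor for future rewards
    "cql_scale": 0.1,    # weight of the conservative (CQL) regularizer
    "awac_scale": 1.0,   # weight of the advantage-weighted policy loss
    "alpha": 0.001,      # Polyak rate for target Q-network updates
    "beta": 1.0,         # advantage weighting temperature at generation time
    "two_qs": True,      # double Q-heads; take the min to curb overestimation
}
```

The conservatism knobs (tau, cql_scale, two_qs) and the exploitation knobs (beta, awac_scale) pull in opposite directions, which is the data-distribution vs. reward trade-off described earlier.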

Related Pages

Implemented By
