Principle: CarperAI trlx ILQL Configuration
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Offline_RL, Configuration |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A configuration principle that defines the hyperparameters for Implicit Language Q-Learning, an offline reinforcement learning method for language model alignment.
Description
ILQL (Implicit Language Q-Learning) is an offline RL algorithm adapted from Implicit Q-Learning (IQL) for language model fine-tuning. Unlike PPO, which requires a live reward function and on-policy generation, ILQL learns from pre-collected datasets of text samples with associated reward labels. It fits Q-value and value-function heads on top of the language model and combines expectile regression, Conservative Q-Learning (CQL) regularization, and AWAC-style advantage-weighted policy extraction.
Configuring ILQL requires setting parameters that control the offline RL components: the expectile parameter (tau), discount factor (gamma), CQL and AWAC loss scales, Polyak averaging for target network synchronization (alpha), and advantage weighting strength (beta). These parameters directly influence how the model balances staying close to the data distribution versus optimizing for higher rewards.
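The Polyak averaging mentioned above can be sketched in plain Python. The function name and list-of-floats representation are illustrative only; trlx operates on PyTorch tensors, and its exact target-sync convention should be checked against the source.

```python
def polyak_update(target_params, online_params, alpha=0.005):
    """Soft (Polyak) target update: target <- (1 - alpha) * target + alpha * online.

    A small alpha makes the target Q-head track the online head slowly,
    which stabilizes the TD targets used during training.
    """
    return [(1.0 - alpha) * t + alpha * o
            for t, o in zip(target_params, online_params)]
```

With alpha = 1.0 the target is overwritten by the online parameters each sync; with alpha near 0 it barely moves.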
Usage
Use ILQL configuration when you have a static dataset of text samples with scalar reward labels and want to fine-tune a language model to generate higher-reward text without needing a live reward function. ILQL is preferred over PPO when: (1) evaluating a reward function is expensive, (2) you have pre-collected preference data, or (3) you want a simpler training loop without on-policy generation.
Theoretical Basis
ILQL extends Implicit Q-Learning to autoregressive language models. The core components are:
Q-value estimation via expectile regression fits the value head against the target Q-head:

$$L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\,L_2^\tau\!\left(Q_{\hat\theta}(s,a) - V_\psi(s)\right)\right]$$

where $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ is the asymmetric loss. The Q-head itself is trained with a TD loss, $L_Q(\theta) = \mathbb{E}\left[\left(r + \gamma\,V_\psi(s') - Q_\theta(s,a)\right)^2\right]$, which is where the discount factor $\gamma$ enters.
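The asymmetric loss above can be sketched in plain Python. The function name and scalar interface are illustrative; in practice this is applied elementwise to tensors of Q minus V differences.

```python
def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 loss from expectile regression.

    Positive errors (diff > 0) are weighted by tau, negative errors
    by 1 - tau, so tau > 0.5 pushes the value estimate toward the
    upper (optimistic) end of the Q-value distribution.
    """
    weight = tau if diff > 0 else (1.0 - tau)
    return weight * diff ** 2
```

For tau = 0.5 this reduces to ordinary (scaled) squared error; as tau approaches 1, the fitted value approaches the maximum of the Q-values seen in the data.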
Conservative Q-Learning (CQL) regularization penalizes overestimation of actions outside the dataset:

$$L_{\text{CQL}}(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\left[\log\sum_{a}\exp Q_\theta(s,a) - \mathbb{E}_{a\sim\mathcal{D}}\left[Q_\theta(s,a)\right]\right]$$
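A minimal sketch of the CQL penalty for a single state, using a numerically stable log-sum-exp. The function name and list interface are hypothetical; in trlx the "actions" are vocabulary tokens and the computation is batched over tensors.

```python
import math

def cql_penalty(q_values, data_action):
    """CQL penalty for one state: logsumexp over all actions' Q-values
    minus the Q-value of the action actually observed in the dataset.

    The penalty is small when the dataset action already has the
    (near-)highest Q-value, and large when out-of-dataset actions
    are assigned inflated Q-values.
    """
    m = max(q_values)  # subtract the max for numerical stability
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[data_action]
```

In the config, `cql_scale` multiplies this term before it is added to the TD loss.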
AWAC-style policy extraction uses advantages $A(s,a) = Q(s,a) - V(s)$ to weight the policy's log-likelihood on dataset actions:

$$L_\pi(\phi) = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\exp\left(A(s,a)\right)\log\pi_\phi(a \mid s)\right]$$
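A per-token sketch of advantage weighting, with hypothetical function names and scalar inputs; the exponentiated advantage scales the token's negative log-likelihood, so high-advantage tokens dominate the policy gradient.

```python
import math

def awac_weight(q, v, temperature=1.0):
    """Advantage weight exp(A / temperature) with A = Q(s,a) - V(s)."""
    return math.exp((q - v) / temperature)

def awac_token_loss(log_prob, q, v, temperature=1.0):
    """Advantage-weighted negative log-likelihood for a single token."""
    return -awac_weight(q, v, temperature) * log_prob
```

Tokens with zero advantage get weight 1 (plain cross-entropy); tokens with negative advantage are down-weighted rather than explicitly penalized, which keeps the policy within the data distribution. `awac_scale` multiplies this loss in the overall objective.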
Key configuration parameters:
- tau → Expectile parameter (0 to 1, higher = more optimistic)
- gamma → Discount factor for future rewards
- cql_scale → Weight of the CQL regularization loss
- awac_scale → Weight of the AWAC policy loss
- alpha → Polyak averaging rate for target Q-network updates
- beta → Advantage weighting temperature for generation
- two_qs → Whether to use double Q-heads (reduces overestimation)
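Putting the parameters together, the method section of a trlx-style YAML config might look like the following. This is a sketch: the key names follow the parameter list above, but the values are illustrative, not verified trlx defaults; consult the example configs shipped in the trlx repository for current values.

```yaml
# Illustrative ILQL method section for a trlx config file.
# Values are examples only, not verified defaults.
method:
  name: ilqlconfig
  tau: 0.7          # expectile; > 0.5 skews value estimates optimistically
  gamma: 0.99       # discount factor in the TD loss
  cql_scale: 0.1    # weight on the conservative (CQL) regularizer
  awac_scale: 1.0   # weight on the advantage-weighted policy loss
  alpha: 0.001      # Polyak rate for target Q-head synchronization
  beta: 0.0         # advantage temperature applied at generation time
  two_qs: true      # double Q-heads; take the min to curb overestimation
```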