
Principle:CarperAI Trlx ILQL Configuration

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Offline_RL, Configuration
Last Updated 2026-02-07 16:00 GMT

Overview

A configuration principle that defines the hyperparameters for Implicit Language Q-Learning, an offline reinforcement learning method for language model alignment.

Description

ILQL (Implicit Language Q-Learning) is an offline RL algorithm adapted from Implicit Q-Learning (IQL) for language model fine-tuning. Unlike PPO, which requires a live reward function and on-policy generation, ILQL learns from pre-collected datasets of text samples with associated reward labels. It fits Q-value and value function heads on top of the language model and uses expectile regression, Conservative Q-Learning (CQL), and AWAC-style advantage-weighted policy extraction.
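The head structure described above can be sketched in a few lines. This is a minimal illustration with NumPy, not trlx's actual implementation; the shapes and weight initializations are hypothetical, and the real heads are trained torch modules on the LM's hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: per-token hidden states produced by a language model.
seq_len, d_model, vocab = 8, 16, 50
hidden = rng.normal(size=(seq_len, d_model))

# ILQL attaches small linear heads to the LM trunk: a Q-head scoring every
# candidate next token (action), and a scalar V-head per state.
W_q = rng.normal(size=(d_model, vocab)) * 0.02  # Q(s, a) for each token a
W_v = rng.normal(size=(d_model, 1)) * 0.02      # V(s)

q_values = hidden @ W_q       # shape (seq_len, vocab)
state_values = hidden @ W_v   # shape (seq_len, 1)
```

The base LM keeps producing logits as usual; the extra heads only read its hidden states, which is what lets ILQL train offline without generating new text.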

Configuring ILQL requires setting parameters that control the offline RL components: the expectile parameter (tau), discount factor (gamma), CQL and AWAC loss scales, Polyak averaging for target network synchronization (alpha), and advantage weighting strength (beta). These parameters directly influence how the model balances staying close to the data distribution versus optimizing for higher rewards.

Usage

Use ILQL configuration when you have a static dataset of text samples with scalar reward labels and want to fine-tune a language model to generate higher-reward text without needing a live reward function. ILQL is preferred over PPO when: (1) evaluating a reward function is expensive, (2) you have pre-collected preference data, or (3) you want a simpler training loop without on-policy generation.
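The input data for this setting is just paired texts and scalar rewards. A minimal sketch of such an offline dataset (the texts and reward values here are invented for illustration; the commented trlx call assumes trlx is installed):

```python
# A minimal offline dataset for ILQL: text samples paired with scalar rewards.
samples = [
    "The movie was fantastic and I loved every minute.",
    "The movie was dull and far too long.",
]
rewards = [1.0, -1.0]

# Sketch of handing this to trlx's training entry point (not run here):
# import trlx
# trainer = trlx.train(samples=samples, rewards=rewards)

assert len(samples) == len(rewards)
```

No reward model or environment is queried during training; the rewards are fixed labels attached to the dataset.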

Theoretical Basis

ILQL extends Implicit Q-Learning to autoregressive language models. The core components are:

Value function estimation via expectile regression:

L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[ L_2^\tau\big(Q_\theta(s,a) - V_\psi(s)\big) \right]

where L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\, u^2 is the asymmetric squared loss.
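The asymmetric loss is small enough to verify directly. A NumPy sketch of it:

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss |tau - 1(u < 0)| * u^2 from the expectile objective."""
    weight = np.abs(tau - (u < 0).astype(float))
    return weight * u ** 2

# With tau = 0.7, positive errors (Q above V) are weighted 0.7 and negative
# errors 0.3, so minimizing this pushes V toward an upper expectile of Q.
losses = expectile_loss(np.array([-1.0, 1.0]), 0.7)  # -> [0.3, 0.7]
```

At tau = 0.5 this reduces to ordinary (scaled) squared error; tau closer to 1 makes V approximate a soft maximum over the Q-values seen in the data.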

Conservative Q-Learning regularization penalizes overestimation:

L_{\mathrm{CQL}} = \alpha_{\mathrm{CQL}}\, \mathbb{E}_{s \sim \mathcal{D}}\left[ \log \sum_a \exp\big(Q(s,a)\big) - \mathbb{E}_{a \sim \mathcal{D}}\big[Q(s,a)\big] \right]
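The CQL term can be sketched per batch with NumPy. This is an illustrative implementation of the penalty, not trlx's code; a numerically stable version would use a shifted log-sum-exp.

```python
import numpy as np

def cql_penalty(q_values, action_ids, cql_scale):
    """Conservative penalty: log-sum-exp of Q over all actions minus Q of the
    dataset action, averaged over the batch and scaled by cql_scale."""
    logsumexp = np.log(np.exp(q_values).sum(axis=-1))
    q_data = np.take_along_axis(q_values, action_ids[:, None], axis=-1)[:, 0]
    return cql_scale * float((logsumexp - q_data).mean())

# With uniform Q-values the penalty reduces to cql_scale * log(num_actions).
q = np.zeros((2, 4))          # batch of 2 states, 4 actions each
acts = np.array([0, 1])       # dataset actions
penalty = cql_penalty(q, acts, cql_scale=0.1)
```

The penalty pushes down Q-values of actions not observed in the data, which is what keeps the learned policy close to the dataset distribution.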

AWAC-style policy extraction uses advantages to weight the policy:

\pi(a \mid s) \propto \pi_\beta(a \mid s) \exp\left( \tfrac{1}{\beta}\big(Q(s,a) - V(s)\big) \right)
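In logit space this amounts to shifting the base model's next-token logits by the scaled advantage before sampling. A NumPy sketch of that reweighting (the token count and Q-values below are invented for illustration):

```python
import numpy as np

def advantage_weighted_probs(base_logits, q_values, v_value, beta):
    """Next-token distribution proportional to pi_beta(a|s) * exp((Q - V) / beta)."""
    shifted = base_logits + (q_values - v_value) / beta
    exp = np.exp(shifted - shifted.max())  # stable softmax
    return exp / exp.sum()

base = np.zeros(3)               # uniform base policy over 3 tokens
q = np.array([1.0, 0.0, -1.0])   # hypothetical Q-values
probs = advantage_weighted_probs(base, q, v_value=0.0, beta=1.0)
```

Smaller beta sharpens the distribution toward high-advantage tokens; very large beta recovers the base policy unchanged.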

Key configuration parameters:

  • tau → Expectile parameter (0 to 1, higher = more optimistic)
  • gamma → Discount factor for future rewards
  • cql_scale → Weight of the CQL regularization loss
  • awac_scale → Weight of the AWAC policy loss
  • alpha → Polyak averaging rate for target Q-network updates
  • beta → Advantage weighting temperature for generation
  • two_qs → Whether to use double Q-heads (reduces overestimation)
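Putting the parameters above together, an ILQL method block might look like the following. The values are illustrative examples, not verified trlx defaults; in trlx this dictionary corresponds to the `method` section of its YAML/`TRLConfig` configuration.

```python
# Illustrative ILQL hyperparameter block (example values, not verified defaults).
ilql_method_config = {
    "name": "ilqlconfig",
    "tau": 0.7,          # expectile; > 0.5 biases V toward high Q (optimistic)
    "gamma": 0.99,       # discount factor for future rewards
    "cql_scale": 0.1,    # weight of the conservative (CQL) regularizer
    "awac_scale": 1.0,   # weight of the advantage-weighted policy loss
    "alpha": 0.001,      # Polyak rate for target Q-network updates
    "beta": 1.0,         # advantage weighting temperature at generation time
    "two_qs": True,      # double Q-heads; take the min to curb overestimation
}
```

The conservatism knobs (tau, cql_scale, two_qs) and the exploitation knobs (beta, awac_scale) pull in opposite directions, which is the data-distribution vs. reward trade-off described earlier.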

Related Pages

Implemented By
