Principle: CarperAI Trlx Offline RL Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Offline_RL, Training |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A training principle for optimizing language models via offline reinforcement learning with ILQL on pre-collected, reward-labeled datasets.
Description
Offline RL training learns a policy from a fixed dataset of text samples with associated reward labels, without generating new samples during training. ILQL (Implicit Language Q-Learning) fits Q-value and value function heads on top of the language model using the offline dataset, then uses Q-value guided sampling at inference time to generate higher-reward text.
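Below is a minimal, illustrative sketch of Q-value guided sampling: the base model's next-token logits are shifted by the learned advantage Q − V before sampling. The function name, tensor shapes, and the beta value are assumptions for illustration, not trlx's actual API.

```python
import torch

def q_guided_logits(lm_logits: torch.Tensor,
                    q_values: torch.Tensor,
                    state_value: torch.Tensor,
                    beta: float = 4.0) -> torch.Tensor:
    """Shift next-token logits by the learned advantage (ILQL-style decoding).

    lm_logits:   [batch, vocab] logits from the base language-model head
    q_values:    [batch, vocab] per-token Q estimates from the fitted Q head
    state_value: [batch, 1]     V estimate for the current prefix
    beta:        scale on the advantage (illustrative default)
    """
    advantage = q_values - state_value      # A(s, a) = Q(s, a) - V(s)
    return lm_logits + beta * advantage     # higher-advantage tokens become more likely

# Sampling (softmax + multinomial, or greedy argmax) then proceeds over the shifted logits.
```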
The key advantage of offline RL is that it does not require a live reward function during training. This is useful when the reward model is expensive to evaluate, when the training data comes from human annotations, or when a simpler training setup is preferred. The tradeoff is that the model cannot explore beyond the distribution of the training data.
Usage
Use offline RL training when you have a dataset of text samples with scalar reward labels and want to fine-tune a model to generate higher-reward text. Pass samples and rewards to trlx.train() with an ILQL configuration. The model learns to estimate Q-values from the data and uses them to guide generation at inference time.
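A minimal sketch of the call, assuming a recent trlx version; the config file path and the example data are placeholders, and keyword names can vary between releases.

```python
import trlx
from trlx.data.configs import TRLConfig

# Offline dataset: text samples paired with scalar reward labels.
samples = [
    "The movie was a delight from start to finish.",
    "I regret watching this film.",
]
rewards = [1.0, -1.0]

# Load an ILQL training configuration (path shown is an example; trlx ships
# ILQL example configs under its configs/ directory).
config = TRLConfig.load_yaml("configs/ilql_config.yml")

trainer = trlx.train(
    samples=samples,
    rewards=rewards,
    config=config,
    eval_prompts=["The movie was"],  # optional prompts for periodic eval generations
)
```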
Theoretical Basis
Offline RL training with ILQL combines three loss components (sketched below):
- Value loss via expectile regression
- Q-value loss with CQL regularization
- Policy loss (AWAC-style advantage weighting)
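The equations below give standard forms of these objectives from the ILQL literature; they are a sketch of the intended quantities, and trlx's exact per-token reductions and coefficients may differ.

```latex
% Notation: dataset D of token-level transitions (s, a, r, s'),
% expectile \tau, discount \gamma, CQL weight \alpha, AWAC temperature \beta.

% Value loss: expectile regression of V toward the (target) Q estimate.
\mathcal{L}_V = \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\, \big|\tau - \mathbb{1}\{\hat{Q}(s,a) < V_\psi(s)\}\big|
          \,\big(\hat{Q}(s,a) - V_\psi(s)\big)^2 \Big]

% Q-value loss: TD regression to r + \gamma V(s'), plus a CQL term that
% pushes down Q on tokens not taken in the dataset.
\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}
  \Big[\big(r + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\Big]
  + \alpha\, \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\log\!\sum_{a'} e^{Q_\theta(s,a')} - Q_\theta(s,a)\Big]

% Policy loss: advantage-weighted maximum likelihood (AWAC-style).
\mathcal{L}_\pi = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\exp\!\big(\beta\,(\hat{Q}(s,a) - V_\psi(s))\big)\,\log \pi_\phi(a \mid s)\Big]
```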
Training iterates over the offline dataset in batches, computing all three losses jointly. For stability, target Q-networks are updated via Polyak averaging.
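A minimal sketch of the Polyak (soft) target update, assuming separate online and target Q heads; the coefficient and sync schedule are illustrative, since trlx exposes its own settings, including interval-based syncing.

```python
import torch

@torch.no_grad()
def polyak_update(online: torch.nn.Module, target: torch.nn.Module, alpha: float = 0.005) -> None:
    """Soft-update target Q-head parameters toward the online Q-head:
    theta_target <- alpha * theta_online + (1 - alpha) * theta_target
    """
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - alpha).add_(alpha * p_online)
```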