
Principle:CarperAI Trlx Offline RL Training

From Leeroopedia


Knowledge Sources
Domains Reinforcement_Learning, Offline_RL, Training
Last Updated 2026-02-07 16:00 GMT

Overview

A training principle for optimizing language models via offline reinforcement learning using ILQL from pre-collected reward-labeled datasets.

Description

Offline RL training learns a policy from a fixed dataset of text samples with associated reward labels, without generating new samples during training. ILQL (Implicit Language Q-Learning) fits Q-value and value function heads on top of the language model using the offline dataset, then uses Q-value guided sampling at inference time to generate higher-reward text.

The key advantage of offline RL is that it does not require a live reward function during training. This is useful when: the reward model is expensive to evaluate, the training data was collected from human annotations, or you want a simpler training setup. The tradeoff is that the model cannot explore beyond the distribution of the training data.
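The Q-value guided sampling step can be sketched in a few lines. This is an illustrative, pure-Python sketch of the idea (perturbing language-model logits by the learned advantage, scaled by a hypothetical coefficient `beta`), not trlx's actual implementation:

```python
import math

def ilql_adjusted_logits(lm_logits, q_values, v_value, beta=1.0):
    """Shift each token's logit by beta * (Q - V): tokens whose learned
    Q-value exceeds the state value V get boosted, steering generation
    toward higher-reward text. Illustrative sketch only."""
    return [logit + beta * (q - v_value) for logit, q in zip(lm_logits, q_values)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Two candidate tokens with equal LM probability; token 1 has a higher
# (hypothetical) learned Q-value, so sampling now favors it.
lm_logits = [0.0, 0.0]
q_values = [0.2, 1.2]
probs = softmax(ilql_adjusted_logits(lm_logits, q_values, v_value=0.5, beta=2.0))
```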

Usage

Use offline RL training when you have a dataset of text samples with scalar reward labels and want to fine-tune a model to generate higher-reward text. Pass samples and rewards to trlx.train() with an ILQL configuration. The model learns to estimate Q-values from the data and uses them to guide generation at inference time.
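A minimal sketch of the call pattern follows. The toy samples and rewards are invented for illustration, and the exact config helpers and argument names should be checked against the trlx version you have installed:

```python
import trlx
from trlx.data.default_configs import default_ilql_config

# Pre-collected text samples with scalar reward labels (toy data).
samples = ["The movie was great", "The movie was terrible"]
rewards = [1.0, -1.0]

config = default_ilql_config()
config.train.batch_size = 2  # shrunk for the toy dataset

# trlx fits the Q-value and value heads on the fixed dataset;
# no new samples are generated during training.
trainer = trlx.train(samples=samples, rewards=rewards, config=config)
```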

Theoretical Basis

Offline RL training with ILQL involves three loss components:

Value loss via expectile regression:

$$L_V = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[L_2^{\tau}\big(Q(s,a) - V(s)\big)\right]$$

Q-value loss with CQL regularization:

$$L_Q = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[\big(Q(s,a) - (r + \gamma V(s'))\big)^2\right] + \alpha_{\mathrm{CQL}}\, L_{\mathrm{CQL}}$$

Policy loss (AWAC-style advantage-weighted):

$$L_\pi = -\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\exp\big(\beta A(s,a)\big)\log\pi(a \mid s)\right]$$

where $A(s,a) = Q(s,a) - V(s)$ is the advantage.
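The three losses above can be sketched on scalar toy values. Everything here is illustrative (the helper names, the per-sample `cql` regularizer term, and the τ, γ, β, α_CQL values are assumptions), not trlx's tensorized implementation:

```python
import math

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared error L2^tau: positive errors weighted by tau,
    # negative errors by (1 - tau).
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

def ilql_losses(batch, tau=0.7, gamma=0.99, beta=1.0, alpha_cql=0.1):
    """Joint ILQL objective over (q, v, r, v_next, log_prob, cql) tuples."""
    l_v = l_q = l_pi = 0.0
    for q, v, r, v_next, logp, cql in batch:
        l_v += expectile_loss(q - v, tau)                       # value loss
        l_q += (q - (r + gamma * v_next)) ** 2 + alpha_cql * cql  # Q loss + CQL
        adv = q - v                                             # A = Q - V
        l_pi += -math.exp(beta * adv) * logp                    # AWAC policy loss
    n = len(batch)
    return l_v / n, l_q / n, l_pi / n

# Toy batch with one (q, v, r, v_next, log_prob, cql_term) sample.
l_v, l_q, l_pi = ilql_losses([(1.0, 0.5, 1.0, 0.0, -0.5, 0.2)])
```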

Training iterates over the offline dataset in batches, computing all three losses jointly. Target Q-networks are updated via Polyak averaging for stability.
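Polyak averaging itself is a one-line soft update. A minimal sketch, with parameters represented as flat lists and an illustrative step size (not trlx's default):

```python
def polyak_update(target_params, online_params, alpha=0.005):
    """Soft-update target Q-network parameters toward the online network:
    target <- (1 - alpha) * target + alpha * online, element-wise."""
    return [(1 - alpha) * t + alpha * o
            for t, o in zip(target_params, online_params)]

# With alpha=0.5 the target moves halfway toward the online parameters.
target = polyak_update([0.0, 1.0], [1.0, 1.0], alpha=0.5)
```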

Related Pages

Implemented By
