Principle: CarperAI Trlx Offline RL Training
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Offline_RL, Training |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
A training principle for optimizing language models via offline reinforcement learning with ILQL on pre-collected, reward-labeled datasets.
Description
Offline RL training learns a policy from a fixed dataset of text samples with associated reward labels, without generating new samples during training. ILQL (Implicit Language Q-Learning) fits Q-value and value function heads on top of the language model using the offline dataset, then uses Q-value guided sampling at inference time to generate higher-reward text.
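Below is a minimal, illustrative sketch of Q-value guided sampling: the base model's next-token logits are shifted by the learned advantage Q − V before sampling. The function name, tensor shapes, and the beta value are assumptions for illustration, not trlx's actual API.

```python
import torch

def q_guided_logits(lm_logits: torch.Tensor,
                    q_values: torch.Tensor,
                    state_value: torch.Tensor,
                    beta: float = 4.0) -> torch.Tensor:
    """Shift next-token logits by the learned advantage (ILQL-style decoding).

    lm_logits:   [batch, vocab] logits from the base language-model head
    q_values:    [batch, vocab] per-token Q estimates from the fitted Q head
    state_value: [batch, 1]     V estimate for the current prefix
    beta:        scale on the advantage (illustrative default)
    """
    advantage = q_values - state_value      # A(s, a) = Q(s, a) - V(s)
    return lm_logits + beta * advantage     # higher-advantage tokens become more likely

# Sampling (softmax + multinomial, or greedy argmax) then proceeds over the shifted logits.
```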
The key advantage of offline RL is that it does not require a live reward function during training. This is useful when the reward model is expensive to evaluate, when the training data comes from human annotations, or when a simpler training setup is preferred. The tradeoff is that the model cannot explore beyond the distribution of the training data.
Usage
Use offline RL training when you have a dataset of text samples with scalar reward labels and want to fine-tune a model to generate higher-reward text. Pass samples and rewards to trlx.train() with an ILQL configuration. The model learns to estimate Q-values from the data and uses them to guide generation at inference time.
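A minimal sketch of the call, assuming a recent trlx version; the config file path and the example data are placeholders, and keyword names can vary between releases.

```python
import trlx
from trlx.data.configs import TRLConfig

# Offline dataset: text samples paired with scalar reward labels.
samples = [
    "The movie was a delight from start to finish.",
    "I regret watching this film.",
]
rewards = [1.0, -1.0]

# Load an ILQL training configuration (path shown is an example; trlx ships
# ILQL example configs under its configs/ directory).
config = TRLConfig.load_yaml("configs/ilql_config.yml")

trainer = trlx.train(
    samples=samples,
    rewards=rewards,
    config=config,
    eval_prompts=["The movie was"],  # optional prompts for periodic eval generations
)
```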
Theoretical Basis
Offline RL training with ILQL combines three loss components (sketched below):
- Value loss via expectile regression
- Q-value loss with CQL regularization
- Policy loss (AWAC-style advantage weighting)
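The equations below give standard forms of these objectives from the ILQL literature; they are a sketch of the intended quantities, and trlx's exact per-token reductions and coefficients may differ.

```latex
% Notation: dataset D of token-level transitions (s, a, r, s'),
% expectile \tau, discount \gamma, CQL weight \alpha, AWAC temperature \beta.

% Value loss: expectile regression of V toward the (target) Q estimate.
\mathcal{L}_V = \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\, \big|\tau - \mathbb{1}\{\hat{Q}(s,a) < V_\psi(s)\}\big|
          \,\big(\hat{Q}(s,a) - V_\psi(s)\big)^2 \Big]

% Q-value loss: TD regression to r + \gamma V(s'), plus a CQL term that
% pushes down Q on tokens not taken in the dataset.
\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}
  \Big[\big(r + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2\Big]
  + \alpha\, \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\log\!\sum_{a'} e^{Q_\theta(s,a')} - Q_\theta(s,a)\Big]

% Policy loss: advantage-weighted maximum likelihood (AWAC-style).
\mathcal{L}_\pi = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\exp\!\big(\beta\,(\hat{Q}(s,a) - V_\psi(s))\big)\,\log \pi_\phi(a \mid s)\Big]
```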
Training iterates over the offline dataset in batches, computing all three losses jointly. For stability, target Q-networks are updated via Polyak averaging.
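A minimal sketch of the Polyak (soft) target update, assuming separate online and target Q heads; the coefficient and sync schedule are illustrative, since trlx exposes its own settings, including interval-based syncing.

```python
import torch

@torch.no_grad()
def polyak_update(online: torch.nn.Module, target: torch.nn.Module, alpha: float = 0.005) -> None:
    """Soft-update target Q-head parameters toward the online Q-head:
    theta_target <- alpha * theta_online + (1 - alpha) * theta_target
    """
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - alpha).add_(alpha * p_online)
```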