Workflow: CarperAI trlx ILQL Offline Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Offline_RL |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
End-to-end process for offline reinforcement learning fine-tuning of language models using Implicit Language Q-Learning (ILQL) from a pre-collected reward-labeled dataset.
Description
This workflow trains a language model to generate high-reward text using offline RL, without needing a live reward function at training time. Instead of generating new samples and scoring them online (as PPO does), ILQL learns from a fixed dataset of text samples paired with scalar reward labels. The method trains Q-value and value heads on top of the frozen or fine-tuned language model, using Conservative Q-Learning (CQL) regularization and Advantage Weighted Actor-Critic (AWAC) objectives. At inference time, the Q-values guide generation toward high-reward sequences via modified sampling.
Usage
Execute this workflow when you have a dataset of text samples with pre-computed reward labels (e.g., human preference scores, classifier outputs) and want to train a model offline without an active reward function. This is appropriate when reward computation is expensive, when you have existing preference data, or when you want to avoid the instabilities of online RL training.
Execution Steps
Step 1: Configure training
Set up the training configuration by loading a default ILQL config and optionally overriding hyperparameters. The configuration specifies the base model, tokenizer, optimizer, scheduler, and ILQL-specific parameters: tau (the expectile used in the value-loss regression), gamma (discount factor), cql_scale (weight of the conservative Q-learning regularization), awac_scale (weight of the AWAC weighted cross-entropy term), alpha (soft-update rate for synchronizing the target Q-heads), and two_qs (whether to use two Q-heads for double Q-learning).
Key considerations:
- ILQL benefits from unfreezing all layers (num_layers_unfrozen = -1)
- The two_qs option enables double Q-learning for more stable value estimates
- beta (set in gen_kwargs) controls the trade-off between reward exploitation and diversity at inference time; tau is a training-time expectile, not a sampling temperature
- gamma should typically be close to 1.0 for language generation tasks
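The configuration step might look like the following sketch, which assumes trlx's `default_ilql_config()` helper and the field names of its `TRLConfig`/`ILQLConfig` dataclasses; exact names and defaults can differ across trlx releases.

```python
from trlx.data.default_configs import default_ilql_config

# Start from the library's default ILQL configuration, then override fields.
config = default_ilql_config()

config.model.model_path = "gpt2"       # base model (illustrative choice)
config.model.num_layers_unfrozen = -1  # unfreeze all layers, as recommended for ILQL

# ILQL-specific hyperparameters (see the considerations above)
config.method.tau = 0.7         # expectile for the value-loss regression
config.method.gamma = 0.99      # discount factor, close to 1.0 for text
config.method.cql_scale = 0.1   # weight of the conservative Q-learning penalty
config.method.awac_scale = 1.0  # weight of the AWAC weighted cross-entropy term
config.method.two_qs = True     # double Q-learning for more stable value estimates
```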
Step 2: Prepare reward-labeled dataset
Load or construct a dataset consisting of text samples paired with scalar reward values. Each sample is a string (or prompt-completion pair) and each reward is a float. The samples and rewards are passed directly to the training API. For preference data, map chosen/rejected pairs to positive/negative reward values.
Key considerations:
- Ensure samples and rewards lists are the same length
- For preference datasets, a common mapping is chosen=1, rejected=-1
- Prompt-completion pairs can be passed as nested lists: [[prompt, completion], ...]
- Data quality directly impacts offline RL performance since there is no online correction
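The chosen/rejected mapping described above can be sketched in plain Python; the triple layout of the preference records is illustrative.

```python
def preference_pairs_to_ilql_data(pairs):
    """Map (prompt, chosen, rejected) triples to samples/rewards lists.

    Each preference pair yields two training examples: the chosen completion
    with reward +1 and the rejected completion with reward -1. Samples are
    nested [prompt, completion] lists, a format the training API accepts.
    """
    samples, rewards = [], []
    for prompt, chosen, rejected in pairs:
        samples.append([prompt, chosen])
        rewards.append(1.0)
        samples.append([prompt, rejected])
        rewards.append(-1.0)
    assert len(samples) == len(rewards)  # lists must be the same length
    return samples, rewards
```

For example, `preference_pairs_to_ilql_data([("Q: 2+2?", " 4", " 5")])` produces two samples with rewards `[1.0, -1.0]`.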
Step 3: Define evaluation metrics
Optionally define a metric function that evaluates generated samples during training. Unlike PPO's reward function, the metric function never enters the loss; it exists purely for monitoring. It takes generated samples and returns a dictionary mapping metric names to lists of scalar values, one per sample.
Key considerations:
- The metric function runs on evaluation prompts at eval_interval steps
- Use it to track reward quality, diversity, or task-specific metrics
- The metric function can use a reward model or classifier for scoring
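A minimal monitoring-only metric function might look like this toy sketch; the exact keyword arguments trlx passes (e.g. prompts and outputs alongside samples) vary by version, so the signature below accepts extras via `**kwargs`.

```python
def metric_fn(samples, **kwargs):
    """Return monitoring metrics over generated eval samples.

    The expected return shape is a dict mapping metric names to one scalar
    per sample. A real implementation might score samples with a reward
    model or classifier; this toy version tracks length and distinct-word
    ratio as cheap proxies for verbosity and diversity.
    """
    lengths = [float(len(s.split())) for s in samples]
    distinct = [len(set(s.split())) / max(len(s.split()), 1) for s in samples]
    return {"length": lengths, "distinct_words": distinct}
```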
Step 4: Launch ILQL training
Call trlx.train() with the samples, rewards, evaluation prompts, optional metric function, and configuration. This dispatches to the AccelerateILQLTrainer, which loads the dataset into an offline pipeline, initializes Q-value and value heads, and runs the ILQL training loop. The trainer computes Q-learning losses on the fixed dataset and updates the model to maximize expected returns.
Key considerations:
- ILQL training processes the fixed dataset in epochs rather than generating new rollouts
- The Q-heads and V-head are small networks added on top of the transformer's hidden states
- steps_for_target_q_sync controls how often target Q-networks are updated
- Training is more stable than PPO since it operates on fixed data
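Putting the pieces together, the launch might be wrapped as below. This is a sketch of the top-level `trlx.train()` call, with trlx imported lazily inside the function so the surrounding module loads even where trlx is not installed; argument names follow the trlx API described above.

```python
def run_ilql_training(samples, rewards, eval_prompts, metric_fn=None):
    """Launch offline ILQL training via the top-level trlx API.

    trlx.train() dispatches to AccelerateILQLTrainer when the config's
    trainer is the ILQL trainer and rewards accompany the samples; the
    trainer builds the offline pipeline and runs the Q-learning loop.
    """
    import trlx
    from trlx.data.default_configs import default_ilql_config

    trainer = trlx.train(
        samples=samples,            # reward-labeled text or [prompt, completion] pairs
        rewards=rewards,            # one scalar per sample
        eval_prompts=eval_prompts,  # prompts generated from at eval_interval steps
        metric_fn=metric_fn,        # monitoring only; never enters the loss
        config=default_ilql_config(),
    )
    return trainer
```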
Step 5: Generate and evaluate
After training, use the trained model to generate text. ILQL modifies the sampling process at inference time using the learned Q-values: the beta parameter in gen_kwargs controls how strongly Q-values influence token selection. Higher beta values produce more reward-optimized but less diverse outputs.
Key considerations:
- The beta parameter in gen_kwargs controls Q-value influence during generation
- Multiple beta values can be passed as a list for comparative evaluation
- The model can be saved in HuggingFace format with trainer.save_pretrained()
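The effect of beta can be illustrated with the perturbation from the ILQL paper, where next-token logits are shifted by beta * (Q - V). This standalone sketch shows the idea and does not reproduce trlx's exact generation code.

```python
import math

def q_guided_probs(logits, q_values, v_value, beta):
    """Softmax over logits shifted by beta * (Q(s, a) - V(s)).

    Larger beta pushes probability mass toward tokens the Q-heads judge to
    lead to higher reward; beta = 0 recovers the base LM distribution.
    """
    shifted = [l + beta * (q - v_value) for l, q in zip(logits, q_values)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    z = sum(exps)
    return [e / z for e in exps]
```

With equal logits and beta = 0 the distribution is uniform; raising beta increases the probability of the token with the highest advantage Q - V, which is why large beta trades diversity for reward.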