Workflow: CarperAI trlx ILQL Offline Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reinforcement_Learning, Offline_RL |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
End-to-end process for offline reinforcement learning fine-tuning of language models using Implicit Language Q-Learning (ILQL) from a pre-collected reward-labeled dataset.
Description
This workflow trains a language model to generate high-reward text using offline RL, without needing a live reward function at training time. Instead of generating new samples and scoring them online (as PPO does), ILQL learns from a fixed dataset of text samples paired with scalar reward labels. The method trains Q-value and value heads on top of the frozen or fine-tuned language model, using Conservative Q-Learning (CQL) regularization and Advantage Weighted Actor-Critic (AWAC) objectives. At inference time, the Q-values guide generation toward high-reward sequences via modified sampling.
Usage
Execute this workflow when you have a dataset of text samples with pre-computed reward labels (e.g., human preference scores, classifier outputs) and want to train a model offline without an active reward function. This is appropriate when reward computation is expensive, when you have existing preference data, or when you want to avoid the instabilities of online RL training.
Execution Steps
Step 1: Configure training
Set up the training configuration by loading a default ILQL config and optionally overriding hyperparameters. The configuration specifies the base model, tokenizer, optimizer, scheduler, and ILQL-specific parameters: tau (the expectile used in the value-loss regression), gamma (discount factor), cql_scale (weight of the conservative Q-learning regularization), awac_scale (weight of the AWAC weighted cross-entropy term), alpha (soft-update rate for synchronizing the target Q-heads), and two_qs (whether to use two Q-heads for double Q-learning).
Key considerations:
- ILQL benefits from unfreezing all layers (num_layers_unfrozen = -1)
- The two_qs option enables double Q-learning for more stable value estimates
- beta (set in gen_kwargs) controls the trade-off between reward exploitation and diversity at inference time; tau is a training-time expectile, not a sampling temperature
- gamma should typically be close to 1.0 for language generation tasks
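The configuration step might look like the following sketch, which assumes trlx's `default_ilql_config()` helper and the field names of its `TRLConfig`/`ILQLConfig` dataclasses; exact names and defaults can differ across trlx releases.

```python
from trlx.data.default_configs import default_ilql_config

# Start from the library's default ILQL configuration, then override fields.
config = default_ilql_config()

config.model.model_path = "gpt2"       # base model (illustrative choice)
config.model.num_layers_unfrozen = -1  # unfreeze all layers, as recommended for ILQL

# ILQL-specific hyperparameters (see the considerations above)
config.method.tau = 0.7         # expectile for the value-loss regression
config.method.gamma = 0.99      # discount factor, close to 1.0 for text
config.method.cql_scale = 0.1   # weight of the conservative Q-learning penalty
config.method.awac_scale = 1.0  # weight of the AWAC weighted cross-entropy term
config.method.two_qs = True     # double Q-learning for more stable value estimates
```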
Step 2: Prepare reward-labeled dataset
Load or construct a dataset consisting of text samples paired with scalar reward values. Each sample is a string (or prompt-completion pair) and each reward is a float. The samples and rewards are passed directly to the training API. For preference data, map chosen/rejected pairs to positive/negative reward values.
Key considerations:
- Ensure samples and rewards lists are the same length
- For preference datasets, a common mapping is chosen=1, rejected=-1
- Prompt-completion pairs can be passed as nested lists: [[prompt, completion], ...]
- Data quality directly impacts offline RL performance since there is no online correction
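The chosen/rejected mapping described above can be sketched in plain Python; the triple layout of the preference records is illustrative.

```python
def preference_pairs_to_ilql_data(pairs):
    """Map (prompt, chosen, rejected) triples to samples/rewards lists.

    Each preference pair yields two training examples: the chosen completion
    with reward +1 and the rejected completion with reward -1. Samples are
    nested [prompt, completion] lists, a format the training API accepts.
    """
    samples, rewards = [], []
    for prompt, chosen, rejected in pairs:
        samples.append([prompt, chosen])
        rewards.append(1.0)
        samples.append([prompt, rejected])
        rewards.append(-1.0)
    assert len(samples) == len(rewards)  # lists must be the same length
    return samples, rewards
```

For example, `preference_pairs_to_ilql_data([("Q: 2+2?", " 4", " 5")])` produces two samples with rewards `[1.0, -1.0]`.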
Step 3: Define evaluation metrics
Optionally define a metric function that evaluates generated samples during training. Unlike PPO's reward function, the metric function never enters the loss; it exists purely for monitoring. It takes generated samples and returns a dictionary mapping metric names to lists of scalar values, one per sample.
Key considerations:
- The metric function runs on evaluation prompts at eval_interval steps
- Use it to track reward quality, diversity, or task-specific metrics
- The metric function can use a reward model or classifier for scoring
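A minimal monitoring-only metric function might look like this toy sketch; the exact keyword arguments trlx passes (e.g. prompts and outputs alongside samples) vary by version, so the signature below accepts extras via `**kwargs`.

```python
def metric_fn(samples, **kwargs):
    """Return monitoring metrics over generated eval samples.

    The expected return shape is a dict mapping metric names to one scalar
    per sample. A real implementation might score samples with a reward
    model or classifier; this toy version tracks length and distinct-word
    ratio as cheap proxies for verbosity and diversity.
    """
    lengths = [float(len(s.split())) for s in samples]
    distinct = [len(set(s.split())) / max(len(s.split()), 1) for s in samples]
    return {"length": lengths, "distinct_words": distinct}
```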
Step 4: Launch ILQL training
Call trlx.train() with the samples, rewards, evaluation prompts, optional metric function, and configuration. This dispatches to the AccelerateILQLTrainer, which loads the dataset into an offline pipeline, initializes Q-value and value heads, and runs the ILQL training loop. The trainer computes Q-learning losses on the fixed dataset and updates the model to maximize expected returns.
Key considerations:
- ILQL training processes the fixed dataset in epochs rather than generating new rollouts
- The Q-heads and V-head are small networks added on top of the transformer's hidden states
- steps_for_target_q_sync controls how often target Q-networks are updated
- Training is more stable than PPO since it operates on fixed data
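Putting the pieces together, the launch might be wrapped as below. This is a sketch of the top-level `trlx.train()` call, with trlx imported lazily inside the function so the surrounding module loads even where trlx is not installed; argument names follow the trlx API described above.

```python
def run_ilql_training(samples, rewards, eval_prompts, metric_fn=None):
    """Launch offline ILQL training via the top-level trlx API.

    trlx.train() dispatches to AccelerateILQLTrainer when the config's
    trainer is the ILQL trainer and rewards accompany the samples; the
    trainer builds the offline pipeline and runs the Q-learning loop.
    """
    import trlx
    from trlx.data.default_configs import default_ilql_config

    trainer = trlx.train(
        samples=samples,            # reward-labeled text or [prompt, completion] pairs
        rewards=rewards,            # one scalar per sample
        eval_prompts=eval_prompts,  # prompts generated from at eval_interval steps
        metric_fn=metric_fn,        # monitoring only; never enters the loss
        config=default_ilql_config(),
    )
    return trainer
```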
Step 5: Generate and evaluate
After training, use the trained model to generate text. ILQL modifies the sampling process at inference time using the learned Q-values: the beta parameter in gen_kwargs controls how strongly Q-values influence token selection. Higher beta values produce more reward-optimized but less diverse outputs.
Key considerations:
- The beta parameter in gen_kwargs controls Q-value influence during generation
- Multiple beta values can be passed as a list for comparative evaluation
- The model can be saved in HuggingFace format with trainer.save_pretrained()
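The effect of beta can be illustrated with the perturbation from the ILQL paper, where next-token logits are shifted by beta * (Q - V). This standalone sketch shows the idea and does not reproduce trlx's exact generation code.

```python
import math

def q_guided_probs(logits, q_values, v_value, beta):
    """Softmax over logits shifted by beta * (Q(s, a) - V(s)).

    Larger beta pushes probability mass toward tokens the Q-heads judge to
    lead to higher reward; beta = 0 recovers the base LM distribution.
    """
    shifted = [l + beta * (q - v_value) for l, q in zip(logits, q_values)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    z = sum(exps)
    return [e / z for e in exps]
```

With equal logits and beta = 0 the distribution is uniform; raising beta increases the probability of the token with the highest advantage Q - V, which is why large beta trades diversity for reward.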