
Principle:CarperAI Trlx Prompt Preparation

From Leeroopedia


Knowledge Sources
Domains: Data_Pipeline, NLP, Tokenization
Last Updated: 2026-02-07 16:00 GMT

Overview

A data pipeline principle for tokenizing and batching text prompts to serve as inputs for language model generation during RL training.

Description

In online RL training, the language model needs a stream of prompts to generate completions from. The prompt preparation pipeline converts raw text prompts into tokenized, padded batches suitable for model input. This involves truncation to a maximum prompt length (computed as seq_length - max_new_tokens), attention mask creation, and optional metadata passthrough for reward functions.

The pipeline must handle two input formats: simple string lists and dictionary lists with additional metadata (e.g., reference outputs for delta reward computation). It produces a DataLoader that yields batches of tokenized prompts during training.
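The two input formats can be normalized into one shape before tokenization. The sketch below is illustrative, not trlx's actual code; the function name and the error handling are assumptions:

```python
def normalize_prompts(prompts):
    """Convert string or dict prompts into a uniform dict format.

    Strings become {"prompt": text}; dicts must contain a "prompt" key and
    may carry extra metadata keys (e.g. reference outputs for delta reward
    computation) that pass through untouched to the reward function.
    """
    normalized = []
    for p in prompts:
        if isinstance(p, str):
            normalized.append({"prompt": p})
        elif isinstance(p, dict):
            if "prompt" not in p:
                raise ValueError("dict prompts must include a 'prompt' key")
            normalized.append(dict(p))  # shallow copy; metadata preserved
        else:
            raise TypeError(f"unsupported prompt type: {type(p).__name__}")
    return normalized
```

Downstream code can then tokenize the "prompt" field uniformly and forward any remaining keys alongside the generated samples.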

Usage

Use prompt preparation when setting up any trlx training that generates text from prompts: PPO training with a reward_fn, or evaluation prompt pipelines for ILQL/SFT. Prompt preparation is handled automatically by trlx.train(), but it can be customized by understanding the PromptPipeline interface.

Theoretical Basis

Prompt preparation transforms raw text into model-consumable format:

Pseudo-code:

# Abstract algorithm (not real implementation)
max_prompt_length = seq_length - max_new_tokens
tokenized = tokenizer(prompts, truncation=True, max_length=max_prompt_length)
batched = DataLoader(tokenized, batch_size=batch_size, collate_fn=pad_collate)
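The budget arithmetic and truncation step above can be made concrete with a toy whitespace "tokenizer". This is a minimal runnable sketch, not trlx's implementation, and the function name is hypothetical:

```python
def prepare_prompts(prompts, seq_length, max_new_tokens):
    """Tokenize (toy whitespace split) and right-truncate prompts.

    max_prompt_length reserves room for generation: any prompt longer than
    seq_length - max_new_tokens tokens is cut from the right, keeping the
    beginning of the prompt.
    """
    max_prompt_length = seq_length - max_new_tokens
    return [p.split()[:max_prompt_length] for p in prompts]
```

With seq_length=6 and max_new_tokens=3, a five-word prompt is truncated to its first three tokens, leaving exactly three positions for generated tokens.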

Key considerations:

  • Truncation: Prompts exceeding max length are truncated (right-side by default)
  • Padding: Shorter prompts are left-padded in batches for efficient generation
  • Metadata passthrough: Dict-format prompts carry extra keys to the reward function
  • Prompt budget: max_prompt_length = seq_length - max_new_tokens ensures room for generation
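The padding consideration can be sketched as a collate step (pure Python, no torch; pad_id=0 is an arbitrary choice here, not trlx's default): left-padding aligns every prompt's last token at the same position, so generation appends new tokens contiguously for the whole batch.

```python
def left_pad_collate(batch, pad_id=0):
    """Left-pad a batch of token-id lists and build attention masks.

    Padding positions get mask 0 so the model ignores them; real tokens
    get mask 1.
    """
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        pad = [pad_id] * (max_len - len(seq))
        input_ids.append(pad + seq)
        attention_mask.append([0] * len(pad) + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```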

Related Pages

Implemented By
