Principle:CarperAI Trlx RL Data Structures

From Leeroopedia


Knowledge Sources
Domains: Data_Structures, Reinforcement_Learning
Last Updated: 2026-02-07 16:00 GMT

Overview

Design pattern for defining typed data containers that standardize the interface between RL data pipelines and trainers.

Description

RL training pipelines pass structured data (prompts, generated tokens, rewards, log-probabilities, values) between components: data pipelines produce prompt batches, models produce completions with log-probabilities, and trainers consume the combined rollout data. Typed dataclasses give every boundary the same field names and documented tensor shapes and types, which helps surface mismatched data structures early instead of as runtime errors deep inside a trainer. Separate single-element and batch containers support both per-sample processing and efficient batched operations.
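As a minimal sketch of how a typed container can surface a mismatch at the boundary between components, the example below validates field lengths when the object is constructed. The class name `RolloutElement`, its field names, and the use of plain Python lists in place of tensors are assumptions made for illustration; they are not taken from trlx.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutElement:
    """Hypothetical single rollout; plain lists stand in for 1-D tensors."""

    query_tokens: List[int]      # prompt token ids
    response_tokens: List[int]   # generated token ids
    logprobs: List[float]        # one log-probability per response token
    rewards: List[float]         # one reward per response token

    def __post_init__(self) -> None:
        # Per-token fields must line up with the response they describe.
        n = len(self.response_tokens)
        for name in ("logprobs", "rewards"):
            got = len(getattr(self, name))
            if got != n:
                raise ValueError(f"{name} has length {got}, expected {n}")


# A consistent element constructs without complaint.
ok = RolloutElement(
    query_tokens=[11, 12],
    response_tokens=[21, 22, 23],
    logprobs=[-0.1, -0.2, -0.3],
    rewards=[0.0, 0.0, 1.0],
)

# A mismatched element fails at construction, not later in the trainer.
try:
    RolloutElement([11], [21, 22], logprobs=[-0.1], rewards=[0.0, 1.0])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

The check lives in `__post_init__` so that every producer of the container gets the same validation without repeating it at each call site.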

Usage

Use this principle when designing the data flow between RL training components. Define dataclasses for each data interchange point (prompts, rollout elements, training batches) with explicit type annotations for tensor shapes.

Theoretical Basis

The pattern follows the Data Transfer Object design pattern:

  1. Single Element: Represents one data point (one prompt, one rollout).
  2. Batch Element: Represents a batch of data points with an additional batch dimension.
  3. Type Safety: TensorType annotations document expected shapes at the type level.
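  4. Fixed Schema: A dataclass declares a fixed set of named fields. Instances are mutable by default; declaring the class with frozen=True prevents accidental field reassignment.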
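To illustrate the last point, here is a minimal sketch of a frozen dataclass rejecting field reassignment. `FrozenElement` is a hypothetical name, and tuples stand in for tensors so that the example needs only the standard library.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FrozenElement:
    """Hypothetical element whose fields cannot be reassigned."""

    tokens: Tuple[int, ...]
    rewards: Tuple[float, ...]


elem = FrozenElement(tokens=(1, 2, 3), rewards=(0.0, 0.0, 1.0))

try:
    elem.rewards = (9.0, 9.0, 9.0)  # reassignment is rejected
    reassigned = True
except FrozenInstanceError:
    reassigned = False

# To "change" a field, build a new instance instead of editing in place.
scaled = replace(elem, rewards=tuple(r * 0.5 for r in elem.rewards))
```

Freezing blocks attribute reassignment only. A mutable object stored in a field, such as a tensor, can still be modified in place.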

Pseudo-code Logic:

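# Abstract pattern (NOT real implementation)
from dataclasses import dataclass

from torchtyping import TensorType  # third-party package; requires torch

# The annotations document expected shapes; dataclasses do not check them at runtime.

@dataclass
class RLElement:
    tokens: TensorType["seq_len"]    # token ids for one sample
    rewards: TensorType["seq_len"]   # one reward per token

@dataclass
class RLBatch:
    tokens: TensorType["batch", "seq_len"]    # same fields, leading batch dimension
    rewards: TensorType["batch", "seq_len"]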
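
The step from single elements to a batch can be sketched as a small collate function. This is an illustration under stated assumptions, not the trlx implementation: `Element`, `Batch`, `collate`, right-padding, and the default `pad_token=0` are all hypothetical choices, each element's `rewards` is assumed to have the same length as its `tokens`, and nested lists stand in for 2-D tensors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical single rollout; lists stand in for 1-D tensors."""

    tokens: List[int]
    rewards: List[float]


@dataclass
class Batch:
    """Hypothetical batch; nested lists stand in for [batch, seq_len] tensors."""

    tokens: List[List[int]]
    rewards: List[List[float]]


def collate(elems: List[Element], pad_token: int = 0) -> Batch:
    """Right-pad every element to the longest sequence, then stack."""
    if not elems:
        raise ValueError("cannot collate an empty list")
    max_len = max(len(e.tokens) for e in elems)
    return Batch(
        tokens=[e.tokens + [pad_token] * (max_len - len(e.tokens)) for e in elems],
        rewards=[e.rewards + [0.0] * (max_len - len(e.rewards)) for e in elems],
    )


batch = collate([
    Element(tokens=[5, 6, 7], rewards=[0.0, 0.0, 1.0]),
    Element(tokens=[8], rewards=[0.5]),
])
```

Keeping the element and batch types separate means per-sample code never has to guess whether a leading batch dimension is present.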
