Principle:CarperAI Trlx Grounded Program Synthesis

Knowledge Sources	Program Synthesis with LLMs
Domains	Program_Synthesis, NLP, Reinforcement_Learning
Last Updated	2026-02-07 16:00 GMT

Overview

Technique that trains language models to generate programs from input/output examples using reinforcement learning with execution-based feedback.

Description

Grounded program synthesis uses RL (specifically PPO) to train language models to predict programs in a domain-specific language (DSL) from input/output example pairs. The approach is "grounded" because the reward signal comes from actually executing the predicted program on the inputs and comparing its outputs to the expected results. This yields an objective, automatically verifiable reward signal, in contrast to natural language tasks, where the reward is often based on human preference models.

Usage

Use this principle when training models to generate executable programs from specifications such as I/O examples or natural language descriptions. It is particularly effective when a deterministic interpreter exists for the target language, since the reward can then be computed automatically by executing the program.

Theoretical Basis

The core idea combines:

Program Induction: Given input-output pairs $\{(x_{i}, y_{i})\}_{i = 1}^{k}$, find a program $p$ such that $p(x_{i}) = y_{i}$ for all $i$.
RL-based Search: Use PPO to optimize the language model policy $\pi_{\theta}$ to maximize the reward:

$R(p) = \begin{cases} 1 & \text{if } \forall i : \mathrm{exec}(p, x_{i}) = y_{i} \\ 0 & \text{otherwise} \end{cases}$
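The reward definition above can be made concrete with a small sketch. The DSL here is hypothetical (a whitespace-separated pipeline of list operations invented for illustration) and is not the DSL used by the linked implementation. The names `OPS`, `execute`, and `reward` are likewise assumptions. Scoring a program that fails to execute as 0 is one possible choice that the formula itself does not specify.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy DSL: a program is a whitespace-separated pipeline of
# operation names, applied left to right to a list of integers.
OPS: Dict[str, Callable[[List[int]], List[int]]] = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "drop_first": lambda xs: xs[1:],
}


def execute(program: str, x: Sequence[int]) -> List[int]:
    """Deterministic interpreter; raises ValueError on an unknown operation."""
    value = list(x)
    for op in program.split():
        if op not in OPS:
            raise ValueError(f"unknown operation: {op}")
        value = OPS[op](value)
    return value


def reward(
    program: str,
    io_examples: Sequence[Tuple[Sequence[int], Sequence[int]]],
) -> float:
    """Binary reward R(p): 1.0 only if every I/O example is reproduced."""
    try:
        solved = all(execute(program, x) == list(y) for x, y in io_examples)
    except ValueError:
        # Assumption: a program that cannot be executed scores like a wrong one.
        return 0.0
    return 1.0 if solved else 0.0


examples = [([3, 1, 2], [6, 4, 2]), ([5, 4], [10, 8])]
print(reward("sort reverse double", examples))  # matches both examples
print(reward("double", examples))               # wrong output order
print(reward("sort frobnicate", examples))      # not a valid program
```

Because the interpreter is deterministic, the same program always receives the same reward, which is what makes the signal verifiable without a learned reward model.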

Pseudo-code Logic:

# Abstract algorithm (NOT real implementation)
for each (io_examples, ground_truth_program) in dataset:
    predicted_program = model.generate(io_examples)
    # Reward is 1.0 only if the program reproduces every expected output
    solved = all(execute(predicted_program, x) == y for (x, y) in io_examples)
    reward = 1.0 if solved else 0.0
    update_policy(model, reward)  # PPO step
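As a rough, runnable illustration of the sample, execute, and update cycle, the sketch below makes two loud simplifications: the policy is a softmax over a fixed set of hypothetical candidate programs rather than a language model, and the update is plain REINFORCE rather than PPO. All names (`CANDIDATES`, `train`, and so on) are invented for this example.

```python
import math
import random

# Hypothetical candidate programs in a toy DSL, each a callable on a list.
CANDIDATES = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
}
NAMES = list(CANDIDATES)


def reward(name, io_examples):
    """1.0 if the program reproduces every expected output, else 0.0."""
    return 1.0 if all(CANDIDATES[name](x) == y for x, y in io_examples) else 0.0


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(io_examples, steps=300, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(NAMES)  # stand-in for the policy parameters theta
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(NAMES)), weights=probs)[0]  # sample a program
        r = reward(NAMES[i], io_examples)                     # execute and score
        # REINFORCE (not PPO): d log pi(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(len(logits)):
            logits[j] += lr * r * ((1.0 if j == i else 0.0) - probs[j])
    return dict(zip(NAMES, softmax(logits)))


examples = [([3, 1, 2], [1, 2, 3]), ([9, 7], [7, 9])]
policy = train(examples)
print(max(policy, key=policy.get))
```

Only "sort" satisfies both examples, so it is the only candidate that ever receives a nonzero reward, and its probability rises each time it is sampled. With a real language model the action space is the set of all token sequences, which is why a more stable update rule such as PPO is used in practice.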

Related Pages

Implementation:CarperAI_Trlx_DSL_Interpreter
