Principle:CarperAI Trlx Grounded Program Synthesis
| Knowledge Sources | |
|---|---|
| Domains | Program_Synthesis, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Technique that trains language models to generate programs from input/output examples using reinforcement learning with execution-based feedback.
Description
Grounded program synthesis uses reinforcement learning (specifically PPO) to train language models to predict programs in a domain-specific language (DSL) from input/output example pairs. The approach is "grounded" because the reward signal comes from actually executing the predicted program on the input and comparing its output to the expected result. This yields an objective, automatically verifiable reward, in contrast to natural language tasks, where the reward is often estimated by a model of human preferences.
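As a minimal sketch of what an execution-based reward can look like, the snippet below defines a toy list-manipulation DSL and scores a predicted program by running it. The DSL, its operation names, and the helpers `execute` and `execution_reward` are hypothetical and chosen only for illustration; they are not taken from any particular library.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy DSL (an assumption for this sketch): a program is a
# whitespace-separated pipeline of list operations, applied left to right.
OPS: Dict[str, Callable[[List[int]], List[int]]] = {
    "reverse": lambda xs: xs[::-1],
    "sort": lambda xs: sorted(xs),
    "double": lambda xs: [2 * x for x in xs],
    "tail": lambda xs: xs[1:],
}


def execute(program: str, xs: List[int]) -> Optional[List[int]]:
    """Run a program on an input list; return None if it is not valid DSL."""
    result = list(xs)
    for token in program.split():
        op = OPS.get(token)
        if op is None:
            return None  # unknown operation: the program cannot be executed
        result = op(result)
    return result


def execution_reward(program: str, xs: List[int], expected: List[int]) -> float:
    """Binary grounded reward: 1.0 only if execution reproduces the expected output."""
    return 1.0 if execute(program, xs) == expected else 0.0


print(execution_reward("sort reverse", [3, 1, 2], [3, 2, 1]))  # 1.0
print(execution_reward("reverse", [3, 1, 2], [3, 2, 1]))       # 0.0
print(execution_reward("shuffle", [3, 1, 2], [3, 2, 1]))       # 0.0 (invalid program)
```

Because the interpreter is deterministic, the same program and input always receive the same reward, so no learned reward model is needed.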
Usage
Use this principle when training models to generate executable programs from specifications such as I/O examples or natural language descriptions. It is particularly effective when a deterministic interpreter exists for the target language, because rewards can then be computed automatically by executing the generated program.
Theoretical Basis
The core idea combines:
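- Program Induction: Given input-output pairs $\{(x_i, y_i)\}_{i=1}^{n}$, find a program $p$ such that $p(x_i) = y_i$ for all $i$.
- RL-based Search: Use PPO to optimize the language model policy $\pi_\theta$ to maximize the expected reward $\mathbb{E}_{p \sim \pi_\theta}[R(p)]$, where $R(p) = 1$ if $p(x_i) = y_i$ for all $i$ and $R(p) = 0$ otherwise.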
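To make the program induction condition concrete, the sketch below checks candidate programs against every input-output pair and finds a consistent one by brute-force enumeration. Enumeration is deliberately swapped in here for the learned policy, purely to show what "consistent with all examples" means; the integer DSL and the names `INT_OPS`, `run`, `is_consistent`, and `enumerate_search` are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical integer DSL used only for illustration.
INT_OPS: Dict[str, Callable[[int], int]] = {
    "inc": lambda v: v + 1,
    "double": lambda v: v * 2,
    "square": lambda v: v * v,
}

Program = Tuple[str, ...]


def run(program: Program, x: int) -> int:
    """Apply each operation of the program to x, left to right."""
    for name in program:
        x = INT_OPS[name](x)
    return x


def is_consistent(program: Program, examples: List[Tuple[int, int]]) -> bool:
    """Induction condition: the program must reproduce every (input, output) pair."""
    return all(run(program, x) == y for x, y in examples)


def enumerate_search(examples: List[Tuple[int, int]], max_len: int = 3) -> Optional[Program]:
    """Return the first (shortest) consistent program, or None if none is found."""
    for length in range(1, max_len + 1):
        for candidate in product(INT_OPS, repeat=length):
            if is_consistent(candidate, examples):
                return candidate
    return None


examples = [(1, 4), (2, 6), (3, 8)]  # generated by y = 2 * (x + 1)
print(enumerate_search(examples))  # ('inc', 'double')
```

Enumeration grows exponentially with program length, which is one motivation for replacing it with a learned policy that proposes likely programs directly. With only the single pair `(1, 4)`, both `('inc', 'double')` and `('inc', 'square')` would pass the check, which is why several pairs are typically needed to pin down the intended program.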
Pseudo-code Logic:
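# Abstract algorithm (NOT real implementation)
for io_examples, ground_truth_program in dataset:
    predicted_program = model.generate(io_examples)
    # Grounded reward: execute the prediction on every input and compare outputs
    solved = all(execute(predicted_program, x) == y for x, y in io_examples)
    reward = 1.0 if solved else 0.0
    update_policy(model, reward)  # e.g. one PPO update step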
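The loop above can be made runnable at toy scale. The sketch below is not PPO and does not use a language model: it substitutes a plain REINFORCE update on a softmax policy over four fixed candidate programs, which is enough to show the execution reward steering probability mass toward the program that solves the examples. The DSL, the candidate list, and every function name here are assumptions made for illustration.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical integer DSL and candidate programs, for illustration only.
OPS: Dict[str, Callable[[int], int]] = {
    "inc": lambda v: v + 1,
    "double": lambda v: v * 2,
}
CANDIDATES: List[Tuple[str, ...]] = [
    ("inc",), ("double",), ("inc", "double"), ("double", "inc"),
]


def execute(program: Tuple[str, ...], x: int) -> int:
    """Apply each operation of the program to x, left to right."""
    for name in program:
        x = OPS[name](x)
    return x


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(examples: List[Tuple[int, int]], steps: int = 500,
          lr: float = 0.5, seed: int = 0) -> List[float]:
    """REINFORCE on a tabular softmax policy (a stand-in for PPO on a language model)."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        program = CANDIDATES[choice]
        # Grounded reward: 1.0 only if the sampled program solves every example
        solved = all(execute(program, x) == y for x, y in examples)
        reward = 1.0 if solved else 0.0
        # d/d logit_i of log pi(choice) equals 1[i == choice] - probs[i]
        for i in range(len(logits)):
            indicator = 1.0 if i == choice else 0.0
            logits[i] += lr * reward * (indicator - probs[i])
    return softmax(logits)


examples = [(1, 4), (2, 6), (3, 8)]  # only ("inc", "double") satisfies all pairs
final_probs = train(examples)
best = max(range(len(CANDIDATES)), key=lambda i: final_probs[i])
print(CANDIDATES[best], round(final_probs[best], 3))
```

With a binary reward, the policy only changes on steps where a correct program is sampled, so learning depends on the policy occasionally producing a correct program in the first place. In this four-candidate toy that happens quickly; in a realistic program space it is a known difficulty of sparse execution rewards.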