Principle:Volcengine Verl Trajectory Reward Shaping
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Reward_Engineering, Agentic_AI |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A reward engineering technique that provides intermediate reward bonuses or penalties for specific behaviors during multi-turn trajectories, supplementing the final outcome reward.
Description
Trajectory Reward Shaping augments the final outcome reward with step-level signals that guide the model toward desired behaviors during multi-turn interactions. In the context of tool-use training, reward shaping can:
- Provide a bonus for correctly using tool-calling syntax (<tool_call>)
- Penalize unproductive tool submissions (e.g., calling a tool without improving the answer)
- Add format bonuses for following expected output structure
Shaping matters most in multi-turn settings, where the final reward alone may be too sparse a signal for the model to learn effective tool use.
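The step-level signals listed above can be sketched as a per-turn function whose outputs are summed with the outcome reward. This is a minimal illustration, not verl's implementation: the `Turn` record, its `answer_improved` flag, the bonus and penalty magnitudes, and the assumption that a tool call carries a JSON body are all hypothetical.

```python
import json
import re
from dataclasses import dataclass


# Hypothetical per-turn record; the field names are illustrative only.
@dataclass
class Turn:
    text: str              # model output for this turn
    answer_improved: bool  # whether the tool result moved the answer forward


TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def turn_signal(turn: Turn, call_bonus: float = 0.1,
                waste_penalty: float = 0.05) -> float:
    """Return the step-level shaping signal for one turn."""
    match = TOOL_CALL_RE.search(turn.text)
    if match is None:
        return 0.0  # no tool call, so no shaping signal
    try:
        json.loads(match.group(1))  # assumption: the call body is JSON
    except json.JSONDecodeError:
        return 0.0  # malformed call earns no syntax bonus
    signal = call_bonus
    if not turn.answer_improved:
        signal -= waste_penalty  # penalize an unproductive tool submission
    return signal


def shaped_trajectory_reward(turns: list[Turn], outcome_reward: float) -> float:
    """Final outcome reward plus the sum of step-level signals."""
    return outcome_reward + sum(turn_signal(t) for t in turns)
```

Keeping the per-turn magnitudes small relative to the outcome reward is a deliberate choice in this sketch, so that a long trajectory of well-formed but useless tool calls cannot outscore a correct answer.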
Usage
Use trajectory reward shaping when:
- Training models to use tools in multi-turn settings
- The final outcome reward alone is too sparse for effective learning
- Specific intermediate behaviors should be encouraged or discouraged
Reward shaping is implemented as a custom reward function registered per dataset.
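One way to picture per-dataset registration is a single entry point that dispatches on the dataset identifier. The sketch below assumes the hook is called with `data_source`, `solution_str`, `ground_truth`, and `extra_info`, which matches the convention verl's custom reward functions are understood to follow. The dataset names, the two scorers, and the substring-based correctness check are hypothetical placeholders.

```python
from typing import Callable, Dict, Optional


def score_tool_use(solution_str: str, ground_truth: str) -> float:
    """Hypothetical scorer: outcome reward plus a small tool-call bonus."""
    outcome = 1.0 if ground_truth in solution_str else 0.0  # crude check
    bonus = 0.1 if "<tool_call>" in solution_str else 0.0
    return outcome + bonus


def score_plain(solution_str: str, ground_truth: str) -> float:
    """Hypothetical scorer: outcome reward only, no shaping."""
    return 1.0 if ground_truth in solution_str else 0.0


# Hypothetical dataset names mapped to their reward functions.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "my_tool_dataset": score_tool_use,
    "my_plain_dataset": score_plain,
}


def compute_score(data_source: str, solution_str: str, ground_truth: str,
                  extra_info: Optional[dict] = None) -> float:
    """Dispatch to the scorer registered for this dataset."""
    try:
        scorer = SCORERS[data_source]
    except KeyError:
        raise NotImplementedError(
            f"no reward function for {data_source!r}") from None
    return scorer(solution_str, ground_truth)
```

In verl, a file like this is usually pointed to through the `custom_reward_function.path` and `custom_reward_function.name` configuration entries. The exact keys and call signature may differ between versions, so check the documentation for the release in use.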
Theoretical Basis
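In its classical form, reward shaping adds a potential-based term to the reward of each step:

$$R'(s, a, s') = R(s, a, s') + \gamma \Phi(s') - \Phi(s)$$

where $\Phi$ is a potential function over states and $\gamma$ is the discount factor.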
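As a rough illustration of the potential-based form, the sketch below adds `gamma * potential(next_state) - potential(state)` to each step's reward. The state encoding (a count of successful tool calls) and the `tool_progress` potential are assumptions made for the example, not something this page or verl prescribes.

```python
from typing import Callable, List, Sequence


def potential_shaped_rewards(
    states: Sequence[object],
    rewards: Sequence[float],
    potential: Callable[[object], float],
    gamma: float = 1.0,
) -> List[float]:
    """Add gamma * potential(s_next) - potential(s) to each step reward.

    `states` must hold one more entry than `rewards`.
    """
    if len(states) != len(rewards) + 1:
        raise ValueError("expected one more state than rewards")
    return [
        r + gamma * potential(s_next) - potential(s)
        for r, s, s_next in zip(rewards, states[:-1], states[1:])
    ]


# Hypothetical potential: half a point per successful tool call so far.
def tool_progress(state: int) -> float:
    return 0.5 * state


states = [0, 1, 2, 2]      # successful tool calls after each turn
rewards = [0.0, 0.0, 1.0]  # sparse outcome reward on the final turn
shaped = potential_shaped_rewards(states, rewards, tool_progress)
```

With `gamma = 1.0` the shaping terms telescope, so the total shaped return differs from the original return only by the potential of the last state minus the potential of the first. The intermediate steps still receive a denser signal, which is the point of shaping.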
In practice for tool-use training:
# Abstract trajectory reward shaping.
# exact_match, extract_answer, and has_expected_format are placeholder helpers.
def compute_shaped_reward(response, ground_truth):
    # Outcome reward from comparing the extracted answer with the ground truth
    base_score = exact_match(extract_answer(response), ground_truth)
    # Bonus for using tool-call format
    format_bonus = 0.1 if "<tool_call>" in response else 0.0
    # Bonus for correct format even if wrong answer
    format_score = 0.1 if has_expected_format(response) else 0.0
    return base_score + format_bonus + format_score
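To make the abstract function concrete, here is one possible self-contained version. The three helpers are not defined on this page, so their bodies are assumptions: answers are taken to be wrapped in `<answer>...</answer>` tags, and "expected format" is taken to mean exactly one such span.

```python
import re

# Assumption: the final answer is wrapped in <answer>...</answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response: str) -> str:
    """Return the text of the last answer span, or '' if there is none."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else ""


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 for a non-empty prediction equal to the ground truth, else 0.0."""
    return 1.0 if prediction and prediction == ground_truth.strip() else 0.0


def has_expected_format(response: str) -> bool:
    """True when the response contains exactly one answer span."""
    return len(ANSWER_RE.findall(response)) == 1


def compute_shaped_reward(response: str, ground_truth: str) -> float:
    base_score = exact_match(extract_answer(response), ground_truth)
    # Bonus for using tool-call format
    format_bonus = 0.1 if "<tool_call>" in response else 0.0
    # Bonus for correct format even if the answer is wrong
    format_score = 0.1 if has_expected_format(response) else 0.0
    return base_score + format_bonus + format_score
```

Under these assumptions a wrong answer can still collect both bonuses (0.2 in total), so the bonuses are kept well below the outcome reward of 1.0.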