Principle:Volcengine Verl Trajectory Reward Shaping

Knowledge Sources	Reward Shaping in Reinforcement Learning, verl
Domains	Reinforcement_Learning, Reward_Engineering, Agentic_AI
Last Updated	2026-02-07 14:00 GMT

Overview

A reward engineering technique that provides intermediate reward bonuses or penalties for specific behaviors during multi-turn trajectories, supplementing the final outcome reward.

Description

Trajectory Reward Shaping augments the final outcome reward with step-level signals that guide the model toward desired behaviors during multi-turn interactions. In the context of tool-use training, reward shaping can:

Provide a bonus for correctly using tool-calling syntax (<tool_call>)
Penalize unproductive tool submissions (e.g., calling a tool without improving the answer)
Add format bonuses for following expected output structure

This matters most in multi-turn settings, where a single reward at the end of the trajectory is often too sparse to indicate which turns (for example, a well-formed tool call) contributed to the result.

Usage

Use trajectory reward shaping when:

Training models to use tools in multi-turn settings
The final outcome reward alone is too sparse for effective learning
Specific intermediate behaviors should be encouraged or discouraged

Reward shaping is implemented as a custom reward function registered per dataset.
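
One way to picture "registered per dataset" is a single entry point that dispatches on a dataset identifier. The sketch below assumes a signature of the form compute_score(data_source, solution_str, ground_truth, extra_info=None), which mirrors how custom reward functions are commonly written for verl. Treat the signature, the dataset keys, the registry, and both scorer functions as illustrative assumptions rather than the library's confirmed API.

```python
from typing import Callable, Dict, Optional

# Hypothetical per-dataset scorers; function names and dataset keys are illustrative.
def score_plain_qa(solution_str: str, ground_truth: str) -> float:
    """Outcome-only reward: 1.0 on exact (whitespace-trimmed) match."""
    return 1.0 if solution_str.strip() == ground_truth.strip() else 0.0

def score_tool_qa(solution_str: str, ground_truth: str) -> float:
    """Outcome reward plus a small bonus for using tool-call syntax."""
    outcome = 1.0 if solution_str.strip().endswith(ground_truth.strip()) else 0.0
    bonus = 0.1 if "<tool_call>" in solution_str else 0.0
    return outcome + bonus

# Hypothetical registry mapping a dataset identifier to its scorer.
REWARD_REGISTRY: Dict[str, Callable[[str, str], float]] = {
    "example/plain_qa": score_plain_qa,
    "example/tool_qa": score_tool_qa,
}

def compute_score(data_source: str, solution_str: str, ground_truth: str,
                  extra_info: Optional[dict] = None) -> float:
    """Dispatch to the scorer registered for this dataset."""
    try:
        scorer = REWARD_REGISTRY[data_source]
    except KeyError:
        raise ValueError(f"no reward function registered for {data_source!r}")
    return scorer(solution_str, ground_truth)
```

Keeping the shaping logic inside the per-dataset scorer means datasets without tool use keep a plain outcome reward, while tool-use datasets opt in to the extra terms.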

Theoretical Basis

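Reward shaping adds intermediate rewards on top of the final outcome reward. Potential-based shaping, in which each term has the form $\gamma \Phi(s_{t+1}) - \Phi(s_{t})$, is the special case known to preserve the optimal policy; heuristic bonuses such as the ones used here carry no such guarantee and should be kept small relative to the outcome reward. The general additive form is: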

$R_{\text{total}} = R_{\text{outcome}} + \sum_{t=1}^{T} R_{\text{shaping}}(s_{t}, a_{t})$
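
To make the sum concrete, the sketch below treats a trajectory as a list of turns and adds a per-turn shaping term to the outcome reward. The Turn fields, the bonus and penalty magnitudes (0.1 and 0.05), and the answer_improved flag are hypothetical choices for illustration, not values taken from verl.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    text: str              # model output for this turn (the action a_t)
    answer_improved: bool  # hypothetical flag: did this turn improve the answer?

def shaping_term(turn: Turn) -> float:
    """R_shaping(s_t, a_t) for a single turn; magnitudes are illustrative."""
    reward = 0.0
    if "<tool_call>" in turn.text:
        reward += 0.1                  # bonus for using tool-call syntax
        if not turn.answer_improved:
            reward -= 0.05             # penalty for an unproductive tool submission
    return reward

def total_reward(outcome: float, turns: List[Turn]) -> float:
    """R_total = R_outcome + sum over turns of R_shaping."""
    return outcome + sum(shaping_term(t) for t in turns)

trajectory = [
    Turn("<tool_call>search('x')</tool_call>", answer_improved=True),
    Turn("<tool_call>search('x')</tool_call>", answer_improved=False),
    Turn("The answer is 42.", answer_improved=False),
]
print(total_reward(1.0, trajectory))
```

In this sketch the second turn still nets a small positive term; whether an unproductive call should net zero or negative is a tuning decision that depends on the task.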

In practice for tool-use training:

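# Abstract trajectory reward shaping (pseudocode sketch).
# extract_answer, exact_match, and has_expected_format are placeholder helpers
# that a concrete reward function must supply.
def compute_shaped_reward(response: str, ground_truth: str) -> float:
    # Outcome reward: exact_match is assumed to return 1.0 on a match, else 0.0
    base_score = float(exact_match(extract_answer(response), ground_truth))
    # Bonus for emitting the tool-call syntax at least once
    tool_call_bonus = 0.1 if "<tool_call>" in response else 0.0
    # Bonus for following the expected output structure, granted even when the answer is wrong
    format_bonus = 0.1 if has_expected_format(response) else 0.0
    return base_score + tool_call_bonus + format_bonus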
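
The helpers in the abstract function are left undefined. A minimal runnable instantiation could look like the following, assuming final answers are wrapped in <answer>...</answer> tags. That tag convention and the helper bodies are assumptions made for this example only and are not taken from the verl implementation.

```python
import re
from typing import Optional

# Assumed answer format: <answer>...</answer> (hypothetical convention).
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(response: str) -> Optional[str]:
    """Return the content of the last <answer> block, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None

def exact_match(prediction: Optional[str], ground_truth: str) -> float:
    """1.0 when the prediction equals the trimmed ground truth, else 0.0."""
    if prediction is None:
        return 0.0
    return 1.0 if prediction == ground_truth.strip() else 0.0

def has_expected_format(response: str) -> bool:
    """True when the response contains exactly one <answer> block."""
    return len(ANSWER_RE.findall(response)) == 1

def compute_shaped_reward(response: str, ground_truth: str) -> float:
    base_score = exact_match(extract_answer(response), ground_truth)
    tool_call_bonus = 0.1 if "<tool_call>" in response else 0.0
    format_bonus = 0.1 if has_expected_format(response) else 0.0
    return base_score + tool_call_bonus + format_bonus

good = "<tool_call>calc(6*7)</tool_call>\n<answer>42</answer>"
wrong = "<answer>41</answer>"
print(compute_shaped_reward(good, "42"), compute_shaped_reward(wrong, "42"))
```

A well-formatted but wrong response still earns the format bonus, which is the intended effect of shaping: the model receives some signal for structure before it learns to answer correctly.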

Related Pages

Implemented By

Implementation:Volcengine_Verl_Toolcall_Shaping_Reward
