Principle:Volcengine Verl Trajectory Reward Shaping

From Leeroopedia


Knowledge Sources
Domains: Reinforcement_Learning, Reward_Engineering, Agentic_AI
Last Updated: 2026-02-07 14:00 GMT

Overview

A reward engineering technique that provides intermediate reward bonuses or penalties for specific behaviors during multi-turn trajectories, supplementing the final outcome reward.

Description

Trajectory Reward Shaping augments the final outcome reward with step-level signals that guide the model toward desired behaviors during multi-turn interactions. In the context of tool-use training, reward shaping can:

  • Provide a bonus for correctly using tool-calling syntax (<tool_call>)
  • Penalize unproductive tool submissions (e.g., calling a tool without improving the answer)
  • Add format bonuses for following expected output structure

This is particularly important in multi-turn settings where the final reward alone may provide insufficient signal for learning to use tools effectively.
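
The three signals listed above can be sketched as a per-turn scoring function. This is a minimal illustration under stated assumptions, not the verl implementation: the `Turn` dataclass, the `answer_improved` flag, the closing `</tool_call>` tag, the `<answer>` tags used as the "expected output structure", and the bonus and penalty magnitudes are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical magnitudes, chosen only for illustration.
TOOL_SYNTAX_BONUS = 0.1
UNPRODUCTIVE_PENALTY = -0.05
FORMAT_BONUS = 0.1


@dataclass
class Turn:
    """One assistant turn in a multi-turn trajectory (hypothetical structure)."""
    text: str
    # Assumed to be supplied by the environment after the tool call runs.
    answer_improved: bool = False


def turn_shaping_reward(turn: Turn) -> float:
    """Return the step-level shaping signal for a single turn."""
    reward = 0.0
    used_tool = "<tool_call>" in turn.text and "</tool_call>" in turn.text
    if used_tool:
        # Bonus for well-formed tool-calling syntax.
        reward += TOOL_SYNTAX_BONUS
        if not turn.answer_improved:
            # Penalty for a tool submission that did not improve the answer.
            reward += UNPRODUCTIVE_PENALTY
    if "<answer>" in turn.text and "</answer>" in turn.text:
        # Bonus for following the assumed output structure.
        reward += FORMAT_BONUS
    return reward
```

Keeping the shaping magnitudes small relative to the outcome reward is a common design choice, so that the model cannot score well by collecting format bonuses without solving the task.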

Usage

Use trajectory reward shaping when:

  • Training models to use tools in multi-turn settings
  • The final outcome reward alone is too sparse for effective learning
  • Specific intermediate behaviors should be encouraged or discouraged

Reward shaping is implemented as a custom reward function registered per dataset.
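
One way to picture per-dataset registration is a lookup table keyed by a dataset identifier. The sketch below is an assumption about the general pattern, not the framework's actual mechanism: the registry, the `register_reward` decorator, the `compute_score` dispatcher, the dataset name `"tool_use_demo"`, and the substring-based outcome check are all hypothetical. How a custom reward function is actually wired in depends on the training framework's configuration.

```python
from typing import Callable, Dict

RewardFn = Callable[[str, str], float]

# Hypothetical registry mapping a dataset identifier to its reward function.
_REWARD_FNS: Dict[str, RewardFn] = {}


def register_reward(data_source: str) -> Callable[[RewardFn], RewardFn]:
    """Decorator that registers a reward function for one dataset."""
    def decorator(fn: RewardFn) -> RewardFn:
        _REWARD_FNS[data_source] = fn
        return fn
    return decorator


@register_reward("tool_use_demo")
def tool_use_reward(response: str, ground_truth: str) -> float:
    # Outcome reward (toy check: ground truth appears in the response).
    outcome = 1.0 if ground_truth in response else 0.0
    # Shaping bonus for using tool-call syntax.
    bonus = 0.1 if "<tool_call>" in response else 0.0
    return outcome + bonus


def compute_score(data_source: str, response: str, ground_truth: str) -> float:
    """Dispatch to the reward function registered for this dataset."""
    if data_source not in _REWARD_FNS:
        raise KeyError(f"no reward function registered for {data_source!r}")
    return _REWARD_FNS[data_source](response, ground_truth)
```

With this layout, datasets that need no shaping can register a plain outcome reward, while tool-use datasets register a shaped one.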

Theoretical Basis

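Reward shaping adds an intermediate reward at each step of the trajectory to the final outcome reward. In the strict potential-based form, each shaping term is a difference of a potential function over successive states, which leaves the optimal policy unchanged; the heuristic bonuses used in practice do not necessarily satisfy this property. With s_t and a_t denoting the state and action at step t of a T-step trajectory: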

$$R_{\text{total}} = R_{\text{outcome}} + \sum_{t=1}^{T} R_{\text{shaping}}(s_t, a_t)$$
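
As a worked instance of this sum, the following sketch adds an outcome reward to per-step shaping terms. The function name `total_reward`, the representation of steps as `(state, action)` string pairs, and the toy shaping rule are assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str]  # (state, action); plain strings in this toy example


def total_reward(
    outcome: float,
    steps: Sequence[Step],
    shaping: Callable[[str, str], float],
) -> float:
    """Outcome reward plus the sum of shaping rewards over all steps."""
    return outcome + sum(shaping(s, a) for s, a in steps)


def toy_shaping(state: str, action: str) -> float:
    # Toy rule: small bonus whenever the action contains a tool call.
    return 0.1 if "<tool_call>" in action else 0.0


steps = [
    ("question", "<tool_call>search</tool_call>"),
    ("tool result", "<tool_call>calculator</tool_call>"),
    ("tool result", "final answer"),
]
# Outcome 1.0 plus two tool-call bonuses of 0.1 each.
result = total_reward(1.0, steps, toy_shaping)
```

Here T = 3, two of the three steps earn the 0.1 bonus, and the total is approximately 1.2.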

In practice for tool-use training:

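# Abstract trajectory reward shaping (helpers and <answer> tags are illustrative assumptions)
import re

def extract_answer(response):
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None

def compute_shaped_reward(response, ground_truth):
    answer = extract_answer(response)
    # Outcome reward: 1.0 only when the extracted answer matches exactly
    base_score = 1.0 if answer is not None and answer == ground_truth else 0.0
    # Bonus for using tool-call format
    format_bonus = 0.1 if "<tool_call>" in response else 0.0
    # Bonus for correct format even if wrong answer
    format_score = 0.1 if answer is not None else 0.0
    return base_score + format_bonus + format_score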

Related Pages

Implemented By
