Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Volcengine Verl Toolcall Shaping Reward

From Leeroopedia


Field Value
Knowledge Sources verl source code, GSM8K toolcall shaping example
Domains Reward Shaping, Trajectory Scoring, Tool-Use Incentivization
Last Updated 2026-02-07

Overview

Description

This module provides a trajectory-level reward shaping function designed to incentivize the model to use tool calls during GSM8K math problem solving. It combines three reward components:

  • base_score -- The standard GSM8K correctness reward computed by gsm8k_compute_score, which checks whether the model's answer matches the ground truth.
  • format_score -- A bonus awarded when the model's output follows the expected answer format (e.g., containing "#### ").
  • shaping_reward -- An additional bonus awarded when the model includes the trigger_substring (default: "<tool_call>") anywhere in its output, encouraging tool usage.

The final reward is: base_score + shaping_reward_bonus.

The compute_score wrapper function provides the standard interface expected by verl's reward configuration system, delegating to toolcall_shaping_reward with default parameters.

Usage

Reference this reward function in the trainer YAML config under custom_reward_function.path or reward_fn. The function signature matches verl's expected reward function interface: compute_score(data_source, solution_str, ground_truth, extra_info, **kwargs).

Code Reference

Field Value
Source Location examples/sglang_multiturn/gsm8k_toolcall_shaping/gsm8k_toolcall_shaping.py, Lines 23-59
Signatures See below
Import from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import compute_score

Primary function:

def toolcall_shaping_reward(
    data_source: Optional[str],
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[dict[str, Any]] = None,
    *,
    method: str = "strict",
    format_score: float = 0.1,
    score: float = 1.0,
    shaping_reward: float = 0.1,
    trigger_substring: str = "<tool_call>",
    **kwargs,
) -> float

Wrapper function (standard verl interface):

def compute_score(
    data_source: Optional[str],
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[dict[str, Any]] = None,
    **kwargs,
) -> float

I/O Contract

Inputs

Parameter Type Default Description
data_source Optional[str] -- Dataset identifier (e.g., "openai/gsm8k").
solution_str str -- The model's full generated response text.
ground_truth str -- The expected correct answer extracted from the dataset.
extra_info Optional[dict] None Additional metadata from the dataset row.
method str "strict" Scoring method passed to gsm8k_compute_score.
format_score float 0.1 Bonus for correct output format.
score float 1.0 Score for correct answer.
shaping_reward float 0.1 Bonus for including the trigger substring.
trigger_substring str "<tool_call>" Substring whose presence triggers the shaping bonus.

Outputs

Return Type Description
reward float Combined reward: base GSM8K score + tool-call shaping bonus.

Usage Examples

Direct invocation:

from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import (
    compute_score,
    toolcall_shaping_reward,
)

# Model used a tool call and got the right answer
solution_with_tool = (
    "Let me calculate this.\n<tool_call>\n"
    '{"name": "calculate", "arguments": {"expression": "15 * 4"}}\n'
    "</tool_call>\nThe answer is #### 60"
)
reward = compute_score(
    data_source="openai/gsm8k",
    solution_str=solution_with_tool,
    ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.1 (tool_call) = 1.2
print(f"Reward with tool call: {reward}")

# Model got the right answer without using a tool
solution_no_tool = "15 times 4 equals 60.\n#### 60"
reward = compute_score(
    data_source="openai/gsm8k",
    solution_str=solution_no_tool,
    ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.0 (no tool_call) = 1.1
print(f"Reward without tool call: {reward}")

# Custom shaping reward value
reward = toolcall_shaping_reward(
    data_source="openai/gsm8k",
    solution_str=solution_with_tool,
    ground_truth="60",
    shaping_reward=0.5,
    trigger_substring="<tool_call>",
)
print(f"Reward with higher shaping: {reward}")

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment