Implementation:Volcengine Verl Toolcall Shaping Reward

Field	Value
Knowledge Sources	verl source code, GSM8K toolcall shaping example
Domains	Reward Shaping, Trajectory Scoring, Tool-Use Incentivization
Last Updated	2026-02-07

Overview

Description

This module provides a trajectory-level reward shaping function designed to incentivize the model to use tool calls during GSM8K math problem solving. It combines three reward components:

base_score -- The standard GSM8K correctness reward computed by gsm8k_compute_score, which checks whether the model's answer matches the ground truth.
format_score -- A bonus awarded when the model's output follows the expected answer format (e.g., containing "#### ").
shaping_reward -- An additional bonus awarded when the model includes the trigger_substring (default: "<tool_call>") anywhere in its output, encouraging tool usage.

The final reward is: base_score + shaping_reward_bonus.

The compute_score wrapper function provides the standard interface expected by verl's reward configuration system, delegating to toolcall_shaping_reward with default parameters.

Usage

Reference this reward function in the trainer YAML config under custom_reward_function.path or reward_fn. The function signature matches verl's expected reward function interface: compute_score(data_source, solution_str, ground_truth, extra_info, **kwargs).

Code Reference

Field	Value
Source Location	`examples/sglang_multiturn/gsm8k_toolcall_shaping/gsm8k_toolcall_shaping.py`, Lines 23-59
Signatures	See below
Import	`from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import compute_score`

Primary function:

def toolcall_shaping_reward(
    data_source: Optional[str],
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[dict[str, Any]] = None,
    *,
    method: str = "strict",
    format_score: float = 0.1,
    score: float = 1.0,
    shaping_reward: float = 0.1,
    trigger_substring: str = "<tool_call>",
    **kwargs,
) -> float

Wrapper function (standard verl interface):

def compute_score(
    data_source: Optional[str],
    solution_str: str,
    ground_truth: str,
    extra_info: Optional[dict[str, Any]] = None,
    **kwargs,
) -> float

I/O Contract

Inputs

Parameter	Type	Default	Description
`data_source`	`Optional[str]`	--	Dataset identifier (e.g., `"openai/gsm8k"`).
`solution_str`	`str`	--	The model's full generated response text.
`ground_truth`	`str`	--	The expected correct answer extracted from the dataset.
`extra_info`	`Optional[dict]`	`None`	Additional metadata from the dataset row.
`method`	`str`	`"strict"`	Scoring method passed to gsm8k_compute_score.
`format_score`	`float`	`0.1`	Bonus for correct output format.
`score`	`float`	`1.0`	Score for correct answer.
`shaping_reward`	`float`	`0.1`	Bonus for including the trigger substring.
`trigger_substring`	`str`	`"<tool_call>"`	Substring whose presence triggers the shaping bonus.

Outputs

Return	Type	Description
reward	`float`	Combined reward: base GSM8K score + tool-call shaping bonus.

Usage Examples

Direct invocation:

from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import (
    compute_score,
    toolcall_shaping_reward,
)

# Model used a tool call and got the right answer
solution_with_tool = (
    "Let me calculate this.\n<tool_call>\n"
    '{"name": "calculate", "arguments": {"expression": "15 * 4"}}\n'
    "</tool_call>\nThe answer is #### 60"
)
reward = compute_score(
    data_source="openai/gsm8k",
    solution_str=solution_with_tool,
    ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.1 (tool_call) = 1.2
print(f"Reward with tool call: {reward}")

# Model got the right answer without using a tool
solution_no_tool = "15 times 4 equals 60.\n#### 60"
reward = compute_score(
    data_source="openai/gsm8k",
    solution_str=solution_no_tool,
    ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.0 (no tool_call) = 1.1
print(f"Reward without tool call: {reward}")

# Custom shaping reward value
reward = toolcall_shaping_reward(
    data_source="openai/gsm8k",
    solution_str=solution_with_tool,
    ground_truth="60",
    shaping_reward=0.5,
    trigger_substring="<tool_call>",
)
print(f"Reward with higher shaping: {reward}")

Related Pages

Principle:Volcengine_Verl_Trajectory_Reward_Shaping

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment