Implementation:Volcengine Verl Toolcall Shaping Reward
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, GSM8K toolcall shaping example |
| Domains | Reward Shaping, Trajectory Scoring, Tool-Use Incentivization |
| Last Updated | 2026-02-07 |
Overview
Description
This module provides a trajectory-level reward shaping function designed to incentivize the model to use tool calls during GSM8K math problem solving. It combines three reward components:
- base_score -- The standard GSM8K correctness reward computed by
gsm8k_compute_score, which checks whether the model's answer matches the ground truth. - format_score -- A bonus awarded when the model's output follows the expected answer format (e.g., containing
"#### "). - shaping_reward -- An additional bonus awarded when the model includes the
trigger_substring(default:"<tool_call>") anywhere in its output, encouraging tool usage.
The final reward is: base_score + shaping_reward_bonus.
The compute_score wrapper function provides the standard interface expected by verl's reward configuration system, delegating to toolcall_shaping_reward with default parameters.
Usage
Reference this reward function in the trainer YAML config under custom_reward_function.path or reward_fn. The function signature matches verl's expected reward function interface: compute_score(data_source, solution_str, ground_truth, extra_info, **kwargs).
Code Reference
| Field | Value |
|---|---|
| Source Location | examples/sglang_multiturn/gsm8k_toolcall_shaping/gsm8k_toolcall_shaping.py, Lines 23-59
|
| Signatures | See below |
| Import | from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import compute_score
|
Primary function:
def toolcall_shaping_reward(
data_source: Optional[str],
solution_str: str,
ground_truth: str,
extra_info: Optional[dict[str, Any]] = None,
*,
method: str = "strict",
format_score: float = 0.1,
score: float = 1.0,
shaping_reward: float = 0.1,
trigger_substring: str = "<tool_call>",
**kwargs,
) -> float
Wrapper function (standard verl interface):
def compute_score(
data_source: Optional[str],
solution_str: str,
ground_truth: str,
extra_info: Optional[dict[str, Any]] = None,
**kwargs,
) -> float
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
data_source |
Optional[str] |
-- | Dataset identifier (e.g., "openai/gsm8k").
|
solution_str |
str |
-- | The model's full generated response text. |
ground_truth |
str |
-- | The expected correct answer extracted from the dataset. |
extra_info |
Optional[dict] |
None |
Additional metadata from the dataset row. |
method |
str |
"strict" |
Scoring method passed to gsm8k_compute_score. |
format_score |
float |
0.1 |
Bonus for correct output format. |
score |
float |
1.0 |
Score for correct answer. |
shaping_reward |
float |
0.1 |
Bonus for including the trigger substring. |
trigger_substring |
str |
"<tool_call>" |
Substring whose presence triggers the shaping bonus. |
Outputs
| Return | Type | Description |
|---|---|---|
| reward | float |
Combined reward: base GSM8K score + tool-call shaping bonus. |
Usage Examples
Direct invocation:
from examples.sglang_multiturn.gsm8k_toolcall_shaping.gsm8k_toolcall_shaping import (
compute_score,
toolcall_shaping_reward,
)
# Model used a tool call and got the right answer
solution_with_tool = (
"Let me calculate this.\n<tool_call>\n"
'{"name": "calculate", "arguments": {"expression": "15 * 4"}}\n'
"</tool_call>\nThe answer is #### 60"
)
reward = compute_score(
data_source="openai/gsm8k",
solution_str=solution_with_tool,
ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.1 (tool_call) = 1.2
print(f"Reward with tool call: {reward}")
# Model got the right answer without using a tool
solution_no_tool = "15 times 4 equals 60.\n#### 60"
reward = compute_score(
data_source="openai/gsm8k",
solution_str=solution_no_tool,
ground_truth="60",
)
# reward = 1.0 (correct) + 0.1 (format) + 0.0 (no tool_call) = 1.1
print(f"Reward without tool call: {reward}")
# Custom shaping reward value
reward = toolcall_shaping_reward(
data_source="openai/gsm8k",
solution_str=solution_with_tool,
ground_truth="60",
shaping_reward=0.5,
trigger_substring="<tool_call>",
)
print(f"Reward with higher shaping: {reward}")