
Implementation:Volcengine Verl ToolAgentLoop Run

From Leeroopedia


  • Knowledge Sources: verl source code, agent loop module
  • Domains: Multi-Turn Rollout, Agent Loop, Tool Calling
  • Last Updated: 2026-02-07

Overview

Description

ToolAgentLoop is the core agent loop class that orchestrates multi-turn rollout with tool calling. It inherits from AgentLoopBase and implements an explicit state machine that drives the conversation between the language model and external tools.

The state machine transitions are:

  • PENDING -- Prepare the initial prompt by applying the chat template with tool schemas, then transition to GENERATING.
  • GENERATING -- Send prompt tokens to the LLM server, collect generated tokens. Check termination conditions (response length, max turns). If tool calls are detected in the response, transition to PROCESSING_TOOLS. If an interaction is configured, transition to INTERACTING. Otherwise, transition to TERMINATED.
  • PROCESSING_TOOLS -- Execute each detected tool call (up to max_parallel_calls), collect tool responses, tokenize tool response messages with mask value 0 (non-trainable), and append to the prompt. Transition back to GENERATING.
  • INTERACTING -- Get user input from an interaction module, append as a user message with mask value 0, and transition back to GENERATING.
  • TERMINATED -- Finalize the output by splitting prompt and response IDs.
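The transitions above can be sketched as a pure function over an enum. This is an illustrative mirror of the prose, not the actual verl implementation; state names follow the description, and the boolean flags (has_tool_calls, has_interaction, terminated) are hypothetical stand-ins for the real termination checks.

```python
from enum import Enum, auto

class AgentState(Enum):
    """Hypothetical mirror of the five states described above."""
    PENDING = auto()
    GENERATING = auto()
    PROCESSING_TOOLS = auto()
    INTERACTING = auto()
    TERMINATED = auto()

def step(state, has_tool_calls=False, has_interaction=False, terminated=False):
    """Pure transition function implementing the rules listed above."""
    if state is AgentState.PENDING:
        # Prompt prepared with chat template + tool schemas.
        return AgentState.GENERATING
    if state is AgentState.GENERATING:
        if terminated:  # response length or max turns exceeded
            return AgentState.TERMINATED
        if has_tool_calls:
            return AgentState.PROCESSING_TOOLS
        if has_interaction:
            return AgentState.INTERACTING
        return AgentState.TERMINATED
    if state is AgentState.PROCESSING_TOOLS:
        # Tool responses appended with mask=0, then generate again.
        return AgentState.GENERATING
    if state is AgentState.INTERACTING:
        return AgentState.TERMINATED if terminated else AgentState.GENERATING
    return AgentState.TERMINATED
```

Modeling the transitions as a side-effect-free function makes the loop's termination conditions easy to reason about and test in isolation.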

The response_mask is the critical output: model-generated tokens receive mask value 1 (included in policy gradient), while tool response tokens and interaction tokens receive mask value 0 (excluded from gradient computation). This selective masking is what enables credit assignment in multi-turn RL training.
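A minimal illustration of this masking, using made-up token IDs: segments are concatenated in conversation order, and the mask marks which positions contribute to the loss.

```python
# Illustrative only: token IDs are arbitrary placeholders.
model_tokens_1 = [101, 102, 103]  # assistant turn 1 (trainable, mask=1)
tool_tokens = [201, 202]          # tool response (not trainable, mask=0)
model_tokens_2 = [104, 105]       # assistant turn 2 (trainable, mask=1)

response_ids = model_tokens_1 + tool_tokens + model_tokens_2
response_mask = (
    [1] * len(model_tokens_1) + [0] * len(tool_tokens) + [1] * len(model_tokens_2)
)

# Only positions with mask=1 contribute to the policy-gradient loss.
trainable = [tok for tok, m in zip(response_ids, response_mask) if m == 1]
```

Here `trainable` recovers exactly the model-generated tokens, which is the credit-assignment property the mask exists to provide.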

Usage

ToolAgentLoop is instantiated once per rollout worker and its run method is called for each prompt in the batch. It is registered under the name "tool_agent" and can be selected via the trainer configuration.

Code Reference

  • Source Location: verl/experimental/agent_loop/tool_agent_loop.py, Lines 95-476
  • Class: ToolAgentLoop(AgentLoopBase)
  • Method Signature: async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput
  • Import: from verl.experimental.agent_loop.tool_agent_loop import ToolAgentLoop

Constructor:

ToolAgentLoop(
    trainer_config: DictConfigWrap,
    server_manager: AsyncLLMServerManager,
    tokenizer: AutoTokenizer,
    processor: AutoProcessor,
    **kwargs,
)

I/O Contract

Inputs

  • sampling_params (dict[str, Any]) -- Sampling parameters passed to the LLM server (temperature, top_p, max_tokens, etc.).
  • kwargs["raw_prompt"] (list[dict]) -- The initial conversation messages in OpenAI chat format.
  • kwargs["tools_kwargs"] (dict[str, Any]) -- Optional per-tool keyword arguments for tool creation and execution.
  • kwargs["extra_info"] (dict) -- Extra information, including interaction_kwargs for interaction-based rollouts.

Outputs

  • prompt_ids (list[int]) -- Token IDs for the initial prompt (before any generation).
  • response_ids (list[int]) -- Token IDs for the full response, including both model-generated and tool-response tokens.
  • response_mask (list[int]) -- Binary mask: 1 for model-generated tokens, 0 for tool/interaction response tokens.
  • response_logprobs (list[float] or None) -- Log probabilities for each response token (0.0 for tool tokens).
  • multi_modal_data (dict or None) -- Images and videos accumulated during tool calls.
  • num_turns (int) -- Total number of conversation turns (user + assistant + 1).
  • metrics (dict) -- Performance metrics including generation and tool-call timings.
  • extra_fields (dict) -- Contains turn_scores and tool_rewards lists.
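The output contract can be captured as a dataclass. This is a hypothetical sketch mirroring the fields listed above; the real AgentLoopOutput in verl may be defined differently (different base class, validation, or defaults).

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentLoopOutputSketch:
    """Hypothetical mirror of the output fields documented above."""
    prompt_ids: list[int]
    response_ids: list[int]
    response_mask: list[int]
    response_logprobs: Optional[list[float]] = None
    multi_modal_data: Optional[dict] = None
    num_turns: int = 0
    metrics: dict = field(default_factory=dict)
    extra_fields: dict = field(default_factory=dict)

out = AgentLoopOutputSketch(
    prompt_ids=[1, 2],
    response_ids=[3, 4, 5],
    response_mask=[1, 0, 1],
    num_turns=2,
)
# Invariant implied by the contract: one mask entry per response token.
assert len(out.response_mask) == len(out.response_ids)
```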

Usage Examples

Running a single multi-turn rollout:

from verl.experimental.agent_loop.tool_agent_loop import ToolAgentLoop

# Assume trainer_config, server_manager, tokenizer, processor are initialized
agent_loop = ToolAgentLoop(
    trainer_config=trainer_config,
    server_manager=server_manager,
    tokenizer=tokenizer,
    processor=processor,
)

# Run the agent loop for a single prompt
sampling_params = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}
output = await agent_loop.run(
    sampling_params=sampling_params,
    raw_prompt=[{"role": "user", "content": "What is 123 * 456?"}],
    tools_kwargs={},
)

# output.response_mask distinguishes model tokens (1) from tool responses (0)
print(f"Prompt length: {len(output.prompt_ids)}")
print(f"Response length: {len(output.response_ids)}")
print(f"Model tokens: {sum(output.response_mask)}")
print(f"Tool tokens: {len(output.response_mask) - sum(output.response_mask)}")
print(f"Total turns: {output.num_turns}")

State machine flow illustration:

# The internal state machine transitions:
# 1. PENDING -> GENERATING (apply chat template, prepare prompt_ids)
# 2. GENERATING -> PROCESSING_TOOLS (model output contains tool calls)
#    or GENERATING -> INTERACTING (interaction configured, no tool calls)
#    or GENERATING -> TERMINATED (no tool calls, no interaction)
# 3. PROCESSING_TOOLS -> GENERATING (tool responses appended with mask=0)
# 4. INTERACTING -> GENERATING (user response appended with mask=0)
#    or INTERACTING -> TERMINATED (interaction signals termination)
# 5. Loop continues until TERMINATED or max turns/length reached
