
Implementation:Volcengine Verl ToolAgentLoop Run

From Leeroopedia


  • Knowledge Sources: verl source code, agent loop module
  • Domains: Multi-Turn Rollout, Agent Loop, Tool Calling
  • Last Updated: 2026-02-07

Overview

Description

ToolAgentLoop is the core agent loop class that orchestrates multi-turn rollout with tool calling. It inherits from AgentLoopBase and implements an explicit state machine that drives the conversation between the language model and external tools.

The state machine transitions are:

  • PENDING -- Prepare the initial prompt by applying the chat template with tool schemas, then transition to GENERATING.
  • GENERATING -- Send prompt tokens to the LLM server, collect generated tokens. Check termination conditions (response length, max turns). If tool calls are detected in the response, transition to PROCESSING_TOOLS. If an interaction is configured, transition to INTERACTING. Otherwise, transition to TERMINATED.
  • PROCESSING_TOOLS -- Execute each detected tool call (up to max_parallel_calls), collect tool responses, tokenize tool response messages with mask value 0 (non-trainable), and append to the prompt. Transition back to GENERATING.
  • INTERACTING -- Get user input from an interaction module, append as a user message with mask value 0, and transition back to GENERATING.
  • TERMINATED -- Finalize the output by splitting prompt and response IDs.
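The transitions above can be sketched as a pure function over an enum. This is an illustrative mirror of the prose, not the actual verl implementation; state names follow the description, and the boolean flags (has_tool_calls, has_interaction, terminated) are hypothetical stand-ins for the real termination checks.

```python
from enum import Enum, auto

class AgentState(Enum):
    """Hypothetical mirror of the five states described above."""
    PENDING = auto()
    GENERATING = auto()
    PROCESSING_TOOLS = auto()
    INTERACTING = auto()
    TERMINATED = auto()

def step(state, has_tool_calls=False, has_interaction=False, terminated=False):
    """Pure transition function implementing the rules listed above."""
    if state is AgentState.PENDING:
        # Prompt prepared with chat template + tool schemas.
        return AgentState.GENERATING
    if state is AgentState.GENERATING:
        if terminated:  # response length or max turns exceeded
            return AgentState.TERMINATED
        if has_tool_calls:
            return AgentState.PROCESSING_TOOLS
        if has_interaction:
            return AgentState.INTERACTING
        return AgentState.TERMINATED
    if state is AgentState.PROCESSING_TOOLS:
        # Tool responses appended with mask=0, then generate again.
        return AgentState.GENERATING
    if state is AgentState.INTERACTING:
        return AgentState.TERMINATED if terminated else AgentState.GENERATING
    return AgentState.TERMINATED
```

Modeling the transitions as a side-effect-free function makes the loop's termination conditions easy to reason about and test in isolation.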

The response_mask is the critical output: model-generated tokens receive mask value 1 (included in policy gradient), while tool response tokens and interaction tokens receive mask value 0 (excluded from gradient computation). This selective masking is what enables credit assignment in multi-turn RL training.
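A minimal illustration of this masking, using made-up token IDs: segments are concatenated in conversation order, and the mask marks which positions contribute to the loss.

```python
# Illustrative only: token IDs are arbitrary placeholders.
model_tokens_1 = [101, 102, 103]  # assistant turn 1 (trainable, mask=1)
tool_tokens = [201, 202]          # tool response (not trainable, mask=0)
model_tokens_2 = [104, 105]       # assistant turn 2 (trainable, mask=1)

response_ids = model_tokens_1 + tool_tokens + model_tokens_2
response_mask = (
    [1] * len(model_tokens_1) + [0] * len(tool_tokens) + [1] * len(model_tokens_2)
)

# Only positions with mask=1 contribute to the policy-gradient loss.
trainable = [tok for tok, m in zip(response_ids, response_mask) if m == 1]
```

Here `trainable` recovers exactly the model-generated tokens, which is the credit-assignment property the mask exists to provide.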

Usage

ToolAgentLoop is instantiated once per rollout worker and its run method is called for each prompt in the batch. It is registered under the name "tool_agent" and can be selected via the trainer configuration.

Code Reference

  • Source Location: verl/experimental/agent_loop/tool_agent_loop.py, Lines 95-476
  • Class: ToolAgentLoop(AgentLoopBase)
  • Method Signature: async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput
  • Import: from verl.experimental.agent_loop.tool_agent_loop import ToolAgentLoop

Constructor:

ToolAgentLoop(
    trainer_config: DictConfigWrap,
    server_manager: AsyncLLMServerManager,
    tokenizer: AutoTokenizer,
    processor: AutoProcessor,
    **kwargs,
)

I/O Contract

Inputs

  • sampling_params (dict[str, Any]) -- Sampling parameters passed to the LLM server (temperature, top_p, max_tokens, etc.).
  • kwargs["raw_prompt"] (list[dict]) -- The initial conversation messages in OpenAI chat format.
  • kwargs["tools_kwargs"] (dict[str, Any]) -- Optional per-tool keyword arguments for tool creation and execution.
  • kwargs["extra_info"] (dict) -- Extra information, including interaction_kwargs for interaction-based rollouts.

Outputs

  • prompt_ids (list[int]) -- Token IDs for the initial prompt (before any generation).
  • response_ids (list[int]) -- Token IDs for the full response, including both model-generated and tool-response tokens.
  • response_mask (list[int]) -- Binary mask: 1 for model-generated tokens, 0 for tool/interaction response tokens.
  • response_logprobs (list[float] or None) -- Log probabilities for each response token (0.0 for tool tokens).
  • multi_modal_data (dict or None) -- Images and videos accumulated during tool calls.
  • num_turns (int) -- Total number of conversation turns (user + assistant + 1).
  • metrics (dict) -- Performance metrics including generation and tool-call timings.
  • extra_fields (dict) -- Contains turn_scores and tool_rewards lists.
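The output contract can be captured as a dataclass. This is a hypothetical sketch mirroring the fields listed above; the real AgentLoopOutput in verl may be defined differently (different base class, validation, or defaults).

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentLoopOutputSketch:
    """Hypothetical mirror of the output fields documented above."""
    prompt_ids: list[int]
    response_ids: list[int]
    response_mask: list[int]
    response_logprobs: Optional[list[float]] = None
    multi_modal_data: Optional[dict] = None
    num_turns: int = 0
    metrics: dict = field(default_factory=dict)
    extra_fields: dict = field(default_factory=dict)

out = AgentLoopOutputSketch(
    prompt_ids=[1, 2],
    response_ids=[3, 4, 5],
    response_mask=[1, 0, 1],
    num_turns=2,
)
# Invariant implied by the contract: one mask entry per response token.
assert len(out.response_mask) == len(out.response_ids)
```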

Usage Examples

Running a single multi-turn rollout:

from verl.experimental.agent_loop.tool_agent_loop import ToolAgentLoop

# Assume trainer_config, server_manager, tokenizer, processor are initialized
agent_loop = ToolAgentLoop(
    trainer_config=trainer_config,
    server_manager=server_manager,
    tokenizer=tokenizer,
    processor=processor,
)

# Run the agent loop for a single prompt
sampling_params = {"temperature": 0.7, "top_p": 0.9, "max_tokens": 1024}
output = await agent_loop.run(
    sampling_params=sampling_params,
    raw_prompt=[{"role": "user", "content": "What is 123 * 456?"}],
    tools_kwargs={},
)

# output.response_mask distinguishes model tokens (1) from tool responses (0)
print(f"Prompt length: {len(output.prompt_ids)}")
print(f"Response length: {len(output.response_ids)}")
print(f"Model tokens: {sum(output.response_mask)}")
print(f"Tool tokens: {len(output.response_mask) - sum(output.response_mask)}")
print(f"Total turns: {output.num_turns}")

State machine flow illustration:

# The internal state machine transitions:
# 1. PENDING -> GENERATING (apply chat template, prepare prompt_ids)
# 2. GENERATING -> PROCESSING_TOOLS (model output contains tool calls)
#    or GENERATING -> INTERACTING (interaction configured, no tool calls)
#    or GENERATING -> TERMINATED (no tool calls, no interaction)
# 3. PROCESSING_TOOLS -> GENERATING (tool responses appended with mask=0)
# 4. INTERACTING -> GENERATING (user response appended with mask=0)
#    or INTERACTING -> TERMINATED (interaction signals termination)
# 5. Loop continues until TERMINATED or max turns/length reached
