Principle: Volcengine Verl Multi-Turn Rollout
| Knowledge Sources | |
|---|---|
| Domains | Agentic_AI, Reinforcement_Learning, Inference |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A state-machine-based rollout process where the model generates responses across multiple conversation turns, with tool execution between turns, producing complete trajectories for RL training.
Description
Multi-Turn Rollout extends standard single-turn rollout generation to support agentic workflows. Instead of generating a single response per prompt, the rollout engine manages a conversation loop:
- Model generates a response (may include tool calls)
- Tool calls are parsed and executed
- Tool results are appended to the conversation
- Model generates the next response
- Process repeats until termination (max turns or no more tool calls)
The implementation uses a state machine with states: PENDING → GENERATING → PROCESSING_TOOLS → ... → TERMINATED. Each trajectory records the full conversation history with a response mask that distinguishes model-generated tokens (mask=1, trainable) from tool-response tokens (mask=0, not trainable).
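The response mask can be illustrated with a small sketch. The segment representation and helper name below are illustrative assumptions, not verl's actual API:

```python
def build_response_mask(segments):
    """Build a token-level response mask for a multi-turn trajectory.

    segments: list of (token_count, source) pairs, where source is
    'model' for model-generated tokens or 'tool' for tool-response tokens.
    Segment labels and this helper are illustrative, not verl's API.
    """
    mask = []
    for length, source in segments:
        bit = 1 if source == "model" else 0  # model tokens are trainable
        mask.extend([bit] * length)
    return mask

# Turn 1: model emits 4 tokens (including a tool call), the tool returns
# 3 tokens, then the model emits 2 more tokens in turn 2.
mask = build_response_mask([(4, "model"), (3, "tool"), (2, "model")])
# mask == [1, 1, 1, 1, 0, 0, 0, 1, 1]
```

Concatenating the per-segment bits this way keeps the mask aligned with the flattened token sequence of the whole conversation.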
Usage
Use multi-turn rollout when training models for:
- Tool-calling with external tools (calculators, code interpreters, search)
- Multi-step reasoning that requires environmental feedback
- Any agentic task where the model interacts with the world
Multi-turn rollout requires SGLang as the inference engine (actor_rollout_ref.rollout.name=sglang).
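A hedged sketch of how this might look on the command line. Only `actor_rollout_ref.rollout.name=sglang` comes from the text above; the entry-point module and the `multi_turn.*` keys are assumptions that should be checked against the verl documentation for your version:

```shell
# Illustrative launch flags; verify exact key names against your verl version.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.multi_turn.enable=True \
    actor_rollout_ref.rollout.multi_turn.max_turns=8
```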
Theoretical Basis
Multi-turn rollout generates extended trajectories:
# Abstract multi-turn rollout state machine
state = PENDING
turns = 0
conversation = [system_msg, user_msg]
while state != TERMINATED:
    if state == PENDING:
        state = GENERATING
    elif state == GENERATING:
        response = model.generate(conversation)
        conversation.append(response)
        turns += 1
        if has_tool_calls(response):
            state = PROCESSING_TOOLS
        else:
            state = TERMINATED
    elif state == PROCESSING_TOOLS:
        tool_results = execute_tools(response.tool_calls)
        conversation.append(tool_results)
        if turns < max_turns:
            state = GENERATING
        else:
            state = TERMINATED
The response mask ensures only model-generated tokens receive gradient updates:
- Model tokens: mask = 1 (trainable)
- Tool response tokens: mask = 0 (not trainable)
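As a sketch of why the mask matters during training, a per-token loss can be zeroed on tool-response tokens before averaging, so only model-generated tokens contribute gradient. The function below is illustrative, not verl's implementation:

```python
def masked_mean(per_token_loss, response_mask):
    """Average per-token loss over trainable (mask=1) tokens only.

    Tool-response tokens (mask=0) contribute nothing, so no gradient
    flows through them. Illustrative sketch, not verl's implementation.
    """
    total = sum(l * m for l, m in zip(per_token_loss, response_mask))
    count = sum(response_mask)
    return total / max(count, 1)  # guard against an all-zero mask

loss = masked_mean([0.5, 1.0, 2.0, 3.0], [1, 1, 0, 0])
# Only the first two tokens count: (0.5 + 1.0) / 2 == 0.75
```

Normalizing by the count of mask=1 tokens (rather than sequence length) keeps the loss scale comparable across trajectories with different amounts of tool output.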