
Principle:Volcengine Verl Multi Turn Rollout

From Leeroopedia


Knowledge Sources
Domains Agentic_AI, Reinforcement_Learning, Inference
Last Updated 2026-02-07 14:00 GMT

Overview

A state-machine-based rollout process where the model generates responses across multiple conversation turns, with tool execution between turns, producing complete trajectories for RL training.

Description

Multi-Turn Rollout extends standard single-turn rollout generation to support agentic workflows. Instead of generating a single response per prompt, the rollout engine manages a conversation loop:

  1. Model generates a response (may include tool calls)
  2. Tool calls are parsed and executed
  3. Tool results are appended to the conversation
  4. Model generates the next response
  5. Process repeats until termination (max turns or no more tool calls)
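The loop above can be sketched end to end with a stubbed model and a single "calculator" tool. All helpers here (`fake_model`, `run_tool`) are hypothetical stand-ins for illustration, not verl APIs:

```python
# Illustrative sketch of the multi-turn loop. The model stub calls the
# calculator tool once, then produces a final answer with no tool calls.
MAX_TURNS = 4

def fake_model(conversation):
    """Stub model: issues one tool call, then answers (hypothetical)."""
    if not any(m["role"] == "tool" for m in conversation):
        return {"role": "assistant", "content": None,
                "tool_calls": [{"name": "calculator", "args": {"expr": "2+2"}}]}
    return {"role": "assistant", "content": "The answer is 4.", "tool_calls": []}

def run_tool(call):
    """Stub tool executor for the calculator (hypothetical)."""
    return {"role": "tool", "content": str(eval(call["args"]["expr"]))}

conversation = [{"role": "system", "content": "You are helpful."},
                {"role": "user", "content": "What is 2+2?"}]

for _ in range(MAX_TURNS):                   # step 5: bounded by max turns
    response = fake_model(conversation)      # step 1: model generates
    conversation.append(response)
    if not response["tool_calls"]:           # step 5: no tool calls -> stop
        break
    for call in response["tool_calls"]:      # steps 2-3: execute, append result
        conversation.append(run_tool(call))
# step 4 is the next loop iteration; conversation now holds the full trajectory
```

The resulting `conversation` list is the trajectory that would be tokenized and masked for training.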

The implementation uses a state machine with states: PENDING → GENERATING → PROCESSING_TOOLS → ... → TERMINATED. Each trajectory records the full conversation history with a response mask that distinguishes model-generated tokens (mask=1, trainable) from tool-response tokens (mask=0, not trainable).

Usage

Use multi-turn rollout when training models for:

  • Tool-calling with external tools (calculators, code interpreters, search)
  • Multi-step reasoning that requires environmental feedback
  • Any agentic task where the model interacts with the world

Multi-turn rollout requires SGLang as the inference engine (actor_rollout_ref.rollout.name=sglang).
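A typical way to enable this is through verl's Hydra-style command-line overrides. The `rollout.name=sglang` setting comes from the text above; the `multi_turn.*` flag names are an assumption based on verl's multi-turn configuration and should be checked against your verl version:

```shell
# Hedged sketch: enabling SGLang multi-turn rollout via Hydra-style overrides.
# multi_turn.enable and multi_turn.max_turns are assumed flag names.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.multi_turn.enable=true \
    actor_rollout_ref.rollout.multi_turn.max_turns=8
```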

Theoretical Basis

Multi-turn rollout generates extended trajectories:

# Abstract multi-turn rollout state machine
state = PENDING
turns = 0
conversation = [system_msg, user_msg]
while state != TERMINATED:
    if state == PENDING:
        state = GENERATING
    elif state == GENERATING:
        response = model.generate(conversation)
        conversation.append(response)
        turns += 1
        if has_tool_calls(response):
            state = PROCESSING_TOOLS
        else:
            state = TERMINATED
    elif state == PROCESSING_TOOLS:
        tool_results = execute_tools(response.tool_calls)
        conversation.append(tool_results)
        if turns < max_turns:
            state = GENERATING
        else:
            state = TERMINATED

The response mask ensures only model-generated tokens receive gradient updates:

  • Model tokens: mask = 1 (trainable)
  • Tool response tokens: mask = 0 (not trainable)
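The effect of the mask on training can be sketched as a masked mean over per-token losses. This is a plain-Python illustration of the principle, not verl's actual loss code:

```python
# Illustrative masked loss: only tokens with mask=1 (model-generated)
# contribute to the average; tool-response tokens (mask=0) are ignored.
def masked_mean(per_token_loss, response_mask):
    total = sum(l * m for l, m in zip(per_token_loss, response_mask))
    count = sum(response_mask)
    return total / max(count, 1)  # guard against an all-zero mask

# Two model tokens (losses 2.0 and 4.0) and two tool-response tokens:
loss = masked_mean([2.0, 4.0, 9.0, 9.0], [1, 1, 0, 0])
```

Here the tool-response losses (9.0) never reach the gradient; only the model-generated tokens are averaged.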

Related Pages

Implemented By
