Principle: Volcengine Verl Multi-Turn Rollout
| Knowledge Sources | |
|---|---|
| Domains | Agentic_AI, Reinforcement_Learning, Inference |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A state-machine-based rollout process where the model generates responses across multiple conversation turns, with tool execution between turns, producing complete trajectories for RL training.
Description
Multi-Turn Rollout extends standard single-turn rollout generation to support agentic workflows. Instead of generating a single response per prompt, the rollout engine manages a conversation loop:
- Model generates a response (may include tool calls)
- Tool calls are parsed and executed
- Tool results are appended to the conversation
- Model generates the next response
- Process repeats until termination (max turns or no more tool calls)
The implementation uses a state machine with states: PENDING → GENERATING → PROCESSING_TOOLS → ... → TERMINATED. Each trajectory records the full conversation history with a response mask that distinguishes model-generated tokens (mask=1, trainable) from tool-response tokens (mask=0, not trainable).
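The response mask can be illustrated with a small sketch. The segment representation and helper name below are illustrative assumptions, not verl's actual API:

```python
def build_response_mask(segments):
    """Build a token-level response mask for a multi-turn trajectory.

    segments: list of (token_count, source) pairs, where source is
    'model' for model-generated tokens or 'tool' for tool-response tokens.
    Segment labels and this helper are illustrative, not verl's API.
    """
    mask = []
    for length, source in segments:
        bit = 1 if source == "model" else 0  # model tokens are trainable
        mask.extend([bit] * length)
    return mask

# Turn 1: model emits 4 tokens (including a tool call), the tool returns
# 3 tokens, then the model emits 2 more tokens in turn 2.
mask = build_response_mask([(4, "model"), (3, "tool"), (2, "model")])
# mask == [1, 1, 1, 1, 0, 0, 0, 1, 1]
```

Concatenating the per-segment bits this way keeps the mask aligned with the flattened token sequence of the whole conversation.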
Usage
Use multi-turn rollout when training models for:
- Tool-calling with external tools (calculators, code interpreters, search)
- Multi-step reasoning that requires environmental feedback
- Any agentic task where the model interacts with the world
Multi-turn rollout requires SGLang as the inference engine (actor_rollout_ref.rollout.name=sglang).
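A hedged sketch of how this might look on the command line. Only `actor_rollout_ref.rollout.name=sglang` comes from the text above; the entry-point module and the `multi_turn.*` keys are assumptions that should be checked against the verl documentation for your version:

```shell
# Illustrative launch flags; verify exact key names against your verl version.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.multi_turn.enable=True \
    actor_rollout_ref.rollout.multi_turn.max_turns=8
```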
Theoretical Basis
Multi-turn rollout generates extended trajectories:
# Abstract multi-turn rollout state machine
state = PENDING
turns = 0
conversation = [system_msg, user_msg]
while state != TERMINATED:
    if state == PENDING:
        state = GENERATING
    elif state == GENERATING:
        response = model.generate(conversation)
        conversation.append(response)
        turns += 1
        if has_tool_calls(response):
            state = PROCESSING_TOOLS
        else:
            state = TERMINATED
    elif state == PROCESSING_TOOLS:
        tool_results = execute_tools(response.tool_calls)
        conversation.append(tool_results)
        if turns < max_turns:
            state = GENERATING
        else:
            state = TERMINATED
The response mask ensures only model-generated tokens receive gradient updates:
- Model tokens: mask = 1 (trainable)
- Tool response tokens: mask = 0 (not trainable)
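As a sketch of why the mask matters during training, a per-token loss can be zeroed on tool-response tokens before averaging, so only model-generated tokens contribute gradient. The function below is illustrative, not verl's implementation:

```python
def masked_mean(per_token_loss, response_mask):
    """Average per-token loss over trainable (mask=1) tokens only.

    Tool-response tokens (mask=0) contribute nothing, so no gradient
    flows through them. Illustrative sketch, not verl's implementation.
    """
    total = sum(l * m for l, m in zip(per_token_loss, response_mask))
    count = sum(response_mask)
    return total / max(count, 1)  # guard against an all-zero mask

loss = masked_mean([0.5, 1.0, 2.0, 3.0], [1, 1, 0, 0])
# Only the first two tokens count: (0.5 + 1.0) / 2 == 0.75
```

Normalizing by the count of mask=1 tokens (rather than sequence length) keeps the loss scale comparable across trajectories with different amounts of tool output.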