
Implementation:Alibaba ROLL VllmStrategy Generate

From Leeroopedia


Knowledge Sources
Domains: LLM_Inference, Reinforcement_Learning
Last Updated: 2026-02-07 20:00 GMT

Overview

A concrete vLLM-based inference strategy for high-throughput LLM response generation, provided by the Alibaba ROLL library.

Description

The VllmStrategy.generate method wraps vLLM's async inference engine to generate completions during RL rollouts. It handles input unpadding, sampling parameter configuration, LoRA adapter requests, beam search mode selection, and output padding/concatenation. The method operates asynchronously using Python's asyncio for non-blocking generation across large batches.
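The sampling-parameter configuration step can be illustrated with a small standalone helper. This is a sketch, not ROLL's actual code; the keyword names (`max_tokens`, `n`, `best_of`, `use_beam_search`) follow vLLM's `SamplingParams` conventions and should be treated as assumptions, as should the defaults:

```python
def to_sampling_kwargs(generation_config: dict) -> dict:
    """Illustrative mapping from a ROLL-style generation_config to
    vLLM SamplingParams keyword arguments (names are assumptions)."""
    num_beams = generation_config.get("num_beams", 1)
    kwargs = {
        "max_tokens": generation_config["max_new_tokens"],
        "n": generation_config.get("num_return_sequences", 1),
    }
    if num_beams > 1:
        # beam search mode: deterministic decoding, beam width = num_beams
        kwargs.update(use_beam_search=True, best_of=num_beams, temperature=0.0)
    else:
        # regular sampling mode: temperature / nucleus sampling
        kwargs.update(
            temperature=generation_config.get("temperature", 1.0),
            top_p=generation_config.get("top_p", 1.0),
        )
    return kwargs
```

The key branch is the beam-search mode selection mentioned above: `num_beams > 1` switches the request from stochastic sampling to deterministic beam decoding.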

Usage

This strategy is used as the inference backend for actor_infer clusters in RLVR pipelines. It is selected by setting the inference strategy to "vllm" in the worker configuration.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/distributed/strategy/vllm_strategy.py
  • Lines: L140-260

Signature

async def generate(
    self,
    batch: DataProto,
    generation_config: Dict
) -> torch.Tensor:
    """
    Generate continuations using vLLM inference engine.

    Args:
        batch: DataProto containing input_ids and attention_mask
        generation_config: Dict with max_new_tokens, temperature, top_p,
                          num_return_sequences, num_beams, etc.

    Returns:
        torch.Tensor: Output token IDs (bs * num_return_sequences, total_length)
    """

Import

from roll.distributed.strategy.vllm_strategy import VllmStrategy

I/O Contract

Inputs

Name               Type       Required  Description
batch              DataProto  Yes       Contains input_ids and attention_mask tensors
generation_config  Dict       Yes       Generation parameters (max_new_tokens, temperature, top_p, num_return_sequences)

Outputs

Name        Type          Description
output_ids  torch.Tensor  Generated token IDs, shape (bs * num_return_sequences, prompt_len + max_new_tokens)
DataProto   DataProto     Contains input_ids, attention_mask, response_mask, prompt_mask, old_log_probs
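The mask fields listed above distinguish prompt tokens, generated tokens, and padding in each output row. A minimal pure-Python illustration of that convention (not ROLL code; the right-padded layout `[prompt | response | padding]` is an assumption):

```python
def build_masks(prompt_len: int, response_len: int, total_len: int):
    """Illustrative masks for one padded output row.
    Layout assumption: [prompt tokens | response tokens | right padding]."""
    pad_len = total_len - prompt_len - response_len
    # attention_mask covers every real (non-padding) token
    attention_mask = [1] * (prompt_len + response_len) + [0] * pad_len
    # prompt_mask covers only the prompt positions
    prompt_mask = [1] * prompt_len + [0] * (response_len + pad_len)
    # response_mask covers only the generated positions
    response_mask = [0] * prompt_len + [1] * response_len + [0] * pad_len
    return attention_mask, prompt_mask, response_mask
```

In RL training, response_mask is what restricts the loss (and old_log_probs) to generated tokens only.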

Usage Examples

Standard Generation

from roll.distributed.scheduler.protocol import DataProto

# Prepare input batch
batch = DataProto.from_dict(tensors={
    "input_ids": input_ids,        # (batch_size, prompt_len)
    "attention_mask": attention_mask # (batch_size, prompt_len)
})

# Configure generation
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "num_return_sequences": 8,  # 8 samples per prompt for GRPO
}

# Generate (called via cluster dispatch)
output_ids = await vllm_strategy.generate(batch, generation_config)
# output_ids shape: (batch_size * 8, prompt_len + 512)
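Because generate is a coroutine, callers outside ROLL's cluster dispatch must drive it on an event loop (e.g., with asyncio.run). The stub below is not ROLL code; only asyncio.run and the output-shape contract from the signature above are taken from this page:

```python
import asyncio

class StubStrategy:
    """Stand-in with the same output-shape contract as VllmStrategy.generate."""
    async def generate(self, batch, generation_config):
        bs, prompt_len = batch["input_ids_shape"]  # e.g. (4, 128)
        n = generation_config.get("num_return_sequences", 1)
        total_len = prompt_len + generation_config["max_new_tokens"]
        # the real method returns token IDs; here we return the shape only
        return (bs * n, total_len)

batch = {"input_ids_shape": (4, 128)}
cfg = {"max_new_tokens": 512, "num_return_sequences": 8}
shape = asyncio.run(StubStrategy().generate(batch, cfg))  # (32, 640)
```

With 4 prompts and 8 samples per prompt, the result holds 32 rows of length 128 + 512 = 640, matching the shape comment in the example above.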

