Implementation:Alibaba ROLL VllmStrategy Generate
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Reinforcement_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A concrete vLLM-backed inference strategy, provided by the Alibaba ROLL library, for high-throughput LLM response generation.
Description
The VllmStrategy.generate method wraps vLLM's async inference engine to generate completions during RL rollouts. It handles input unpadding, sampling parameter configuration, LoRA adapter requests, beam search mode selection, and output padding/concatenation. The method operates asynchronously using Python's asyncio for non-blocking generation across large batches.
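The unpadding and repadding steps mentioned above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`unpad_inputs`, `pad_outputs`), not ROLL's actual code: padded prompt tensors are stripped to variable-length token lists before being handed to vLLM, and the generated sequences are padded back to a uniform length afterwards.

```python
import torch

# Hypothetical helpers illustrating the unpad/repad steps around vLLM
# generation; not ROLL's actual implementation.

def unpad_inputs(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Strip padding from each row, yielding variable-length prompt lists."""
    return [ids[mask.bool()].tolist()
            for ids, mask in zip(input_ids, attention_mask)]

def pad_outputs(sequences, total_length: int, pad_token_id: int = 0):
    """Right-pad generated sequences to a common length and stack them."""
    out = torch.full((len(sequences), total_length), pad_token_id,
                     dtype=torch.long)
    for i, seq in enumerate(sequences):
        out[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return out
```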
Usage
This strategy is used as the inference backend for actor_infer clusters in RLVR pipelines. It is selected by setting the inference strategy to "vllm" in the worker configuration.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/distributed/strategy/vllm_strategy.py
- Lines: L140-260
Signature
```python
async def generate(
    self,
    batch: DataProto,
    generation_config: Dict
) -> torch.Tensor:
    """
    Generate continuations using the vLLM inference engine.

    Args:
        batch: DataProto containing input_ids and attention_mask.
        generation_config: Dict with max_new_tokens, temperature, top_p,
            num_return_sequences, num_beams, etc.

    Returns:
        torch.Tensor: Output token IDs, shape
            (bs * num_return_sequences, total_length).
    """
```
Import
```python
from roll.distributed.strategy.vllm_strategy import VllmStrategy
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch | DataProto | Yes | Contains input_ids and attention_mask tensors |
| generation_config | Dict | Yes | Generation parameters (max_new_tokens, temperature, top_p, num_return_sequences) |
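To illustrate how a `generation_config` dict like the one above might map onto vLLM sampling arguments, here is a sketch. The keyword names follow vLLM's public `SamplingParams` API (`n`, `temperature`, `top_p`, `max_tokens`); the function name and defaults are hypothetical, and the exact mapping inside `VllmStrategy` may differ (beam search, for example, takes a separate code path).

```python
def to_sampling_kwargs(generation_config: dict) -> dict:
    """Illustrative mapping from a generation_config dict to vLLM
    SamplingParams keyword arguments. Not ROLL's actual code."""
    return {
        "n": generation_config.get("num_return_sequences", 1),
        "temperature": generation_config.get("temperature", 1.0),
        "top_p": generation_config.get("top_p", 1.0),
        "max_tokens": generation_config.get("max_new_tokens", 16),
    }
```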
Outputs
| Name | Type | Description |
|---|---|---|
| output_ids | torch.Tensor | Generated token IDs, shape (bs * num_return_sequences, prompt_len + max_new_tokens) |
| DataProto | DataProto | Assembled batch containing input_ids, attention_mask, response_mask, prompt_mask, and old_log_probs |
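The `prompt_mask` and `response_mask` fields listed above distinguish prompt positions from generated positions in the concatenated sequences. A sketch of how such masks can be derived from a fixed prompt length (hypothetical helper, for illustration only):

```python
import torch

def build_masks(prompt_len: int, total_len: int, batch: int):
    """Illustrative construction of prompt/response masks: positions in
    [0, prompt_len) belong to the prompt, the rest to the response."""
    positions = torch.arange(total_len).expand(batch, total_len)
    prompt_mask = (positions < prompt_len).long()
    response_mask = (positions >= prompt_len).long()
    return prompt_mask, response_mask
```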
Usage Examples
Standard Generation
```python
from roll.distributed.scheduler.protocol import DataProto

# Prepare input batch
batch = DataProto.from_dict(tensors={
    "input_ids": input_ids,            # (batch_size, prompt_len)
    "attention_mask": attention_mask,  # (batch_size, prompt_len)
})

# Configure generation
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "num_return_sequences": 8,  # 8 samples per prompt for GRPO
}

# Generate (called via cluster dispatch)
output_ids = await vllm_strategy.generate(batch, generation_config)
# output_ids shape: (batch_size * 8, prompt_len + 512)
```
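Beam Search

The method also selects beam-search mode via `num_beams`, per the Description above. A hedged sketch of such a configuration (the exact beam-search behaviour depends on the installed vLLM version, and the field defaults here are illustrative):

```python
# Beam-search decoding (illustrative configuration)
generation_config = {
    "max_new_tokens": 256,
    "num_beams": 4,             # > 1 selects beam-search mode
    "num_return_sequences": 1,
}
output_ids = await vllm_strategy.generate(batch, generation_config)
```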
Related Pages
Implements Principle
Requires Environment
Environment Dependencies
This implementation requires the following environment constraints:
Heuristics Applied
This implementation uses the following heuristics: