Principle: Allenai Open Instruct vLLM Async Generation
| Knowledge Sources | |
|---|---|
| Domains | Inference Optimization, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
vLLM asynchronous generation is the technique of using a high-throughput inference engine with PagedAttention and continuous batching to generate text completions asynchronously during reinforcement learning training.
Description
In GRPO training, each training step requires generating hundreds or thousands of completions from the current policy. Naive generation using the training model is extremely slow because:
- The training framework (DeepSpeed) is optimized for backward passes, not autoregressive decoding.
- Autoregressive decoding is inherently sequential: each token depends on all previous tokens, so generation cannot be parallelized across the sequence dimension.
- KV-cache management in training frameworks is not optimized.
The solution is to decouple generation from training by running dedicated inference engines (vLLM) on separate GPUs. vLLM provides several key optimizations:
- PagedAttention: Manages KV-cache as non-contiguous memory pages, similar to virtual memory in operating systems. This eliminates memory fragmentation and enables near-optimal GPU memory utilization.
- Continuous batching: New requests can be added to a running batch without waiting for all current requests to complete, maximizing GPU utilization.
- Tensor parallelism: For large models, vLLM can shard model weights across multiple GPUs within a single engine.
- Asynchronous operation: The vLLM engine runs in its own event loop, processing generation requests concurrently with training on separate GPUs.
The key architectural insight is the actor-based separation: vLLM engines are Ray actors that receive prompts via a queue, generate completions, compute rewards, and return results to the data preparation pipeline. Meanwhile, the learner GPUs perform forward/backward passes on previously generated data.
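The actor-based separation can be sketched with plain Python threads and queues standing in for Ray actors and vLLM engines (the generation and reward functions here are placeholders, not the real open-instruct pipeline):

```python
import queue
import threading

# Hypothetical sketch: a "vLLM engine" thread consumes prompts from a queue,
# generates completions, scores them, and pushes results to a results queue,
# while the learner trains on previously generated batches.

prompt_queue: "queue.Queue" = queue.Queue()
results_queue: "queue.Queue" = queue.Queue()

def generation_actor() -> None:
    """Stands in for a Ray actor wrapping a vLLM engine."""
    while True:
        prompt = prompt_queue.get()
        if prompt is None:                        # shutdown sentinel
            break
        completion = prompt + " ... completion"   # placeholder for vLLM generate()
        reward = float(len(completion))           # placeholder reward computation
        results_queue.put({"prompt": prompt,
                           "completion": completion,
                           "reward": reward})

engine = threading.Thread(target=generation_actor)
engine.start()

for p in ["prompt-0", "prompt-1", "prompt-2"]:
    prompt_queue.put(p)
prompt_queue.put(None)

# The learner would consume results while running forward/backward passes
# on earlier batches; here we just drain the queue.
rollouts = [results_queue.get() for _ in range(3)]
engine.join()
print(len(rollouts))  # 3
```

The queue boundary is the whole point: neither side blocks on the other as long as there is buffered work, which is what lets training and generation overlap.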
Usage
vLLM async generation is used whenever the GRPO pipeline requires fast rollout generation. It is the standard generation backend for all GRPO training configurations. The number of engines, tensor parallel size, and GPU memory utilization are tuned based on available hardware.
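A typical launch might tune these knobs via command-line flags. The flag names below are assumptions modeled on open-instruct's GRPO script and are not verified against the current CLI:

```shell
# Illustrative only: 4 vLLM engines, each sharded over 2 GPUs via tensor
# parallelism, reserving 90% of each engine GPU's memory for weights + KV-cache.
python open_instruct/grpo_fast.py \
    --vllm_num_engines 4 \
    --vllm_tensor_parallel_size 2 \
    --vllm_gpu_memory_utilization 0.9 \
    --async_steps 1
```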
Theoretical Basis
The throughput of the generation phase determines the overall training speed because training and generation operate in a producer-consumer pattern:
generation_throughput (tokens/sec) = (batch_size * avg_response_length) / generation_time
required_throughput   (tokens/sec) = (num_prompts * num_samples * avg_length) / training_step_time

If generation_throughput < required_throughput:
    training GPUs sit idle, waiting for data
    => add more vLLM engines or increase the generation batch size

If generation_throughput > required_throughput:
    vLLM engines sit idle, waiting for weight updates
    => reduce engines or increase async_steps for more overlap
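The balance check is simple arithmetic. A numerical sketch (all figures are illustrative, not measurements):

```python
def generation_throughput(batch_size: int, avg_response_length: int,
                          generation_time_s: float) -> float:
    """Tokens/sec the vLLM engines actually produce."""
    return batch_size * avg_response_length / generation_time_s

def required_throughput(num_prompts: int, num_samples: int,
                        avg_length: int, training_step_time_s: float) -> float:
    """Tokens/sec the learner consumes per training step."""
    return num_prompts * num_samples * avg_length / training_step_time_s

gen = generation_throughput(batch_size=512, avg_response_length=400,
                            generation_time_s=20.0)       # 10,240 tok/s
req = required_throughput(num_prompts=64, num_samples=8,
                          avg_length=400,
                          training_step_time_s=15.0)      # ~13,653 tok/s

if gen < req:
    action = "add engines or increase batch size"    # learner would starve
else:
    action = "reduce engines or raise async_steps"   # engines would idle
print(round(gen), round(req), action)
```

With these example numbers the generators fall short (10,240 < 13,653 tok/s), so the fix is on the generation side.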
PagedAttention maps each sequence's KV-cache to a set of physical blocks:
logical_block[seq_id, block_idx] -> physical_block[page_table[seq_id][block_idx]]
This indirection allows sequences of different lengths to share GPU memory without fragmentation, achieving near 100% memory utilization compared to the typical 60-70% with pre-allocated contiguous buffers.
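A minimal sketch of that indirection, with a shared free pool and per-sequence page tables (the data structures are illustrative, not vLLM's internals):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

free_blocks = list(range(8))            # shared pool of physical block ids
page_table: dict = {}                   # seq_id -> [physical_block, ...]

def append_token(seq_id: int, seq_len: int) -> None:
    """Allocate a new physical block only when a logical block fills up."""
    table = page_table.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 0:       # crossed into a new logical block
        table.append(free_blocks.pop(0))

def physical_block(seq_id: int, block_idx: int) -> int:
    # logical_block[seq_id, block_idx] -> physical_block[page_table[seq_id][block_idx]]
    return page_table[seq_id][block_idx]

# Two sequences of very different lengths draw from the same pool:
for t in range(40):                     # sequence 0: 40 tokens -> 3 blocks
    append_token(0, t)
for t in range(10):                     # sequence 1: 10 tokens -> 1 block
    append_token(1, t)

print(page_table)                       # {0: [0, 1, 2], 1: [3]}
```

Because blocks are allocated on demand from one pool, a short sequence never strands memory reserved for a hypothetical maximum length, which is where the 60-70% waste of contiguous pre-allocation comes from.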
Continuous batching enables the engine to process requests as they arrive rather than waiting for a batch to fill:
while pending_requests:
    batch = select_requests(available_memory, pending_requests)
    outputs = model.forward(batch)
    for request in outputs:
        if request.is_complete():
            yield request
            free(request.kv_cache)  # slot is immediately available for a new request
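The scheduling loop above can be simulated with plain Python. Here each "forward pass" advances every running request by one token, and a finished request's slot is refilled from the pending queue on the very next iteration (slot count and response lengths are illustrative):

```python
import random

random.seed(0)
MAX_SLOTS = 4                           # stand-in for KV-cache capacity
pending = [{"id": i, "remaining": random.randint(1, 6)} for i in range(10)]
running = []
completed = []
steps = 0

while pending or running:
    # Admit new requests as soon as memory (slots) allows -- no waiting
    # for the whole batch to drain.
    while pending and len(running) < MAX_SLOTS:
        running.append(pending.pop(0))
    for req in running:
        req["remaining"] -= 1           # one decode step per running request
    finished = [r for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    completed.extend(r["id"] for r in finished)   # slots freed immediately
    steps += 1

print(len(completed), steps)
```

Compared with static batching, where the batch only refills after every member finishes, this keeps all slots busy whenever work is pending, which is the source of vLLM's throughput advantage.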