
Principle:Allenai Open instruct vLLM Async Generation

From Leeroopedia


Knowledge Sources
Domains: Inference Optimization, Reinforcement Learning
Last Updated: 2026-02-07 00:00 GMT

Overview

vLLM asynchronous generation is the technique of running a high-throughput inference engine, built on PagedAttention and continuous batching, alongside reinforcement learning training so that text completions are generated asynchronously on dedicated GPUs.

Description

In GRPO training, each training step requires generating hundreds or thousands of completions from the current policy. Naive generation using the training model is extremely slow because:

  1. The training framework (DeepSpeed) is optimized for backward passes, not autoregressive decoding.
  2. Autoregressive decoding is inherently sequential: each token depends on the ones before it, so generation cannot be parallelized across the sequence dimension.
  3. KV-cache management in training frameworks is not optimized.

The solution is to decouple generation from training by running dedicated inference engines (vLLM) on separate GPUs. vLLM provides several key optimizations:

  • PagedAttention: Manages KV-cache as non-contiguous memory pages, similar to virtual memory in operating systems. This eliminates memory fragmentation and enables near-optimal GPU memory utilization.
  • Continuous batching: New requests can be added to a running batch without waiting for all current requests to complete, maximizing GPU utilization.
  • Tensor parallelism: For large models, vLLM can shard model weights across multiple GPUs within a single engine.
  • Asynchronous operation: The vLLM engine runs in its own event loop, processing generation requests concurrently with training on separate GPUs.

The key architectural insight is the actor-based separation: vLLM engines are Ray actors that receive prompts via a queue, generate completions, compute rewards, and return results to the data preparation pipeline. Meanwhile, the learner GPUs perform forward/backward passes on previously generated data.
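This producer-consumer handoff can be sketched with threads and queues standing in for Ray actors and GPU engines; `fake_generate` and the reward stub below are illustrative placeholders, not open-instruct code:

```python
import queue
import threading

def fake_generate(prompt: str) -> str:
    # Stand-in for a vLLM engine call.
    return prompt + " -> completion"

def generation_worker(prompt_q: queue.Queue, result_q: queue.Queue) -> None:
    # Plays the role of a vLLM engine actor: pull prompts from the queue,
    # generate, score, and hand results to the data-preparation side.
    while True:
        prompt = prompt_q.get()
        if prompt is None:  # shutdown sentinel
            break
        completion = fake_generate(prompt)
        reward = float(len(completion))  # stub reward function
        result_q.put((prompt, completion, reward))

prompt_q: queue.Queue = queue.Queue()
result_q: queue.Queue = queue.Queue()
worker = threading.Thread(target=generation_worker, args=(prompt_q, result_q))
worker.start()

# The "learner" side enqueues prompts and later drains results, without
# blocking the generator in between.
for p in ["a", "b", "c"]:
    prompt_q.put(p)
prompt_q.put(None)
worker.join()

results = [result_q.get() for _ in range(3)]
```

In the real pipeline the worker is a Ray actor holding a vLLM engine on its own GPUs, but the queue-in, queue-out contract is the same.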

Usage

vLLM async generation is used whenever the GRPO pipeline requires fast rollout generation. It is the standard generation backend for all GRPO training configurations. The number of engines, tensor parallel size, and GPU memory utilization are tuned based on available hardware.
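As an illustration, these knobs map directly onto vLLM's public `LLM` constructor. The model name and values below are placeholders for a single engine, not open-instruct defaults:

```python
from vllm import LLM, SamplingParams

# One vLLM engine; a GRPO run typically launches several of these as actors.
engine = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,       # shard weights across 2 GPUs in this engine
    gpu_memory_utilization=0.9,   # fraction of GPU memory for weights + KV-cache
)

# n samples per prompt corresponds to the GRPO group size.
sampling = SamplingParams(temperature=1.0, max_tokens=512, n=8)
prompts = ["Explain PagedAttention in one sentence."]
outputs = engine.generate(prompts, sampling)
```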

Theoretical Basis

The throughput of the generation phase determines the overall training speed because training and generation operate in a producer-consumer pattern:

Generation throughput (tokens/sec) = (batch_size * avg_response_length) / generation_time

Required throughput = (num_prompts * num_samples * avg_length) / training_step_time

If generation_throughput < required_throughput:
    Training GPUs are idle, waiting for data
    => Add more vLLM engines or increase batch size

If generation_throughput > required_throughput:
    vLLM engines are idle, waiting for weight updates
    => Reduce engines or increase async_steps for overlap
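The balance check above can be written out directly; all numbers here are invented for illustration:

```python
def generation_throughput(batch_size: int, avg_response_length: float,
                          generation_time: float) -> float:
    # tokens/sec produced by the vLLM engines
    return batch_size * avg_response_length / generation_time

def required_throughput(num_prompts: int, num_samples: int,
                        avg_length: float, training_step_time: float) -> float:
    # tokens/sec the learner consumes per training step
    return num_prompts * num_samples * avg_length / training_step_time

def bottleneck(gen_tps: float, req_tps: float) -> str:
    if gen_tps < req_tps:
        return "generation-bound: add vLLM engines or increase batch size"
    return "training-bound: reduce engines or increase async_steps"

gen = generation_throughput(batch_size=256, avg_response_length=512,
                            generation_time=30.0)   # ~4369 tokens/sec
req = required_throughput(num_prompts=64, num_samples=8, avg_length=512,
                          training_step_time=20.0)  # ~13107 tokens/sec
```

With these made-up numbers the run is generation-bound, so the fix is more engines or larger generation batches.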

PagedAttention maps each sequence's KV-cache to a set of physical blocks:

logical_block[seq_id, block_idx] -> physical_block[page_table[seq_id][block_idx]]

This indirection allows sequences of different lengths to share GPU memory without fragmentation, achieving near 100% memory utilization compared to the typical 60-70% with pre-allocated contiguous buffers.
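A toy page table makes the indirection concrete. This is a sketch of the mapping only, not vLLM's actual block manager; block allocation order is arbitrary:

```python
class PageTable:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))  # free physical blocks
        self.table: dict[int, list[int]] = {}         # seq_id -> physical blocks

    def append_block(self, seq_id: int) -> int:
        # Grow a sequence's KV-cache by one block, wherever memory is free.
        phys = self.free.pop()
        self.table.setdefault(seq_id, []).append(phys)
        return phys

    def lookup(self, seq_id: int, block_idx: int) -> int:
        # logical_block[seq_id, block_idx] -> physical block
        return self.table[seq_id][block_idx]

    def free_sequence(self, seq_id: int) -> None:
        # A finished sequence returns its blocks to the pool immediately.
        self.free.extend(self.table.pop(seq_id))

pt = PageTable(num_physical_blocks=8)
pt.append_block(0)  # seq 0, logical block 0
pt.append_block(1)  # seq 1, logical block 0
pt.append_block(0)  # seq 0, logical block 1 -- not contiguous with block 0
```

Because lookups go through the table, a sequence's blocks need not be contiguous, which is what eliminates fragmentation.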

Continuous batching enables the engine to process requests as they arrive rather than waiting for a batch to fill:

while requests in queue:
    batch = select_requests(available_memory, pending_requests)
    outputs = model.forward(batch)
    for request in outputs:
        if request.is_complete():
            yield request
            free(request.kv_cache)
            # Slot is immediately available for a new request
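A runnable toy version of this loop (request lengths and the batch limit are made up) shows short requests finishing early and waiting requests joining mid-flight:

```python
from collections import deque

def continuous_batching(steps_needed: dict[str, int], max_batch: int) -> list[str]:
    pending = deque(steps_needed)   # requests waiting for a slot
    running: dict[str, int] = {}    # request -> remaining decode steps
    completed: list[str] = []       # order of completion
    while pending or running:
        # Admit new requests whenever a slot (memory) is available.
        while pending and len(running) < max_batch:
            req = pending.popleft()
            running[req] = steps_needed[req]
        # One forward pass decodes one token for every running request.
        for req in list(running):
            running[req] -= 1
            if running[req] == 0:
                completed.append(req)  # yield result, free its KV-cache slot
                del running[req]
    return completed

# "b" needs 1 step, "a" needs 3: "b" finishes first and "c" takes its slot
# while "a" is still decoding.
order = continuous_batching({"a": 3, "b": 1, "c": 2, "d": 1}, max_batch=2)
```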

