Principle: vLLM Batch Text Generation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, High-Performance Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Batch text generation is the technique of processing multiple input prompts simultaneously through a language model to maximize hardware utilization and overall throughput.
Description
Language model inference on GPUs is often limited by memory bandwidth rather than compute capacity, especially for small batch sizes. Batching multiple prompts together allows the GPU to amortize the cost of loading model weights from memory across more tokens of useful work, dramatically improving throughput measured in tokens per second.
vLLM implements continuous batching (also called iteration-level scheduling), which is more efficient than static batching:
- Static batching: All sequences in a batch must complete before any new sequences can begin. Short sequences waste GPU cycles waiting for the longest sequence to finish.
- Continuous batching: The scheduler can insert new sequences into the batch as soon as any sequence finishes, keeping the GPU saturated at all times. This is managed automatically by the vLLM engine.
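The difference between the two strategies can be modeled with a small simulation. This is an illustrative sketch of scheduler behavior, not vLLM code: each sequence needs `length` decode iterations, and `slots` is the batch capacity.

```python
import heapq

def static_batching_iters(lengths, slots):
    """Static batching: every batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])  # short sequences idle until then
    return total

def continuous_batching_iters(lengths, slots):
    """Continuous batching: a finished slot is refilled immediately from the queue."""
    queue = list(lengths)
    # Each active slot records the iteration at which its sequence finishes.
    active = [queue.pop(0) for _ in range(min(slots, len(queue)))]
    heapq.heapify(active)
    now = 0
    while active:
        now = heapq.heappop(active)                      # earliest finisher
        if queue:
            heapq.heappush(active, now + queue.pop(0))   # refill the free slot
    return now
```

With one long and several short sequences (e.g. lengths `[10, 2, 2, 2]` and 2 slots), static batching needs 12 iterations while continuous batching needs only 10, because freed slots are reused instead of waiting for the longest sequence.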
The key components of the batch generation pipeline are:
- Request queuing: All prompts are submitted to the engine's request queue.
- Scheduling: The scheduler selects which requests to run in each iteration based on available KV cache memory and priority.
- Prefill: For new requests, the model processes all input tokens in one forward pass, populating the KV cache.
- Decode: The model generates one token per iteration for each active sequence, appending to the KV cache.
- Output collection: As sequences reach their stop condition (max tokens, stop string, or EOS), they are removed from the batch and their outputs are returned.
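The five pipeline stages above can be sketched as a single control loop. This is a schematic stand-in, not vLLM's actual engine: the "KV cache" is a plain list and token sampling is stubbed out.

```python
from collections import deque

def run_engine(prompts, max_tokens, slots):
    """Toy pipeline: queue -> schedule -> prefill -> decode -> collect."""
    queue = deque(prompts)                     # 1. request queuing
    active, outputs = {}, {}
    while queue or active:
        while queue and len(active) < slots:   # 2. scheduling, bounded by slots
            prompt = queue.popleft()
            active[prompt] = list(prompt)      # 3. "prefill": stand-in KV cache
            outputs[prompt] = []
        finished = []
        for prompt, cache in active.items():   # 4. decode: one token per iteration
            token = f"tok{len(outputs[prompt])}"  # stand-in for model sampling
            outputs[prompt].append(token)
            cache.append(token)
            if len(outputs[prompt]) >= max_tokens:  # stop condition reached
                finished.append(prompt)
        for prompt in finished:                # 5. output collection frees the slot
            del active[prompt]
    return outputs

results = run_engine(["a", "b", "c"], max_tokens=3, slots=2)
```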
Usage
Use batch generation whenever you have multiple prompts to process. Pass all prompts in a single list to LLM.generate() rather than calling it once per prompt. This allows vLLM's scheduler to maximize GPU utilization through continuous batching.
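The call pattern looks like the following. To keep the snippet self-contained, the vLLM classes are stubbed here; with vLLM installed you would instead write `from vllm import LLM, SamplingParams`, and the calls have the same shape (real `generate()` returns `RequestOutput` objects, not strings). The model name is illustrative.

```python
class SamplingParams:                   # stub mirroring vllm.SamplingParams
    def __init__(self, temperature=0.8, max_tokens=64):
        self.temperature, self.max_tokens = temperature, max_tokens

class LLM:                              # stub mirroring vllm.LLM
    def __init__(self, model):
        self.model = model
    def generate(self, prompts, sampling_params):
        # Real vLLM runs all prompts through continuous batching here.
        return [f"<generated for: {p}>" for p in prompts]

llm = LLM(model="facebook/opt-125m")    # model name is illustrative
params = SamplingParams(temperature=0.8, max_tokens=64)

# Preferred: one call with every prompt, so the scheduler can batch them all.
prompts = ["Explain KV cache.", "What is PagedAttention?", "Define throughput."]
outputs = llm.generate(prompts, params)

# Anti-pattern: one call per prompt defeats continuous batching:
# for p in prompts:
#     llm.generate([p], params)
```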
Theoretical Basis
Throughput model: For a transformer with L layers, hidden dimension d, and batch size B, the per-token compute cost scales as O(L * d^2), while the memory bandwidth cost of loading the weights is O(L * d^2) per forward pass, amortized across all B tokens in the batch. The arithmetic intensity (compute-to-memory ratio) therefore scales with B, which explains why larger batches yield better GPU utilization.
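A worked version of this argument, with illustrative constants (a factor of 2 FLOPs per multiply-add and 2-byte fp16 weights; both simplifications):

```python
def arithmetic_intensity(batch_size, layers, hidden):
    """FLOPs per byte of weight traffic for one decode step."""
    flops = 2 * batch_size * layers * hidden ** 2   # compute scales with B
    bytes_moved = 2 * layers * hidden ** 2          # weights read once per batch
    return flops / bytes_moved                      # ratio grows linearly in B

# At batch size 1 the GPU does ~1 FLOP per byte loaded (bandwidth-bound);
# at batch size 32 it does ~32 FLOPs per byte (far better utilization).
single = arithmetic_intensity(1, 32, 4096)
batched = arithmetic_intensity(32, 32, 4096)
```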
PagedAttention: vLLM's core memory management innovation. Instead of pre-allocating contiguous memory for each sequence's maximum possible length, PagedAttention manages KV cache in fixed-size blocks (pages), similar to virtual memory in operating systems. This eliminates internal memory fragmentation and allows:
- Dynamic memory allocation as sequences grow
- Memory sharing between sequences (e.g., for beam search or shared prefixes)
- Over-commitment of memory, enabling larger effective batch sizes
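A toy block-table allocator in the spirit of PagedAttention, showing how the KV cache grows block by block from a shared pool rather than as one contiguous pre-allocation per sequence. This is illustrative only, not vLLM's implementation; the block size of 16 tokens matches vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # shared pool of physical blocks
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:             # boundary: grab a fresh block
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE]       # physical block for this token

    def free_sequence(self, seq_id):
        """Finished sequences return their blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id))

alloc = BlockAllocator(num_blocks=8)
for pos in range(40):                         # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("seq0", pos)
```

Because blocks are allocated on demand, a 40-token sequence holds exactly 3 blocks instead of reserving space for its maximum possible length, which is what enables the larger effective batch sizes listed above.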
Continuous batching throughput: Given N prompts with varying output lengths, continuous batching achieves near-optimal throughput:
throughput ≈ N * avg_output_length / total_wall_time
where total_wall_time approaches the theoretical minimum (limited by the longest single sequence) rather than the sum of all sequence times.
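Plugging illustrative numbers into this relation (the values below are hypothetical, not measurements):

```python
n_prompts = 64
avg_output_length = 100          # tokens generated per sequence, on average
total_wall_time = 8.0            # seconds to drain the whole batch

# throughput ≈ N * avg_output_length / total_wall_time
throughput = n_prompts * avg_output_length / total_wall_time  # tokens per second
```

Under static batching the same workload would take longer (each batch waits on its longest member), so the denominator grows and throughput drops accordingly.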