Workflow:Vllm project Vllm Offline Text Generation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, Batch_Processing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
End-to-end process for running high-throughput offline text generation on Large Language Models using vLLM's batch inference engine.
Description
This workflow covers the standard procedure for generating text completions from one or more prompts using vLLM in offline (non-server) mode. It leverages PagedAttention for efficient KV cache management and continuous batching to maximize GPU utilization. The process covers model initialization, sampling parameter configuration, prompt preparation, batch generation, and output processing. This is the most fundamental vLLM use case and the foundation for all other workflows.
Usage
Execute this workflow when you have a set of text prompts and need to generate completions offline (not as a running server). Typical scenarios include batch processing of prompts, evaluation scripts, data generation pipelines, and prototyping. You need a HuggingFace model identifier or local model path and sufficient GPU memory to load the model.
Execution Steps
Step 1: Install vLLM
Install the vLLM package from PyPI or build from source. The installation includes all necessary dependencies for CUDA-based inference including PyTorch, Transformers, and custom CUDA kernels.
Key considerations:
- Ensure CUDA toolkit compatibility with your GPU driver
- For ROCm or CPU backends, install platform-specific variants
- Building from source requires CMake and a C++ compiler
Step 2: Configure Sampling Parameters
Define the sampling strategy that controls how tokens are selected during generation. This includes temperature, top-p (nucleus sampling), top-k, repetition penalties, stop conditions, and maximum token limits.
Key considerations:
- Temperature of 0.0 gives deterministic (greedy) decoding
- top-p and top-k can be combined for finer control
- Stop strings and stop token IDs terminate generation early
- max_tokens limits the output length per request
Step 3: Initialize the LLM Engine
Create an LLM instance by specifying the model name or path. The engine loads the model weights, applies any quantization configuration, allocates KV cache memory using PagedAttention, and prepares CUDA graphs for optimized execution.
Key considerations:
- Model name can be a HuggingFace ID or local path
- gpu_memory_utilization controls KV cache allocation (default 0.9)
- tensor_parallel_size enables multi-GPU inference
- dtype can be set to float16, bfloat16, or auto
- max_model_len limits the context window size
Step 4: Prepare Prompts
Format the input prompts as plain text strings or token ID lists. For chat-style models, apply the appropriate chat template to convert conversation messages into the expected prompt format.
Key considerations:
- Use llm.chat() for conversation-style inputs with automatic template application
- Use llm.generate() for raw text prompts
- Prompts can be a single string or a list for batch processing
- Token IDs can be passed directly via prompt_token_ids
Step 5: Run Batch Generation
Submit all prompts to the engine for batch generation. vLLM automatically handles continuous batching, scheduling, and KV cache management across all requests. The engine processes requests concurrently, maximizing throughput.
Key considerations:
- Batch size is handled automatically by the engine scheduler
- Prefix caching can be enabled for repeated prompt prefixes
- Progress can be tracked with the use_tqdm parameter
- The engine returns RequestOutput objects containing generated text and metadata
Step 6: Process Outputs
Extract generated text, token IDs, log probabilities, and finish reasons from the returned RequestOutput objects. Each output contains one or more CompletionOutput objects depending on the n parameter (number of completions per prompt).
Key considerations:
- output.outputs[0].text contains the generated text
- output.outputs[0].token_ids contains generated token IDs
- finish_reason indicates whether generation stopped due to length, stop token, or stop string
- Log probabilities are available when logprobs parameter is set