Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Vllm project Vllm Offline Text Generation

From Leeroopedia


Knowledge Sources
Domains LLMs, Inference, Batch_Processing
Last Updated 2026-02-08 13:00 GMT

Overview

End-to-end process for running high-throughput offline text generation on Large Language Models using vLLM's batch inference engine.

Description

This workflow covers the standard procedure for generating text completions from one or more prompts using vLLM in offline (non-server) mode. It leverages PagedAttention for efficient KV cache management and continuous batching to maximize GPU utilization. The process covers model initialization, sampling parameter configuration, prompt preparation, batch generation, and output processing. This is the most fundamental vLLM use case and the foundation for all other workflows.

Usage

Execute this workflow when you have a set of text prompts and need to generate completions offline (not as a running server). Typical scenarios include batch processing of prompts, evaluation scripts, data generation pipelines, and prototyping. You need a HuggingFace model identifier or local model path and sufficient GPU memory to load the model.

Execution Steps

Step 1: Install vLLM

Install the vLLM package from PyPI or build from source. The installation includes all necessary dependencies for CUDA-based inference including PyTorch, Transformers, and custom CUDA kernels.

Key considerations:

  • Ensure CUDA toolkit compatibility with your GPU driver
  • For ROCm or CPU backends, install platform-specific variants
  • Building from source requires CMake and a C++ compiler

Step 2: Configure Sampling Parameters

Define the sampling strategy that controls how tokens are selected during generation. This includes temperature, top-p (nucleus sampling), top-k, repetition penalties, stop conditions, and maximum token limits.

Key considerations:

  • Temperature of 0.0 gives deterministic (greedy) decoding
  • top-p and top-k can be combined for finer control
  • Stop strings and stop token IDs terminate generation early
  • max_tokens limits the output length per request

Step 3: Initialize the LLM Engine

Create an LLM instance by specifying the model name or path. The engine loads the model weights, applies any quantization configuration, allocates KV cache memory using PagedAttention, and prepares CUDA graphs for optimized execution.

Key considerations:

  • Model name can be a HuggingFace ID or local path
  • gpu_memory_utilization controls KV cache allocation (default 0.9)
  • tensor_parallel_size enables multi-GPU inference
  • dtype can be set to float16, bfloat16, or auto
  • max_model_len limits the context window size

Step 4: Prepare Prompts

Format the input prompts as plain text strings or token ID lists. For chat-style models, apply the appropriate chat template to convert conversation messages into the expected prompt format.

Key considerations:

  • Use llm.chat() for conversation-style inputs with automatic template application
  • Use llm.generate() for raw text prompts
  • Prompts can be a single string or a list for batch processing
  • Token IDs can be passed directly via prompt_token_ids

Step 5: Run Batch Generation

Submit all prompts to the engine for batch generation. vLLM automatically handles continuous batching, scheduling, and KV cache management across all requests. The engine processes requests concurrently, maximizing throughput.

Key considerations:

  • Batch size is handled automatically by the engine scheduler
  • Prefix caching can be enabled for repeated prompt prefixes
  • Progress can be tracked with the use_tqdm parameter
  • The engine returns RequestOutput objects containing generated text and metadata

Step 6: Process Outputs

Extract generated text, token IDs, log probabilities, and finish reasons from the returned RequestOutput objects. Each output contains one or more CompletionOutput objects depending on the n parameter (number of completions per prompt).

Key considerations:

  • output.outputs[0].text contains the generated text
  • output.outputs[0].token_ids contains generated token IDs
  • finish_reason indicates whether generation stopped due to length, stop token, or stop string
  • Log probabilities are available when logprobs parameter is set

Execution Diagram

GitHub URL

Workflow Repository