Principle: vLLM Output Processing
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Software Engineering |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Output processing is the extraction and interpretation of structured generation results from an LLM inference engine, including the generated text, token IDs, log probabilities, and finish metadata.
Description
After a language model generates a sequence of tokens in response to a prompt, the raw output must be organized into a structured representation that downstream application code can consume. Output processing encompasses:
- Detokenization: Converting the generated token ID sequence back into human-readable text. This involves handling special tokens, spacing conventions, and byte-level encoding.
- Finish reason classification: Identifying why generation stopped. Common reasons include reaching the maximum token limit ("length"), encountering a stop sequence or stop token ("stop"), or producing the end-of-sequence token.
- Log probability extraction: When requested, the engine returns the log probability of each generated token and optionally the top-k alternative tokens at each position. This is valuable for confidence estimation, calibration, and debugging.
- Multi-output aggregation: When the sampling configuration requests multiple completions per prompt (n > 1), all completions are collected under the same request output and indexed for easy access.
- Metadata association: Each output retains a reference to its originating request ID, the original prompt, prompt token IDs, and optional metrics (latency, cache statistics).
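The structure described above can be sketched with plain dataclasses. The class and field names below mirror vLLM's `RequestOutput` and `CompletionOutput`, but this is an illustrative sketch, not the library's actual definitions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CompletionOutput:
    """One of the n completions sampled for a request (illustrative sketch)."""
    index: int                           # position among the n completions
    text: str                            # detokenized generated text
    token_ids: list[int]                 # generated token IDs
    cumulative_logprob: Optional[float] = None
    finish_reason: Optional[str] = None  # e.g. "stop" or "length"

@dataclass
class RequestOutput:
    """All completions generated for a single prompt (illustrative sketch)."""
    request_id: str
    prompt: str
    prompt_token_ids: list[int]
    outputs: list[CompletionOutput] = field(default_factory=list)
```

The nesting makes the request → completion → token hierarchy explicit: one `RequestOutput` per prompt, one `CompletionOutput` per sample, and token-level detail inside each completion.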
Usage
Process outputs after every call to LLM.generate() or LLM.chat(). The output structure is consistent regardless of how the prompt was prepared, making it the universal interface between the generation engine and application logic. Common access patterns include extracting the text of the first (or best) completion, checking finish reasons for quality control, and using log probabilities for scoring or ranking.
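The common access patterns can be sketched as small helpers. These are duck-typed against vLLM-style attributes (`outputs`, `text`, `finish_reason`, `cumulative_logprob`) rather than the library's actual classes:

```python
def first_text(request_output):
    """Extract the text of the first completion."""
    return request_output.outputs[0].text

def any_truncated(request_output):
    """Quality-control check: True if any completion was cut off at max_tokens."""
    return any(c.finish_reason == "length" for c in request_output.outputs)

def best_completion(request_output):
    """Rank completions by cumulative log probability and return the most likely."""
    return max(request_output.outputs, key=lambda c: c.cumulative_logprob)
```

Because the output structure is the same for `generate()` and `chat()`, helpers like these work unchanged regardless of how the prompt was prepared.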
Theoretical Basis
The output structure mirrors the hierarchical nature of LLM generation:
Request level: One request corresponds to one input prompt. The engine assigns a unique request ID and preserves the association between input and output through the entire pipeline.
Completion level: Each request may produce n independent completions (controlled by the n parameter). Each completion is an independent sample from the model's output distribution for the same prompt.
Token level: Each completion consists of a sequence of token IDs. When log probabilities are requested, each token position carries:
- The log probability of the selected token: log P(x_t | x_{<t})
- Optionally, the log probabilities of the top-k alternative tokens at that position
The cumulative log probability of a completion is the sum of individual token log probabilities:
cumulative_logprob = sum_{t=1}^{T} log P(x_t | x_{<t})
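As a worked sketch of this sum, with made-up per-token log probabilities:

```python
import math

# Hypothetical per-token log P(x_t | x_{<t}) for a 4-token completion
token_logprobs = [-0.1, -1.2, -0.4, -0.3]

# Sum of token log probabilities = log of the joint sequence probability
cumulative_logprob = sum(token_logprobs)

# Exponentiating recovers P(x_1, ..., x_T) itself
joint_probability = math.exp(cumulative_logprob)
```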
This is equivalent to the log of the joint probability of the entire sequence. It can be used for:
- Sequence ranking: Selecting the most likely completion among multiple samples
- Length normalization: Dividing by sequence length to avoid bias toward shorter sequences
- Perplexity computation: perplexity = exp(-cumulative_logprob / T)
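The three uses above can be sketched together; the candidate scores below are invented for illustration:

```python
import math

# Hypothetical (cumulative_logprob, num_tokens) pairs for three sampled completions
candidates = [(-12.0, 10), (-9.0, 6), (-11.0, 12)]

def length_normalized(cum_lp, num_tokens):
    """Per-token average log probability, avoiding bias toward shorter sequences."""
    return cum_lp / num_tokens

def perplexity(cum_lp, num_tokens):
    """perplexity = exp(-cumulative_logprob / T)"""
    return math.exp(-cum_lp / num_tokens)

# Raw ranking favors the shortest completion; length-normalized ranking can differ
best_raw = max(candidates, key=lambda c: c[0])
best_norm = max(candidates, key=lambda c: length_normalized(*c))
```

Here the raw ranking picks the 6-token completion, while length normalization prefers the 12-token one, illustrating why normalization matters when completions differ in length.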
Finish reasons provide important signals for downstream logic:
- "stop": The model naturally concluded or hit a stop sequence, indicating a complete response
- "length": The model was truncated at max_tokens, indicating the response may be incomplete