Implementation:Vllm project Vllm RequestOutput VLM Access
| Knowledge Sources | |
|---|---|
| Domains | Text Generation, Vision Language Models, Output Parsing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool for accessing generated text and metadata from VLM inference results through vLLM's RequestOutput class and CompletionOutput dataclass.
Description
After calling LLM.generate(), the results are returned as a list of RequestOutput objects, one per input prompt, in the same order as the inputs. Each RequestOutput contains:
- request_id: A unique identifier for the request.
- prompt: The original prompt string (including vision token placeholders).
- prompt_token_ids: The tokenized prompt as a list of integer token IDs.
- outputs: A list of CompletionOutput objects, one per requested sequence (a single entry by default; SamplingParams(n=k) requests k of them).
- finished: Boolean indicating whether generation is complete.
Each CompletionOutput contains:
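- text: The generated text string (image descriptions, VQA answers, OCR results).
- token_ids: The generated token IDs as a sequence of integers.
- cumulative_logprob: The cumulative log probability of the generated sequence (useful for confidence estimation). It may be None when log probabilities were not requested.
- logprobs: Per-token log probabilities (if requested via the logprobs sampling parameter).
- finish_reason: Why generation stopped. Possible values are "stop" (EOS token, stop string, or stop token), "length" (token limit reached), or None while the sequence is still being generated.
- stop_reason: The stop string or stop token ID that triggered stopping. It is None if generation ended for another reason, including reaching the EOS token.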
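The two termination fields are easiest to interpret together. Following the field descriptions above, a hypothetical `describe_stop` helper could map them to a readable label. This is a sketch, and the label strings are illustrative only. It uses a `types.SimpleNamespace` stand-in with the same attribute names as CompletionOutput, so it runs without loading a model.

```python
from types import SimpleNamespace
from typing import Any


def describe_stop(completion: Any) -> str:
    """Map finish_reason/stop_reason to a readable label (hypothetical helper)."""
    reason = completion.finish_reason
    stop = completion.stop_reason
    if reason is None:
        return "still generating"
    if reason == "length":
        return "truncated (length limit)"
    if reason == "stop":
        if stop is None:
            # stop_reason stays None when the EOS token ended generation
            return "end-of-sequence token"
        if isinstance(stop, int):
            return f"stop token id {stop}"
        return f"stop string {stop!r}"
    # Any other finish_reason value is reported verbatim
    return f"other: {reason}"


# Stand-in object with the same attribute names as CompletionOutput
example = SimpleNamespace(finish_reason="stop", stop_reason="###")
print(describe_stop(example))  # stop string '###'
```

With real results, pass `request_output.outputs[0]` in place of the stand-in.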
Usage
Use RequestOutput access when:
- Extracting the generated text from VLM inference for downstream processing.
- Analyzing generation confidence via log probabilities.
- Debugging VLM outputs by inspecting token IDs and finish reasons.
- Processing batch results where outputs must be matched to inputs.
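For the confidence-analysis case, a request that asks for several completions (for example via `SamplingParams(n=3)`) can be reduced to its highest-scoring completion. The sketch below compares `cumulative_logprob` across the entries of `outputs`. The helper name `best_completion` is hypothetical. Falling back to the first completion when any score is None is an assumption of this sketch, not vLLM behavior. Stand-in objects replace real RequestOutput/CompletionOutput instances.

```python
from types import SimpleNamespace
from typing import Any


def best_completion(request_output: Any) -> Any:
    """Return the completion with the highest cumulative log probability.

    Hypothetical helper: falls back to the first completion when any score
    is None (e.g. when logprobs were not requested).
    """
    completions = request_output.outputs
    if not completions:
        raise ValueError(f"request {request_output.request_id} has no completions")
    if any(c.cumulative_logprob is None for c in completions):
        return completions[0]
    return max(completions, key=lambda c: c.cumulative_logprob)


# Stand-in objects with the same attribute names as RequestOutput/CompletionOutput
fake = SimpleNamespace(
    request_id="0",
    outputs=[
        SimpleNamespace(index=0, text="A tree in bloom.", cumulative_logprob=-4.2),
        SimpleNamespace(index=1, text="Pink cherry blossoms.", cumulative_logprob=-2.7),
    ],
)
print(best_completion(fake).text)  # Pink cherry blossoms.
```

Raw cumulative log probability tends to favor shorter completions, because every extra token adds a negative term. Length normalization is a common alternative when completions differ in length.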
Code Reference
Source Location
- Repository: vllm
- File: vllm/outputs.py (lines 22-65 for CompletionOutput, lines 86-193 for RequestOutput)
Signature
```python
# Type names such as SampleLogprobs, LoRARequest and RequestStateStats
# are defined elsewhere in vLLM.
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None


class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
    ) -> None: ...
```
Import
```python
from vllm.outputs import RequestOutput, CompletionOutput
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| outputs | list[RequestOutput] | Yes | Return value from LLM.generate() |
Outputs
| Name | Type | Defined on | Description |
|---|---|---|---|
| text | str | CompletionOutput | Generated text (image description, VQA answer, OCR result, etc.) |
| token_ids | Sequence[int] | CompletionOutput | Token IDs of the generated output |
| cumulative_logprob | float or None | CompletionOutput | Cumulative log probability of the entire generated sequence |
| finish_reason | str or None | CompletionOutput | Reason generation stopped: "stop", "length", or None |
| stop_reason | int, str, or None | CompletionOutput | Stop token ID or stop string that triggered termination |
| prompt | str or None | RequestOutput | Original prompt string for correlation with input |
| request_id | str | RequestOutput | Unique request identifier for batch tracking |
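The fields in the Outputs table are often persisted for later analysis. The sketch below flattens a list of RequestOutput-like objects into one JSON-serializable record per completion. It is a minimal sketch that relies only on the attribute names in the table plus `index` from the signature. The helper name `to_records` and the record layout are hypothetical. The stand-in values in the usage part are made up.

```python
import json
from types import SimpleNamespace
from typing import Any, Iterable


def to_records(request_outputs: Iterable[Any]) -> list[dict]:
    """Flatten RequestOutput-like objects into one dict per completion."""
    records = []
    for req in request_outputs:
        for comp in req.outputs:
            records.append(
                {
                    "request_id": req.request_id,
                    "prompt": req.prompt,
                    "index": comp.index,
                    "text": comp.text,
                    "num_tokens": len(comp.token_ids),
                    "cumulative_logprob": comp.cumulative_logprob,
                    "finish_reason": comp.finish_reason,
                    "stop_reason": comp.stop_reason,
                }
            )
    return records


# Stand-in data with made-up values, shaped like LLM.generate() results
fake_outputs = [
    SimpleNamespace(
        request_id="0",
        prompt="USER: <image>\nDescribe this image.\nASSISTANT:",
        outputs=[
            SimpleNamespace(index=0, text=" A cat.", token_ids=[319, 6635, 29889],
                            cumulative_logprob=None, finish_reason="stop", stop_reason=None)
        ],
    )
]
for record in to_records(fake_outputs):
    print(json.dumps(record))  # one JSON Lines row per completion
```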
Usage Examples
Basic Text Extraction from VLM Output
```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

image = ImageAsset("cherry_blossom").pil_image
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0, max_tokens=128),
)

# Extract the generated text
generated_text = outputs[0].outputs[0].text
print(generated_text)
# Example: "The image shows cherry blossom trees in full bloom..."
```
Processing Batch Results
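```python
from vllm import SamplingParams
from vllm.assets.image import ImageAsset

# Assumes `llm` was created as in the previous example
sampling_params = SamplingParams(temperature=0, max_tokens=128)
batch_prompts = [
    {"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
     "multi_modal_data": {"image": ImageAsset(name).pil_image}}
    for name in ("cherry_blossom", "stop_sign")
]
outputs = llm.generate(batch_prompts, sampling_params=sampling_params)
# One RequestOutput per prompt, in the same order as batch_prompts
for i, output in enumerate(outputs):
    completion = output.outputs[0]
    print(f"Request {i}: {completion.text}")
    print(f"  Finish reason: {completion.finish_reason}")
    print(f"  Tokens generated: {len(completion.token_ids)}")
```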
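The loop above identifies results only by position. Because LLM.generate() returns results in input order, each result can be attached to an input identifier such as an image filename by zipping the two lists. The sketch below uses a hypothetical `match_to_inputs` helper. It checks that the lengths agree and that the keys are unique, which are assumptions this sketch enforces rather than vLLM requirements. Stand-in objects replace real RequestOutputs.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def match_to_inputs(keys: Sequence[str], request_outputs: Sequence[Any]) -> dict[str, str]:
    """Pair each input key (e.g. an image filename) with its generated text."""
    if len(keys) != len(request_outputs):
        raise ValueError(f"{len(keys)} inputs but {len(request_outputs)} outputs")
    if len(set(keys)) != len(keys):
        raise ValueError("input keys must be unique")
    # Relies on outputs being in the same order as the submitted prompts
    return {key: out.outputs[0].text.strip() for key, out in zip(keys, request_outputs)}


image_files = ["a.jpg", "b.jpg"]
fake_outputs = [
    SimpleNamespace(outputs=[SimpleNamespace(text=" A dog. ")]),
    SimpleNamespace(outputs=[SimpleNamespace(text=" A stop sign.")]),
]
print(match_to_inputs(image_files, fake_outputs))
# {'a.jpg': 'A dog.', 'b.jpg': 'A stop sign.'}
```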
Checking Finish Reason and Confidence
```python
# `outputs` is the list returned by llm.generate() in an earlier example
output = outputs[0]
completion = output.outputs[0]

# Check whether generation completed naturally
if completion.finish_reason == "stop":
    print("Generation completed naturally (hit EOS, a stop string, or a stop token)")
elif completion.finish_reason == "length":
    print("Generation truncated (hit the max_tokens limit)")

# Check confidence via cumulative log probability; this may be None
# unless logprobs were requested in SamplingParams
if completion.cumulative_logprob is not None:
    print(f"Cumulative log probability: {completion.cumulative_logprob}")
```
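The check above prints the raw cumulative log probability. Because this value is a sum over generated tokens, it grows more negative as outputs get longer. One generic way to make scores comparable is to divide by the number of generated tokens and exponentiate the negated mean to obtain perplexity. This is a common technique, not something vLLM computes here, and the helper name is hypothetical. Requesting logprobs (for example `SamplingParams(logprobs=1)`) is assumed to be needed for `cumulative_logprob` to be populated.

```python
import math
from typing import Optional, Sequence


def length_normalized_confidence(
    cumulative_logprob: Optional[float], token_ids: Sequence[int]
) -> Optional[tuple[float, float]]:
    """Return (mean log probability per token, perplexity), or None if unavailable."""
    if cumulative_logprob is None or len(token_ids) == 0:
        return None
    mean_logprob = cumulative_logprob / len(token_ids)
    # Perplexity: exp of the negated mean log probability; lower means more confident
    perplexity = math.exp(-mean_logprob)
    return mean_logprob, perplexity


# Made-up values: a 4-token completion with cumulative log probability -6.0
result = length_normalized_confidence(-6.0, [101, 102, 103, 104])
print(result)
```

With real results, call `length_normalized_confidence(completion.cumulative_logprob, completion.token_ids)`.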
Full Pipeline: Image Captioning with Output Processing
```python
from vllm import LLM, SamplingParams
from PIL import Image

# Setup
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

# Load image
image = Image.open("/path/to/photo.jpg").convert("RGB")

# Generate
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image.\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
)

# Process output
result = outputs[0]
caption = result.outputs[0].text.strip()
num_tokens = len(result.outputs[0].token_ids)
was_truncated = result.outputs[0].finish_reason == "length"

print(f"Caption: {caption}")
print(f"Generated {num_tokens} tokens")
if was_truncated:
    print("Warning: output was truncated, consider increasing max_tokens")
```
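The pipeline above only warns when `finish_reason == "length"`. One way to act on that warning is to rerun with a larger budget until the output finishes or a cap is reached. In the sketch below, `generate_once` is a hypothetical callable that runs a single generation with a given `max_tokens` and returns a CompletionOutput-like object. The doubling schedule and default cap are arbitrary choices. The cap should stay within `max_model_len` minus the prompt length. A fake generator stands in for vLLM in the usage part.

```python
from types import SimpleNamespace
from typing import Any, Callable


def generate_until_complete(
    generate_once: Callable[[int], Any],
    max_tokens: int = 256,
    max_tokens_cap: int = 2048,
) -> Any:
    """Double max_tokens while the output is truncated, up to max_tokens_cap.

    With vLLM, generate_once might wrap (hypothetically):
        llm.generate(request, SamplingParams(temperature=0, max_tokens=n))[0].outputs[0]
    """
    while True:
        completion = generate_once(max_tokens)
        if completion.finish_reason != "length" or max_tokens >= max_tokens_cap:
            return completion
        max_tokens = min(max_tokens * 2, max_tokens_cap)


# Fake generator: truncates until it is given at least 1000 tokens
calls = []


def fake_generate(max_tokens: int) -> Any:
    calls.append(max_tokens)
    return SimpleNamespace(finish_reason="length" if max_tokens < 1000 else "stop")


final = generate_until_complete(fake_generate)
print(calls)  # [256, 512, 1024]
print(final.finish_reason)  # stop
```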
Related Pages
Implements Principle