Implementation:Vllm project Vllm RequestOutput VLM Access

Knowledge Sources	vLLM vLLM Docs
Domains	Text Generation, Vision Language Models, Output Parsing
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool for reading generated text and metadata from VLM inference results through vLLM's RequestOutput class and CompletionOutput dataclass.

Description

LLM.generate() returns a list of RequestOutput objects, one per input prompt and in the same order as the inputs. Each RequestOutput contains:

request_id: A unique identifier for the request.
prompt: The original prompt string (including vision token placeholders).
prompt_token_ids: The tokenized prompt as a list of integer token IDs.
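outputs: A list of CompletionOutput objects, one per generated sequence (usually one; more when several sequences are requested per prompt, for example with SamplingParams n greater than 1 or with beam search).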
finished: Boolean indicating whether generation is complete.
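As a minimal sketch of how these fields fit together, the helper below condenses one RequestOutput into a plain dictionary. The name summarize_request is hypothetical and not part of vLLM, and the object built at the bottom is only a stand-in with the same attribute names, so the snippet runs without a model.

```python
from types import SimpleNamespace


def summarize_request(request_output):
    """Condense one RequestOutput-like object into a plain dict (hypothetical helper)."""
    prompt_ids = request_output.prompt_token_ids  # may be None per the signature
    return {
        "request_id": request_output.request_id,
        "num_prompt_tokens": None if prompt_ids is None else len(prompt_ids),
        "num_completions": len(request_output.outputs),
        "finished": request_output.finished,
        "texts": [completion.text for completion in request_output.outputs],
    }


# Stand-in object, NOT a real vLLM RequestOutput; it only mirrors the attribute names.
fake_request = SimpleNamespace(
    request_id="0",
    prompt="USER: <image>\nDescribe this image.\nASSISTANT:",
    prompt_token_ids=[1, 2, 3],
    outputs=[
        SimpleNamespace(index=0, text="A cat."),
        SimpleNamespace(index=1, text="A kitten."),
    ],
    finished=True,
)

summary = summarize_request(fake_request)
print(summary)
```

With a real result, the same call would be summarize_request(outputs[0]).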

Each CompletionOutput contains:

text: The generated text string (image descriptions, VQA answers, OCR results).
token_ids: The generated token IDs as a sequence of integers.
cumulative_logprob: The cumulative log probability of the generated sequence (useful for confidence estimation).
logprobs: Per-token log probabilities (if requested via sampling params).
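finish_reason: Why generation stopped: "stop" (a stop condition was reached), "length" (the token limit was reached), or None (typically while generation is still in progress).
stop_reason: The stop token ID (int) or stop string (str) that triggered stopping, or None if generation ended for another reason.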
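Because cumulative_logprob sums over every generated token, longer outputs score lower by construction. One common workaround, sketched here as an assumption rather than anything vLLM provides, is to divide by the number of generated tokens. The helper name mean_token_logprob is hypothetical, and the object at the bottom is a stand-in for a CompletionOutput.

```python
import math
from types import SimpleNamespace


def mean_token_logprob(completion):
    """Length-normalized log probability, or None when it cannot be computed."""
    num_tokens = len(completion.token_ids)
    if completion.cumulative_logprob is None or num_tokens == 0:
        return None
    return completion.cumulative_logprob / num_tokens


# Stand-in object, NOT a real vLLM CompletionOutput.
fake_completion = SimpleNamespace(token_ids=[11, 12, 13, 14], cumulative_logprob=-2.0)

avg = mean_token_logprob(fake_completion)
print(f"mean log probability per token: {avg}")
# exp(avg) is the geometric mean of the per-token probabilities.
print(f"geometric mean token probability: {math.exp(avg):.3f}")
```

The normalized score is only a rough confidence proxy; it makes outputs of different lengths comparable but is not a calibrated probability of correctness.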

Usage

Use RequestOutput access when:

Extracting the generated text from VLM inference for downstream processing.
Analyzing generation confidence via log probabilities.
Debugging VLM outputs by inspecting token IDs and finish reasons.
Processing batch results where outputs must be matched to inputs.
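For the last case, a minimal sketch of matching outputs back to inputs is shown below. It assumes the result list has one entry per input in input order; the helper name pair_inputs_with_outputs is hypothetical, and the objects used in the demo are stand-ins rather than real vLLM results.

```python
from types import SimpleNamespace


def pair_inputs_with_outputs(inputs, outputs):
    """Pair each input with the text of its first completion, by position."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f"expected {len(inputs)} outputs, got {len(outputs)}"
        )
    return [
        (inp, request_output.outputs[0].text)
        for inp, request_output in zip(inputs, outputs)
    ]


# Stand-in inputs and outputs, NOT real vLLM objects.
batch_inputs = [{"prompt": "first prompt"}, {"prompt": "second prompt"}]
batch_outputs = [
    SimpleNamespace(request_id="0", outputs=[SimpleNamespace(text="first answer")]),
    SimpleNamespace(request_id="1", outputs=[SimpleNamespace(text="second answer")]),
]

pairs = pair_inputs_with_outputs(batch_inputs, batch_outputs)
for inp, text in pairs:
    print(inp["prompt"], "->", text)
```

The explicit length check guards against silently dropping items, which a bare zip() would do if the two lists ever differed in length.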

Code Reference

Source Location

Repository: vllm
File: vllm/outputs.py (lines 22-65 for CompletionOutput, lines 86-193 for RequestOutput)

Signature

@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
    ) -> None: ...

Import

from vllm.outputs import RequestOutput, CompletionOutput

I/O Contract

Inputs

Name	Type	Required	Description
outputs	`list[RequestOutput]`	Yes	Return value from `LLM.generate()`

Outputs

Name	Type	Description
text	`str`	Generated text (image description, VQA answer, OCR result, etc.)
token_ids	`Sequence[int]`	Token IDs of the generated output
cumulative_logprob	`float | None`	Cumulative log probability of the entire generated sequence
finish_reason	`str | None`	Reason generation stopped: `"stop"`, `"length"`, or `None`
stop_reason	`int | str | None`	Specific stop token ID or stop string that triggered termination
prompt	`str | None`	Original prompt string for correlation with input
request_id	`str`	Unique request identifier for batch tracking
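The fields in the table can be flattened into one record per completion for logging or storage. The sketch below is one possible layout, not a vLLM utility: to_records is a hypothetical name, the record keys simply reuse the attribute names, and the demo data is a stand-in.

```python
import json
from types import SimpleNamespace


def to_records(outputs):
    """Flatten RequestOutput-like objects into JSON-friendly dicts, one per completion."""
    records = []
    for request_output in outputs:
        for completion in request_output.outputs:
            records.append(
                {
                    "request_id": request_output.request_id,
                    "prompt": request_output.prompt,
                    "index": completion.index,
                    "text": completion.text,
                    # token_ids is a Sequence[int]; copy it into a list for JSON.
                    "token_ids": list(completion.token_ids),
                    "cumulative_logprob": completion.cumulative_logprob,
                    "finish_reason": completion.finish_reason,
                    "stop_reason": completion.stop_reason,
                }
            )
    return records


# Stand-in data, NOT real vLLM objects.
fake_outputs = [
    SimpleNamespace(
        request_id="0",
        prompt="USER: <image>\nDescribe this image.\nASSISTANT:",
        outputs=[
            SimpleNamespace(
                index=0,
                text="A cat on a sofa.",
                token_ids=(5, 6),
                cumulative_logprob=-0.25,
                finish_reason="stop",
                stop_reason=None,
            )
        ],
    )
]

records = to_records(fake_outputs)
print(json.dumps(records, indent=2))
```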

Usage Examples

Basic Text Extraction from VLM Output

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

image = ImageAsset("cherry_blossom").pil_image
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0, max_tokens=128),
)

# Extract the generated text
generated_text = outputs[0].outputs[0].text
print(generated_text)
# Example: "The image shows cherry blossom trees in full bloom..."

Processing Batch Results

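# Batch generation. batch_prompts (a list of prompt dicts like the one in the
# previous example) and sampling_params (a SamplingParams instance) are assumed
# to be defined already.
outputs = llm.generate(batch_prompts, sampling_params=sampling_params)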

for i, output in enumerate(outputs):
    generated_text = output.outputs[0].text
    finish_reason = output.outputs[0].finish_reason
    print(f"Request {i}: {generated_text}")
    print(f"  Finish reason: {finish_reason}")
    print(f"  Tokens generated: {len(output.outputs[0].token_ids)}")
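In a batch it is often useful to know which requests were cut off so that only those are rerun with a larger token budget. The following is a small sketch under that assumption; truncated_indices is a hypothetical helper, and the demo objects are stand-ins for real results.

```python
from types import SimpleNamespace


def truncated_indices(outputs):
    """Return positions of requests where any completion ended with finish_reason 'length'."""
    return [
        i
        for i, request_output in enumerate(outputs)
        if any(c.finish_reason == "length" for c in request_output.outputs)
    ]


# Stand-in results, NOT real vLLM objects.
fake_batch = [
    SimpleNamespace(outputs=[SimpleNamespace(finish_reason="stop")]),
    SimpleNamespace(outputs=[SimpleNamespace(finish_reason="length")]),
    SimpleNamespace(outputs=[SimpleNamespace(finish_reason="stop")]),
]

to_rerun = truncated_indices(fake_batch)
print(f"Requests to rerun with a larger max_tokens: {to_rerun}")
```

The returned positions can be used to index back into the original prompt list when building the follow-up batch.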

Checking Finish Reason and Confidence

output = outputs[0]
completion = output.outputs[0]

# Check if generation completed naturally
if completion.finish_reason == "stop":
    print("Generation completed naturally (hit stop token)")
elif completion.finish_reason == "length":
    print("Generation truncated (hit max_tokens limit)")

# Check confidence via cumulative log probability
if completion.cumulative_logprob is not None:
    print(f"Cumulative log probability: {completion.cumulative_logprob}")
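The finish_reason check can be combined with stop_reason to say what actually ended generation. The sketch below is illustrative only: describe_finish is a hypothetical helper, it treats a finish_reason of None as "not finished" (an assumption), and it falls through to a generic message for any value not listed in this page.

```python
from types import SimpleNamespace


def describe_finish(completion):
    """Return a short human-readable description of why a completion ended."""
    reason = completion.finish_reason
    if reason is None:
        return "not finished"  # assumption: None means generation is still running
    if reason == "length":
        return "truncated at the token limit"
    if reason == "stop":
        if completion.stop_reason is None:
            # No specific stop token ID or stop string was recorded.
            return "stopped without a recorded stop token or string"
        return f"stopped on {completion.stop_reason!r}"
    return f"finished with reason {reason!r}"


# Stand-in completions, NOT real vLLM objects.
examples = [
    SimpleNamespace(finish_reason="stop", stop_reason="\nUSER:"),
    SimpleNamespace(finish_reason="stop", stop_reason=None),
    SimpleNamespace(finish_reason="length", stop_reason=None),
    SimpleNamespace(finish_reason=None, stop_reason=None),
]
for example in examples:
    print(describe_finish(example))
```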

Full Pipeline: Image Captioning with Output Processing

from vllm import LLM, SamplingParams
from PIL import Image

# Setup
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

# Load image
image = Image.open("/path/to/photo.jpg").convert("RGB")

# Generate
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image.\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
)

# Process output
result = outputs[0]
caption = result.outputs[0].text.strip()
num_tokens = len(result.outputs[0].token_ids)
was_truncated = result.outputs[0].finish_reason == "length"

print(f"Caption: {caption}")
print(f"Generated {num_tokens} tokens")
if was_truncated:
    print("Warning: output was truncated, consider increasing max_tokens")

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Multimodal_Output_Processing
