Implementation:Vllm project Vllm RequestOutput VLM Access

From Leeroopedia


Knowledge Sources
Domains Text Generation, Vision Language Models, Output Parsing
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for accessing the generated text and metadata of VLM inference results through the RequestOutput class and the CompletionOutput dataclass.

Description

LLM.generate() returns a list of RequestOutput objects, one per input prompt. Each RequestOutput contains:

  • request_id: A unique identifier for the request.
  • prompt: The original prompt string (including vision token placeholders).
  • prompt_token_ids: The tokenized prompt as a list of integer token IDs.
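  • outputs: A list of CompletionOutput objects, one per generated sequence (typically a single entry; more when several candidates are returned per prompt, for example with n greater than 1 in the sampling params or with beam search).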
  • finished: Boolean indicating whether generation is complete.

Each CompletionOutput contains:

  • text: The generated text string (image descriptions, VQA answers, OCR results).
  • token_ids: The generated token IDs as a sequence of integers.
  • cumulative_logprob: The cumulative log probability of the generated sequence (useful for confidence estimation).
  • logprobs: Per-token log probabilities (if requested via sampling params).
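  • finish_reason: Why generation stopped: "stop" (a stop condition such as the end-of-sequence token or a stop string was hit), "length" (the token limit was reached), or None (the sequence has not finished yet).
  • stop_reason: The stop string or stop token ID that ended generation; None if generation ended for any other reason, including the end-of-sequence token.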
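The two field lists above can be flattened into one record per completion, which is often easier to log or serialize. The following is a minimal sketch that relies only on the attribute names listed above and in the signature below. `flatten_request_output` is a hypothetical helper, not part of vLLM, and the `SimpleNamespace` objects are stand-ins for real RequestOutput and CompletionOutput instances so that the snippet runs without loading a model.

```python
from types import SimpleNamespace
from typing import Any


def flatten_request_output(request_output: Any) -> list[dict[str, Any]]:
    """Return one flat dict per completion (hypothetical helper, not a vLLM API)."""
    rows = []
    for completion in request_output.outputs:
        rows.append(
            {
                # Request-level fields are repeated on every row.
                "request_id": request_output.request_id,
                "prompt": request_output.prompt,
                "finished": request_output.finished,
                # Completion-level fields.
                "index": completion.index,
                "text": completion.text,
                "num_tokens": len(completion.token_ids),
                "cumulative_logprob": completion.cumulative_logprob,
                "finish_reason": completion.finish_reason,
                "stop_reason": completion.stop_reason,
            }
        )
    return rows


# Stand-in objects that mimic the attributes described above (made-up values).
fake_completion = SimpleNamespace(
    index=0,
    text="A cherry tree in bloom.",
    token_ids=[319, 14954, 5447],
    cumulative_logprob=-1.5,
    finish_reason="stop",
    stop_reason=None,
)
fake_request = SimpleNamespace(
    request_id="0",
    prompt="USER: <image>\nDescribe this image.\nASSISTANT:",
    prompt_token_ids=[1, 2, 3],
    outputs=[fake_completion],
    finished=True,
)

rows = flatten_request_output(fake_request)
print(rows[0]["text"], rows[0]["num_tokens"])
```

With real results, the same helper would be called on each element of the list returned by LLM.generate().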

Usage

Use RequestOutput access when:

  • Extracting the generated text from VLM inference for downstream processing.
  • Analyzing generation confidence via log probabilities.
  • Debugging VLM outputs by inspecting token IDs and finish reasons.
  • Processing batch results where outputs must be matched to inputs.
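For the last case, one possible way to match batch results to their inputs is to pair them by position. This sketch assumes that LLM.generate() returns results in the same order as the prompts it was given, which is worth confirming for the vLLM version in use. `pair_with_inputs` and the `source` key are hypothetical names introduced for illustration, and the `SimpleNamespace` objects stand in for real RequestOutput instances.

```python
from types import SimpleNamespace
from typing import Any


def pair_with_inputs(
    inputs: list[dict[str, Any]], outputs: list[Any]
) -> list[dict[str, Any]]:
    """Pair each input with the output at the same position (hypothetical helper)."""
    if len(inputs) != len(outputs):
        raise ValueError(
            f"expected {len(inputs)} outputs, got {len(outputs)}"
        )
    return [
        {
            "source": item["source"],  # hypothetical bookkeeping key
            "request_id": out.request_id,
            "text": out.outputs[0].text,
        }
        for item, out in zip(inputs, outputs)
    ]


# Stand-ins for two inputs and their results (made-up values).
batch_inputs = [{"source": "a.jpg"}, {"source": "b.jpg"}]
fake_outputs = [
    SimpleNamespace(request_id="0", outputs=[SimpleNamespace(text="A cat.")]),
    SimpleNamespace(request_id="1", outputs=[SimpleNamespace(text="A dog.")]),
]

paired = pair_with_inputs(batch_inputs, fake_outputs)
for row in paired:
    print(row["source"], "->", row["text"])
```

The length check guards against silently misaligned pairs if the two lists ever diverge.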

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/outputs.py (lines 22-65 for CompletionOutput, lines 86-193 for RequestOutput)

Signature

@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: Sequence[int]
    cumulative_logprob: float | None
    logprobs: SampleLogprobs | None
    finish_reason: str | None = None
    stop_reason: int | str | None = None
    lora_request: LoRARequest | None = None

class RequestOutput:
    def __init__(
        self,
        request_id: str,
        prompt: str | None,
        prompt_token_ids: list[int] | None,
        prompt_logprobs: PromptLogprobs | None,
        outputs: list[CompletionOutput],
        finished: bool,
        metrics: RequestStateStats | None = None,
        lora_request: LoRARequest | None = None,
        encoder_prompt: str | None = None,
        encoder_prompt_token_ids: list[int] | None = None,
        num_cached_tokens: int | None = None,
        *,
        multi_modal_placeholders: MultiModalPlaceholderDict | None = None,
        kv_transfer_params: dict[str, Any] | None = None,
    ) -> None: ...

Import

from vllm.outputs import RequestOutput, CompletionOutput

I/O Contract

Inputs

Name Type Required Description
outputs list[RequestOutput] Yes Return value from LLM.generate()

Outputs

Name Type Description
text str Generated text (image description, VQA answer, OCR result, etc.)
token_ids Sequence[int] Token IDs of the generated output
cumulative_logprob float | None Cumulative log probability of the entire generated sequence
finish_reason str | None Reason generation stopped: "stop", "length", or None
stop_reason int | str | None Specific stop token ID or stop string that triggered termination
prompt str | None Original prompt string for correlation with input
request_id str Unique request identifier for batch tracking

Usage Examples

Basic Text Extraction from VLM Output

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

image = ImageAsset("cherry_blossom").pil_image
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(temperature=0, max_tokens=128),
)

# Extract the generated text
generated_text = outputs[0].outputs[0].text
print(generated_text)
# Example: "The image shows cherry blossom trees in full bloom..."

Processing Batch Results

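# Batch generation; assumes `llm`, `batch_prompts` (a list of prompt dicts),
# and `sampling_params` are already defined as in the previous example
outputs = llm.generate(batch_prompts, sampling_params=sampling_params)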

for i, output in enumerate(outputs):
    generated_text = output.outputs[0].text
    finish_reason = output.outputs[0].finish_reason
    print(f"Request {i}: {generated_text}")
    print(f"  Finish reason: {finish_reason}")
    print(f"  Tokens generated: {len(output.outputs[0].token_ids)}")
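When a batch is large, a per-request printout is hard to scan, and a tally of finish reasons can be more useful. The sketch below counts finish reasons across a batch and lists the positions of truncated requests. Both helper names are hypothetical, and the `SimpleNamespace` objects are stand-ins for real RequestOutput instances.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any


def summarize_finish_reasons(outputs: list[Any]) -> Counter:
    """Count finish reasons over every completion in a batch (hypothetical helper)."""
    return Counter(
        completion.finish_reason
        for request_output in outputs
        for completion in request_output.outputs
    )


def truncated_indices(outputs: list[Any]) -> list[int]:
    """Return positions of requests with at least one length-truncated completion."""
    return [
        i
        for i, request_output in enumerate(outputs)
        if any(c.finish_reason == "length" for c in request_output.outputs)
    ]


def _fake_output(reason: str) -> SimpleNamespace:
    # Stand-in for a RequestOutput holding a single completion.
    return SimpleNamespace(outputs=[SimpleNamespace(finish_reason=reason)])


fake_batch = [_fake_output("stop"), _fake_output("length"), _fake_output("stop")]

counts = summarize_finish_reasons(fake_batch)
truncated = truncated_indices(fake_batch)
print(dict(counts), truncated)
```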

Checking Finish Reason and Confidence

output = outputs[0]
completion = output.outputs[0]

# Check how generation ended
if completion.finish_reason == "stop":
    print("Generation completed naturally (end-of-sequence token or a stop condition)")
elif completion.finish_reason == "length":
    print("Generation truncated (hit the token limit, e.g. max_tokens)")
else:
    print(f"Other finish reason: {completion.finish_reason!r}")

# Check confidence via cumulative log probability
if completion.cumulative_logprob is not None:
    print(f"Cumulative log probability: {completion.cumulative_logprob}")
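The cumulative log probability becomes more negative as the output gets longer, so raw values are hard to compare across completions of different lengths. A common heuristic, shown here as a sketch rather than anything vLLM provides, is to divide by the number of generated tokens. `mean_token_logprob` is a hypothetical helper, the numbers are made up, and the result is a rough signal, not a calibrated confidence score.

```python
import math
from typing import Optional


def mean_token_logprob(
    cumulative_logprob: Optional[float], num_tokens: int
) -> Optional[float]:
    """Average log probability per generated token, or None if unavailable."""
    if cumulative_logprob is None or num_tokens == 0:
        return None
    return cumulative_logprob / num_tokens


# Made-up values standing in for completion.cumulative_logprob
# and len(completion.token_ids).
mean_lp = mean_token_logprob(-6.0, 12)

# exp of the mean log probability is the geometric mean of the
# per-token probabilities.
geo_mean_prob = math.exp(mean_lp)
print(f"mean logprob: {mean_lp:.3f}, geometric-mean probability: {geo_mean_prob:.3f}")
```

With a real result, the inputs would be `completion.cumulative_logprob` and `len(completion.token_ids)`; the helper returns None when no log probability is available.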

Full Pipeline: Image Captioning with Output Processing

from vllm import LLM, SamplingParams
from PIL import Image

# Setup
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    max_model_len=4096,
    limit_mm_per_prompt={"image": 1},
)

# Load image
image = Image.open("/path/to/photo.jpg").convert("RGB")

# Generate
outputs = llm.generate(
    {
        "prompt": "USER: <image>\nDescribe this image.\nASSISTANT:",
        "multi_modal_data": {"image": image},
    },
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
)

# Process output
result = outputs[0]
caption = result.outputs[0].text.strip()
num_tokens = len(result.outputs[0].token_ids)
was_truncated = result.outputs[0].finish_reason == "length"

print(f"Caption: {caption}")
print(f"Generated {num_tokens} tokens")
if was_truncated:
    print("Warning: output was truncated, consider increasing max_tokens")

Related Pages

Implements Principle
