
Implementation:Sgl project Sglang Engine Generate Multimodal

From Leeroopedia


Knowledge Sources
Domains Vision, Multimodal, Inference
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete tool for running multimodal inference over images and videos via the SGLang Engine's generate method.

Description

The Engine.generate method's image_data, video_data, and audio_data parameters enable multimodal inference. When these are provided, the engine routes the request through the model's multimodal processor (auto-detected from model config) before generation. The prompt must include appropriate image/video tokens at positions where visual information should be injected.

Usage

Call Engine.generate with image_data for image understanding tasks, video_data for video analysis, or both for complex multimodal scenarios. Ensure the prompt includes image tokens matching the number of provided images.
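Because a mismatch between image tokens and supplied images is an easy mistake, a pre-flight check can help. The sketch below assumes the model's image token is the literal string `<image>` (the actual token varies by model) and that exactly one token is expected per image:

```python
def check_image_tokens(prompt: str, image_data, token: str = "<image>") -> None:
    """Raise if the number of image tokens in the prompt does not match the
    number of images supplied. Assumes one token per image and that `token`
    is the model's image placeholder (model-dependent)."""
    n_images = 1 if isinstance(image_data, str) else len(image_data)
    n_tokens = prompt.count(token)
    if n_tokens != n_images:
        raise ValueError(
            f"prompt has {n_tokens} {token!r} token(s) "
            f"but {n_images} image(s) were given"
        )

# One token, one image: passes silently.
check_image_tokens("<image>\nDescribe this.", "cat.jpg")
```

Running this check before calling generate turns a confusing downstream processor error into an immediate, readable one.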

Code Reference

Source Location

  • Repository: sglang
  • File: python/sglang/srt/entrypoints/engine.py
  • Lines: L205-293 (generate method with multimodal params at L220)
  • Multimodal processing: python/sglang/srt/multimodal/processors/base_processor.py:L304-364

Signature

def generate(
    self,
    prompt: Optional[Union[List[str], str]] = None,
    sampling_params: Optional[Union[List[Dict], Dict]] = None,
    image_data: Optional[MultimodalDataInputFormat] = None,
    audio_data: Optional[MultimodalDataInputFormat] = None,
    video_data: Optional[MultimodalDataInputFormat] = None,
    # ... other params
) -> Union[Dict, Iterator[Dict]]:
    """Generate with multimodal inputs."""

Import

import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

I/O Contract

Inputs

Name Type Required Description
prompt str or List[str] Yes Text prompt(s) containing image/video tokens
sampling_params Dict No Sampling parameters
image_data MultimodalDataInputFormat No Image(s) — URL, PIL, base64, list
video_data MultimodalDataInputFormat No Video file path(s)
audio_data MultimodalDataInputFormat No Audio file path(s)
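Since the table lists base64 among the accepted image formats, a small helper for producing that representation may be useful. This is a sketch assuming the engine accepts a plain base64-encoded image payload as an `image_data` entry:

```python
import base64


def image_bytes_to_b64(data: bytes) -> str:
    """Encode raw image bytes as a base64 string, assuming the engine
    accepts plain base64-encoded image payloads in image_data."""
    return base64.b64encode(data).decode("utf-8")


def image_file_to_b64(path: str) -> str:
    """Read an image file from disk and return its base64 encoding."""
    with open(path, "rb") as f:
        return image_bytes_to_b64(f.read())
```

The resulting string can then be passed as `image_data` (or an element of an `image_data` list) in place of a URL or file path.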

Outputs

Name Type Description
result Dict Keys: "text" (model's response), "meta_info", token counts
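For logging or metrics it is common to pull the text and token counts out of the result dict. In the sketch below, the `meta_info` key names (`"prompt_tokens"`, `"completion_tokens"`) are assumptions based on typical SGLang outputs and may differ across versions:

```python
def summarize_output(result: dict) -> str:
    """Format a generate() result for logging. The meta_info key names
    ("prompt_tokens", "completion_tokens") are assumptions and may vary
    across SGLang versions."""
    meta = result.get("meta_info", {})
    return (
        f"text={result['text']!r} "
        f"prompt_tokens={meta.get('prompt_tokens', '?')} "
        f"completion_tokens={meta.get('completion_tokens', '?')}"
    )


# Demonstration with a mock result dict (not real engine output):
mock = {"text": "A cat.", "meta_info": {"prompt_tokens": 12, "completion_tokens": 3}}
print(summarize_output(mock))
```

Using `.get()` with defaults keeps the helper robust if a key is absent in a given SGLang version.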

Usage Examples

Single Image QA

import sglang as sgl

engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

output = engine.generate(
    prompt="<image>\nWhat objects are visible in this image?",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
    image_data="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Example.jpg",
)
print(output["text"])

Batch VLM Inference

prompts = [
    "<image>\nDescribe this image.",
    "<image>\nWhat color is the main object?",
]
image_data = [
    "image1.jpg",
    "image2.jpg",
]
outputs = engine.generate(
    prompt=prompts,
    sampling_params={"max_new_tokens": 128},
    image_data=image_data,
)
for out in outputs:
    print(out["text"])
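The signature's return type, Union[Dict, Iterator[Dict]], indicates that generate can also stream partial results. A consumer can be sketched as follows, under the assumption (version-dependent, not confirmed by this page) that streaming is enabled with a `stream=True` argument and that each chunk's "text" holds the cumulative output so far:

```python
def consume_stream(chunks) -> str:
    """Print incremental text from a stream of result dicts and return the
    final text. Assumes each chunk's "text" is the cumulative output so far
    (a convention that may vary across SGLang versions)."""
    final = ""
    for chunk in chunks:
        delta = chunk["text"][len(final):]  # new text since the last chunk
        print(delta, end="", flush=True)
        final = chunk["text"]
    print()
    return final


# Hypothetical usage with an engine:
#   stream = engine.generate(prompt="<image>\nDescribe.", image_data="cat.jpg",
#                            stream=True)
#   answer = consume_stream(stream)

# Mock demonstration (not real engine output):
mock_chunks = [{"text": "A "}, {"text": "A cat"}, {"text": "A cat."}]
assert consume_stream(mock_chunks) == "A cat."
```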

Related Pages

Implements Principle

Requires Environment
