Implementation: SGLang Engine Generate Multimodal
| Knowledge Sources | Details |
|---|---|
| Domains | Vision, Multimodal, Inference |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete reference for running multimodal inference with images, videos, and audio using the SGLang Engine.generate method.
Description
The Engine.generate method's image_data, video_data, and audio_data parameters enable multimodal inference. When these are provided, the engine routes the request through the model's multimodal processor (auto-detected from model config) before generation. The prompt must include appropriate image/video tokens at positions where visual information should be injected.
Usage
Call Engine.generate with image_data for image understanding tasks, video_data for video analysis, or both for complex multimodal scenarios. Ensure the prompt includes image tokens matching the number of provided images.
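Since a mismatch between image tokens and supplied images is an easy mistake to make, a small pre-flight check can catch it before calling the engine. The helper below is hypothetical (not part of SGLang), and the `<image>` placeholder token is model-dependent — some chat templates use a different token.

```python
# Hypothetical helper (not part of SGLang): sanity-check that a prompt
# contains one "<image>" placeholder per supplied image before calling
# Engine.generate. The placeholder token varies by model family.
def check_image_tokens(prompt: str, images, token: str = "<image>") -> None:
    n_images = len(images) if isinstance(images, list) else 1
    n_tokens = prompt.count(token)
    if n_tokens != n_images:
        raise ValueError(
            f"prompt has {n_tokens} {token!r} token(s) "
            f"but {n_images} image(s) were given"
        )

# Passes silently when counts match:
check_image_tokens("<image>\nDescribe this image.", ["photo.jpg"])
```

Running this check per prompt in a batch avoids hard-to-debug misalignment between prompts and images downstream.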
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/entrypoints/engine.py
- Lines: L205-293 (generate method with multimodal params at L220)
- Multimodal processing: python/sglang/srt/multimodal/processors/base_processor.py:L304-364
Signature
def generate(
    self,
    prompt: Optional[Union[List[str], str]] = None,
    sampling_params: Optional[Union[List[Dict], Dict]] = None,
    image_data: Optional[MultimodalDataInputFormat] = None,
    audio_data: Optional[MultimodalDataInputFormat] = None,
    video_data: Optional[MultimodalDataInputFormat] = None,
    # ... other params
) -> Union[Dict, Iterator[Dict]]:
    """Generate with multimodal inputs."""
Import
import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str \| List[str] | Yes | Text prompt(s) with image/video tokens |
| sampling_params | Dict | No | Sampling parameters |
| image_data | MultimodalDataInputFormat | No | Image(s) — URL, PIL, base64, list |
| video_data | MultimodalDataInputFormat | No | Video file path(s) |
| audio_data | MultimodalDataInputFormat | No | Audio file path(s) |
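The inputs table lists base64 among the accepted image formats. A minimal sketch of producing such a string from raw image bytes, using only the standard library; the exact set of shapes MultimodalDataInputFormat accepts (plain base64 vs. data URI, etc.) depends on the SGLang version, so treat this as illustrative rather than authoritative.

```python
import base64

# Sketch: encode raw image bytes to a base64 string suitable for
# image_data. SGLang also accepts URLs, file paths, and PIL images;
# whether it wants plain base64 or a data URI is version-dependent.
def image_bytes_to_base64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")

# Usage (assuming "cat.png" exists on disk):
# with open("cat.png", "rb") as f:
#     b64 = image_bytes_to_base64(f.read())
# engine.generate(prompt="<image>\nDescribe.", image_data=b64)
```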
Outputs
| Name | Type | Description |
|---|---|---|
| result | Dict | Keys: "text" (model's response) and "meta_info" (token counts and other generation metadata) |
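A sketch of consuming the result dict above. The "text" key holds the response; token counts live under "meta_info". The specific meta_info key names (e.g. "completion_tokens") vary across SGLang versions, so the code reads them defensively.

```python
# Sketch: summarize a generate() result of the documented shape.
# "completion_tokens" as a meta_info key is an assumption; read it
# with .get() so missing keys degrade gracefully.
def summarize_output(result: dict) -> str:
    text = result["text"]
    meta = result.get("meta_info", {})
    n_tokens = meta.get("completion_tokens", "?")
    return f"{len(text)} chars, {n_tokens} completion tokens"

# Example with a mocked result dict (no engine needed):
mock = {"text": "A cat on a mat.", "meta_info": {"completion_tokens": 7}}
print(summarize_output(mock))
```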
Usage Examples
Single Image QA
import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")
output = engine.generate(
    prompt="<image>\nWhat objects are visible in this image?",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
    image_data="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a7/Example.jpg",
)
print(output["text"])
Batch VLM Inference
prompts = [
    "<image>\nDescribe this image.",
    "<image>\nWhat color is the main object?",
]
image_data = [
    "image1.jpg",
    "image2.jpg",
]
outputs = engine.generate(
    prompt=prompts,
    sampling_params={"max_new_tokens": 128},
    image_data=image_data,
)
for out in outputs:
    print(out["text"])