Implementation:Sgl project Sglang Multimodal Data Loading
| Knowledge Sources | |
|---|---|
| Domains | Vision, Multimodal, Data_Processing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for loading and normalizing visual inputs (images, videos) for SGLang vision-language model inference.
Description
The _load_single_item method in BaseMultimodalProcessor loads an individual image supplied as a URL, file path, base64 string, or PIL image. For video inputs, get_estimated_frames_list extracts frames. The image_data parameter of Engine.generate accepts these formats directly; the processor normalizes them internally.
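To make the idea of format detection concrete, here is a minimal standard-library sketch of how a single input might be classified before loading. It is an illustration only and not SGLang's implementation: the helper name classify_image_source and its return labels are hypothetical, and the base64 check is a heuristic that can mislabel short strings made up entirely of base64 characters.

```python
import base64
import binascii
from urllib.parse import urlparse


def classify_image_source(item):
    """Return a label describing how `item` would need to be loaded.

    Hypothetical helper for illustration; not part of SGLang.
    """
    if not isinstance(item, str):
        # Already-decoded objects (for example a PIL image) pass through.
        return "object"
    if item.startswith("data:"):
        return "data_uri"
    if urlparse(item).scheme in ("http", "https"):
        return "url"
    try:
        # validate=True rejects characters outside the base64 alphabet,
        # so typical file names containing "." fall through to "path".
        base64.b64decode(item, validate=True)
        return "base64"
    except (binascii.Error, ValueError):
        return "path"
```

A real loader would follow each label with the matching action (download, decode, or open from disk); the sketch stops at classification so that it stays independent of any imaging library.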
Usage
Pass image data directly to Engine.generate(image_data=...) in any supported format. The SGLang processor handles format detection and conversion automatically.
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/multimodal/processors/base_processor.py
- Lines: L401-423 (_load_single_item), L377-399 (get_estimated_frames_list)
Signature
# Internal loading method
def _load_single_item(
    self,
    item: Union[str, PIL.Image.Image, Dict],
) -> Union[PIL.Image.Image, Dict]:
    """Load a single image from URL, path, base64, PIL, or dict format."""

# User-facing: pass image_data to Engine.generate
engine.generate(
    prompt="Describe this image",
    sampling_params={"max_new_tokens": 128},
    image_data=image_data,  # URL, PIL, base64, list of images, etc.
)
Import
# No direct import needed - used via Engine.generate image_data parameter
import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image_data | Union[str, PIL.Image, List] | Yes | Image(s) as URL, file path, base64, PIL Image, or list thereof |
| video_data | Union[str, List] | No | Video file path(s) or "video:<path>" strings |
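Because both inputs accept either a single item or a list, calling code often normalizes them to a list first. The sketch below shows one way to do that, together with stripping the "video:" marker mentioned in the table. Both helper names (as_list, strip_video_prefix) are hypothetical, and how SGLang itself treats the prefix is an assumption rather than something this page specifies.

```python
def as_list(data):
    """Wrap a single item in a list; copy lists/tuples; map None to []."""
    if data is None:
        return []
    if isinstance(data, (list, tuple)):
        return list(data)
    return [data]


def strip_video_prefix(item):
    """Drop a leading "video:" marker from a string, if present."""
    prefix = "video:"
    if isinstance(item, str) and item.startswith(prefix):
        return item[len(prefix):]
    # Non-strings and unprefixed paths are returned unchanged.
    return item
```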
Outputs
| Name | Type | Description |
|---|---|---|
| processed images | PIL.Image or tensor | Normalized images ready for model processor |
Usage Examples
Image from URL
import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")
output = engine.generate(
    prompt="<image>\nDescribe this image in detail.",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
    image_data="https://example.com/photo.jpg",
)
print(output["text"])
Multiple Images
output = engine.generate(
    prompt="<image><image>\nCompare these two images.",
    sampling_params={"max_new_tokens": 256},
    image_data=["image1.jpg", "image2.jpg"],
)
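The example above pairs two <image> placeholders with two images. A small pre-flight check along the following lines can catch a mismatch before a request is sent. This is a sketch under assumptions: check_placeholder_count is a hypothetical helper, the "<image>" token is taken from the examples on this page and may differ for other models or chat templates, and whether SGLang performs an equivalent check internally is not stated here.

```python
def check_placeholder_count(prompt, image_data, placeholder="<image>"):
    """Raise ValueError unless `prompt` has one placeholder per image."""
    if isinstance(image_data, (list, tuple)):
        images = list(image_data)
    else:
        # A single image (URL, path, base64 string, or object) counts as one.
        images = [image_data]
    found = prompt.count(placeholder)
    if found != len(images):
        raise ValueError(
            f"prompt has {found} placeholder(s) but {len(images)} image(s) were given"
        )
    return found
```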
PIL Image Object
from PIL import Image
img = Image.open("photo.jpg")
output = engine.generate(
    prompt="<image>\nWhat do you see?",
    sampling_params={"max_new_tokens": 128},
    image_data=img,
)
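Base64 Image String

The inputs table lists base64 as an accepted format, but none of the examples above produce such a string. The following sketch encodes a local file using only the standard library. encode_image_file is a hypothetical helper, and passing the bare string (without a data: URI prefix) is an assumption about what the processor expects.

```python
import base64


def encode_image_file(path):
    """Read a file in binary mode and return its contents as base64 text."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


# Intended usage, with `engine` created as in the examples above:
# image_b64 = encode_image_file("photo.jpg")
# output = engine.generate(
#     prompt="<image>\nWhat do you see?",
#     sampling_params={"max_new_tokens": 128},
#     image_data=image_b64,
# )
```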