Implementation:Sgl project Sglang Multimodal Data Loading

Knowledge Sources	SGLang
Domains	Vision, Multimodal, Data_Processing
Last Updated	2026-02-10 00:00 GMT

Overview

Concrete tool for loading and normalizing visual inputs (images, videos) for SGLang vision-language model inference.

Description

The _load_single_item method in BaseMultimodalProcessor loads an individual image supplied as a URL, file path, base64 string, PIL image, or dict. For video inputs, get_estimated_frames_list extracts frames. The image_data parameter of Engine.generate accepts these formats directly; the processor normalizes them internally.
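To make the idea of format detection concrete, here is a minimal sketch of how a string input might be sorted into one of the formats listed above. It is not SGLang's implementation: classify_image_source and its return labels are hypothetical names chosen for illustration, and the real _load_single_item may use different rules.

```python
import base64
import binascii
from urllib.parse import urlparse


def classify_image_source(item: str) -> str:
    """Guess how a string image reference should be loaded.

    Hypothetical helper for illustration only; not part of SGLang.
    """
    # Data URIs carry their own prefix, so check them first.
    if item.startswith("data:"):
        return "data_uri"

    # Remote and file URLs are identified by their scheme.
    scheme = urlparse(item).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "file":
        return "file_uri"

    # A non-empty string made only of valid base64 characters with
    # correct padding is treated as raw base64; anything else is
    # assumed to be a local file path.
    if item:
        try:
            base64.b64decode(item, validate=True)
            return "base64"
        except (binascii.Error, ValueError):
            pass
    return "path"
```

Note that a short extension-less file name such as "abcd" is also valid base64, so a heuristic like this one is ambiguous for such inputs; checking whether the path exists on disk first would be one way to resolve that.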

Usage

Pass image data directly to Engine.generate(image_data=...) in any supported format. The SGLang processor handles format detection and conversion automatically.
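When the image only exists on the client machine, one option is to send it as a base64 string. The sketch below shows one way to produce such a string using only the standard library; encode_image_file is a hypothetical helper name, not an SGLang function.

```python
import base64
from pathlib import Path


def encode_image_file(path: str) -> str:
    """Return the raw bytes of a file as an ASCII base64 string.

    Hypothetical helper for illustration only; not part of SGLang.
    """
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Intended use (assumes an existing engine and a local photo.jpg):
# engine.generate(
#     prompt="<image>\nDescribe this image.",
#     sampling_params={"max_new_tokens": 128},
#     image_data=encode_image_file("photo.jpg"),
# )
```

Base64 inflates the payload by roughly one third, so a URL or file path is usually the lighter choice when the server can reach the image directly.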

Code Reference

Source Location

Repository: sglang
File: python/sglang/srt/multimodal/processors/base_processor.py
Lines: L401-423 (_load_single_item), L377-399 (get_estimated_frames_list)

Signature

# Internal loading method
def _load_single_item(
    self,
    item: Union[str, PIL.Image.Image, Dict],
) -> Union[PIL.Image.Image, Dict]:
    """Load a single image from URL, path, base64, PIL, or dict format."""

# User-facing: pass image_data to Engine.generate
engine.generate(
    prompt="Describe this image",
    sampling_params={"max_new_tokens": 128},
    image_data=image_data,  # URL, PIL, base64, list of images, etc.
)

Import

# No direct import needed - used via Engine.generate image_data parameter
import sglang as sgl
engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

I/O Contract

Inputs

Name	Type	Required	Description
image_data	Union[str, PIL.Image.Image, List]	Yes	Image(s) as URL, file path, base64 string, PIL image, or a list of these
video_data	Union[str, List]	No	Video file path(s) or "video:<path>" strings

Outputs

Name	Type	Description
processed images	PIL.Image.Image or tensor	Normalized images ready for the model processor

Usage Examples

Image from URL

import sglang as sgl

engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

output = engine.generate(
    prompt="<image>\nDescribe this image in detail.",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
    image_data="https://example.com/photo.jpg",
)
print(output["text"])

Multiple Images

output = engine.generate(
    prompt="<image><image>\nCompare these two images.",
    sampling_params={"max_new_tokens": 256},
    image_data=["image1.jpg", "image2.jpg"],
)
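In the example above, the prompt contains one <image> placeholder per entry in image_data. A small helper can keep the two in sync. This is a sketch under the assumption that the model uses the literal <image> token, as the examples on this page do; other models may expect a different placeholder or a chat template, and build_multi_image_request is a hypothetical name.

```python
from typing import List, Tuple


def build_multi_image_request(
    question: str,
    images: List[str],
    image_token: str = "<image>",  # assumed placeholder; model-specific
) -> Tuple[str, List[str]]:
    """Build a prompt with one placeholder per image.

    Hypothetical helper for illustration only; not part of SGLang.
    Returns the prompt and a copy of the image list, ready to pass as
    prompt=... and image_data=... respectively.
    """
    prompt = image_token * len(images) + "\n" + question
    return prompt, list(images)


prompt, image_data = build_multi_image_request(
    "Compare these two images.", ["image1.jpg", "image2.jpg"]
)
```

The resulting prompt and image_data can then be passed to engine.generate in the same way as in the example above.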

PIL Image Object

from PIL import Image

img = Image.open("photo.jpg")
output = engine.generate(
    prompt="<image>\nWhat do you see?",
    sampling_params={"max_new_tokens": 128},
    image_data=img,
)

Related Pages

Implements Principle

Principle:Sgl_project_Sglang_Visual_Input_Preparation

Requires Environment

Environment:Sgl_project_Sglang_CUDA_GPU_Runtime
