Workflow: SGLang Multimodal Vision Language Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Multimodal, Vision_Language_Models |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for running inference on vision-language models (VLMs) using SGLang, supporting image and video understanding tasks.
Description
This workflow covers serving and querying multimodal models that accept both text and visual inputs (images, videos). SGLang supports a wide range of VLMs including LLaVA, LLaVA-OneVision, Qwen2-VL, Qwen3-VL, Pixtral, and others. The workflow supports both offline batch inference via the Engine API and online serving via the OpenAI-compatible vision API. Visual inputs can be provided as URLs, base64-encoded data, or local file paths.
Usage
Execute this workflow when you need to process images or videos alongside text prompts — for example, image captioning, visual question answering, document understanding, video analysis, or multimodal data extraction. Requires a supported VLM and GPU resources with sufficient memory for both the language model and the vision encoder.
Execution Steps
Step 1: Select and Load a Vision Language Model
Choose a supported VLM architecture and load it using either the SGLang server or the offline Engine. Specify the model path and set the appropriate chat template. Multi-GPU tensor parallelism is supported for large VLMs.
Key considerations:
- Use --model-path with a VLM hub ID (e.g., Qwen/Qwen2-VL-7B-Instruct)
- Chat template is auto-detected but can be overridden with --chat-template
- Large VLMs (e.g., 72B) require multi-GPU with --tp flag
- The vision encoder is loaded alongside the language model automatically
Step 2: Prepare Visual Inputs
Gather images or video frames to process. Visual inputs can be provided in multiple formats: HTTP URLs pointing to images, base64-encoded image data, or local file paths. For video inputs, frames are extracted at a configurable rate and encoded as a sequence of images.
Key considerations:
- Image URLs are fetched by the server at request time
- Base64 encoding avoids network round-trips for local images
- Video inputs require frame extraction (e.g., using decord library)
- Multiple images can be included in a single request
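Base64 encoding, mentioned above as the way to avoid a network round-trip for local images, can be sketched with the standard library. The helper name and the placeholder bytes are hypothetical; a real call would pass the contents of an actual image file.

```python
# Sketch: wrap raw image bytes in a data: URL that can be used wherever
# an image URL is accepted, so the server never has to fetch anything.
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode image bytes as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Placeholder: the 8-byte PNG signature stands in for a real image file.
url = to_data_url(b"\x89PNG\r\n\x1a\n")
```

For video, the same encoding is applied per frame after extraction (e.g. with decord), yielding a sequence of image inputs.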
Step 3: Construct Multimodal Prompts
Build prompts that combine text instructions with image placeholders. For the OpenAI-compatible API, use the content array format with text and image_url entries. For the offline Engine, use the image_token placeholder in the text prompt and pass image_data separately.
Key considerations:
- OpenAI API format: messages with content as array of text/image_url objects
- Offline Engine: use image_token from the chat template in the prompt string
- Multiple images supported via multiple image_url entries or image_data list
- Video frames are passed as multiple sequential image inputs
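The OpenAI-compatible content-array format described above can be sketched as a small builder. The question and URL are placeholders; for multi-image or video-frame requests, pass more URLs and more `image_url` entries are emitted.

```python
# Sketch: build a chat message whose content mixes image_url and text
# parts, in the OpenAI-compatible array format.

def build_messages(question: str, image_urls: list[str]) -> list[dict]:
    """Return a single user message with image parts followed by text."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Illustrative single-image request.
messages = build_messages("What is in this image?",
                          ["https://example.com/cat.png"])
```

For the offline Engine, the equivalent is a prompt string containing the chat template's image token, with the images passed separately as `image_data`.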
Step 4: Execute Inference
Submit the multimodal request to the server or Engine. The vision encoder processes the visual inputs into embeddings, which are merged with the text token embeddings at the image-placeholder positions before the language model's attention computation.
Key considerations:
- The vision encoder adds computational overhead proportional to image resolution
- Streaming is supported for real-time response delivery
- Batch processing works with mixed text-only and multimodal requests
- CUDA graphs can be enabled for the encoder to accelerate repeated calls
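The request body sent to the server's OpenAI-compatible chat endpoint can be sketched as a plain dictionary. The model name, message, and `max_tokens` value are illustrative assumptions; setting `"stream": True` requests incremental delivery as noted above.

```python
# Sketch: the JSON body POSTed to /v1/chat/completions on the SGLang
# server. Field values here are placeholders.

def build_request(model: str, messages: list[dict],
                  stream: bool = False, max_tokens: int = 256) -> dict:
    """Assemble a chat-completion request payload."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,  # True -> server streams tokens as they decode
    }

payload = build_request(
    "Qwen/Qwen2-VL-7B-Instruct",
    [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        {"type": "text", "text": "Describe the image."},
    ]}],
    stream=True,
)
```

The same payload works in a mixed batch alongside text-only requests; the scheduler handles both.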
Step 5: Process Visual Understanding Results
Extract the generated text response, which contains the model's understanding of the visual input. Responses can include image descriptions, answers to visual questions, text extracted from documents, or structured data.
Key considerations:
- Output format matches standard text generation (text field in response)
- Multimodal models may produce longer outputs for detailed visual descriptions
- Quality depends on model capability and image resolution
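Extracting the text field follows the standard chat-completion response shape. The response dict below is a minimal hypothetical example of that shape, not captured server output.

```python
# Sketch: pull the generated text out of a chat-completion response.

def extract_text(response: dict) -> str:
    """Return the assistant's text from the first choice."""
    return response["choices"][0]["message"]["content"]

# Hypothetical response, trimmed to the fields used above.
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "A tabby cat sitting on a windowsill."}}
    ]
}
text = extract_text(response)
```

Downstream parsing (e.g. of structured data the model was asked to emit) operates on this string exactly as it would for a text-only completion.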