Workflow: SGLang Multimodal Vision Language Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Multimodal, Vision_Language_Models |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for running inference on vision-language models (VLMs) using SGLang, supporting image and video understanding tasks.
Description
This workflow covers serving and querying multimodal models that accept both text and visual inputs (images, videos). SGLang supports a wide range of VLMs including LLaVA, LLaVA-OneVision, Qwen2-VL, Qwen3-VL, Pixtral, and others. The workflow supports both offline batch inference via the Engine API and online serving via the OpenAI-compatible vision API. Visual inputs can be provided as URLs, base64-encoded data, or local file paths.
Usage
Execute this workflow when you need to process images or videos alongside text prompts — for example, image captioning, visual question answering, document understanding, video analysis, or multimodal data extraction. Requires a supported VLM and GPU resources with sufficient memory for both the language model and the vision encoder.
Execution Steps
Step 1: Select and Load a Vision Language Model
Choose a supported VLM architecture and load it using either the SGLang server or the offline Engine. Specify the model path and set the appropriate chat template. Multi-GPU tensor parallelism is supported for large VLMs.
Key considerations:
- Use --model-path with a VLM hub ID (e.g., Qwen/Qwen2-VL-7B-Instruct)
- Chat template is auto-detected but can be overridden with --chat-template
- Large VLMs (e.g., 72B) require multi-GPU with --tp flag
- The vision encoder is loaded alongside the language model automatically
Step 2: Prepare Visual Inputs
Gather images or video frames to process. Visual inputs can be provided in multiple formats: HTTP URLs pointing to images, base64-encoded image data, or local file paths. For video inputs, frames are extracted at a configurable rate and encoded as a sequence of images.
Key considerations:
- Image URLs are fetched by the server at request time
- Base64 encoding avoids network round-trips for local images
- Video inputs require frame extraction (e.g., using decord library)
- Multiple images can be included in a single request
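Base64 encoding, mentioned above as the way to avoid a network round-trip for local images, can be sketched with the standard library. The helper name and the placeholder bytes are hypothetical; a real call would pass the contents of an actual image file.

```python
# Sketch: wrap raw image bytes in a data: URL that can be used wherever
# an image URL is accepted, so the server never has to fetch anything.
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode image bytes as a base64 data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

# Placeholder: the 8-byte PNG signature stands in for a real image file.
url = to_data_url(b"\x89PNG\r\n\x1a\n")
```

For video, the same encoding is applied per frame after extraction (e.g. with decord), yielding a sequence of image inputs.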
Step 3: Construct Multimodal Prompts
Build prompts that combine text instructions with image placeholders. For the OpenAI-compatible API, use the content array format with text and image_url entries. For the offline Engine, use the image_token placeholder in the text prompt and pass image_data separately.
Key considerations:
- OpenAI API format: messages with content as array of text/image_url objects
- Offline Engine: use image_token from the chat template in the prompt string
- Multiple images supported via multiple image_url entries or image_data list
- Video frames are passed as multiple sequential image inputs
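The OpenAI-compatible content-array format described above can be sketched as a small builder. The question and URL are placeholders; for multi-image or video-frame requests, pass more URLs and more `image_url` entries are emitted.

```python
# Sketch: build a chat message whose content mixes image_url and text
# parts, in the OpenAI-compatible array format.

def build_messages(question: str, image_urls: list[str]) -> list[dict]:
    """Return a single user message with image parts followed by text."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Illustrative single-image request.
messages = build_messages("What is in this image?",
                          ["https://example.com/cat.png"])
```

For the offline Engine, the equivalent is a prompt string containing the chat template's image token, with the images passed separately as `image_data`.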
Step 4: Execute Inference
Submit the multimodal request to the server or Engine. The vision encoder processes the visual inputs into embeddings, which are merged with the text token embeddings at the image-placeholder positions before the language model's attention computation.
Key considerations:
- The vision encoder adds computational overhead proportional to image resolution
- Streaming is supported for real-time response delivery
- Batch processing works with mixed text-only and multimodal requests
- CUDA graphs can be enabled for the encoder to accelerate repeated calls
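The request body sent to the server's OpenAI-compatible chat endpoint can be sketched as a plain dictionary. The model name, message, and `max_tokens` value are illustrative assumptions; setting `"stream": True` requests incremental delivery as noted above.

```python
# Sketch: the JSON body POSTed to /v1/chat/completions on the SGLang
# server. Field values here are placeholders.

def build_request(model: str, messages: list[dict],
                  stream: bool = False, max_tokens: int = 256) -> dict:
    """Assemble a chat-completion request payload."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,  # True -> server streams tokens as they decode
    }

payload = build_request(
    "Qwen/Qwen2-VL-7B-Instruct",
    [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        {"type": "text", "text": "Describe the image."},
    ]}],
    stream=True,
)
```

The same payload works in a mixed batch alongside text-only requests; the scheduler handles both.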
Step 5: Process Visual Understanding Results
Extract the generated text response, which contains the model's understanding of the visual input. Responses can include image descriptions, answers to visual questions, text extracted from documents, or structured data.
Key considerations:
- Output format matches standard text generation (text field in response)
- Multimodal models may produce longer outputs for detailed visual descriptions
- Quality depends on model capability and image resolution
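Extracting the text field follows the standard chat-completion response shape. The response dict below is a minimal hypothetical example of that shape, not captured server output.

```python
# Sketch: pull the generated text out of a chat-completion response.

def extract_text(response: dict) -> str:
    """Return the assistant's text from the first choice."""
    return response["choices"][0]["message"]["content"]

# Hypothetical response, trimmed to the fields used above.
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "A tabby cat sitting on a windowsill."}}
    ]
}
text = extract_text(response)
```

Downstream parsing (e.g. of structured data the model was asked to emit) operates on this string exactly as it would for a text-only completion.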