Principle:Sgl project Sglang Multimodal Prompt Construction
| Knowledge Sources | |
|---|---|
| Domains | Vision, Multimodal, Prompt_Engineering |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A prompt formatting pattern that inserts model-specific image/video placeholder tokens into text prompts for vision-language model input.
Description
Different VLMs use different special tokens to mark where visual information should be injected into the text stream (e.g., <image>, <|image|>, <|vision_start|>). Multimodal prompt construction handles inserting these model-specific tokens at the correct positions in the prompt. For the OpenAI-compatible API, prompts use a content array with image_url type entries instead of explicit tokens. SGLang's MultimodalSpecialTokens dataclass tracks the correct tokens for each model architecture.
Usage
Construct multimodal prompts whenever passing images or videos alongside text to a VLM. Use explicit image tokens for the Engine API, or the OpenAI content array format for the HTTP API.
Theoretical Basis
VLMs process multimodal inputs by:
- Replacing image tokens in the text with visual feature embeddings
- The visual encoder processes the image into a sequence of feature vectors
- These vectors are inserted at the image token position in the text embedding sequence
- The combined text+visual embeddings are processed by the language model
Prompt format patterns:
- Engine API: <image>\nDescribe this image.
- OpenAI API: [{"type": "image_url", ...}, {"type": "text", ...}]