Principle:Sgl project Sglang Multimodal Prompt Construction

Knowledge Sources	SGLang
Domains	Vision, Multimodal, Prompt_Engineering
Last Updated	2026-02-10 00:00 GMT

Overview

A prompt formatting pattern that inserts model-specific image/video placeholder tokens into text prompts for vision-language model input.

Description

Different VLMs use different special tokens to mark where visual information should be injected into the text stream (e.g., <image>, <|image|>, <|vision_start|>). Multimodal prompt construction handles inserting these model-specific tokens at the correct positions in the prompt. For the OpenAI-compatible API, prompts use a content array with image_url type entries instead of explicit tokens. SGLang's MultimodalSpecialTokens dataclass tracks the correct tokens for each model architecture.

Usage

Construct multimodal prompts whenever passing images or videos alongside text to a VLM. Use explicit image tokens for the Engine API, or the OpenAI content array format for the HTTP API.

Theoretical Basis

VLMs process multimodal inputs by:

Replacing image tokens in the text with visual feature embeddings
The visual encoder processes the image into a sequence of feature vectors
These vectors are inserted at the image token position in the text embedding sequence
The combined text+visual embeddings are processed by the language model

Prompt format patterns:

Engine API: <image>\nDescribe this image.
OpenAI API: [{"type": "image_url", ...}, {"type": "text", ...}]

Related Pages

Implemented By

Implementation:Sgl_project_Sglang_Multimodal_Special_Tokens

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment