Implementation:Sgl project Sglang Multimodal Special Tokens
| Knowledge Sources | |
|---|---|
| Domains | Vision, Multimodal, Prompt_Engineering |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for managing model-specific image and video placeholder tokens in SGLang multimodal prompts.
Description
The MultimodalSpecialTokens dataclass stores the special tokens for each VLM architecture as a token string plus a token ID per modality (image_token, image_token_id, video_token, video_token_id, and so on). These tokens are auto-detected from the model configuration during loading. When constructing prompts for the Engine API, users insert the token strings into the prompt text themselves. For the OpenAI API, SGLang inserts the tokens automatically based on the content array format.
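To make the difference between the two entry points concrete, the sketch below flattens an OpenAI-style content array into a single prompt string by replacing each image_url part with a placeholder. It is an illustration only: `flatten_content` is a hypothetical helper, the `<image>` default is an assumption, and SGLang's real server-side handling is not shown on this page and likely goes through the model's chat template.

```python
from typing import Any, Dict, List


def flatten_content(content: List[Dict[str, Any]], image_token: str = "<image>") -> str:
    """Hypothetical helper: replace each image_url part with `image_token`.

    Text parts are kept as-is; parts of any other type are ignored.
    """
    parts = []
    for item in content:
        if item.get("type") == "image_url":
            parts.append(image_token)
        elif item.get("type") == "text":
            parts.append(item["text"])
    return "\n".join(parts)


content = [
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    {"type": "text", "text": "What is shown in this image?"},
]
print(repr(flatten_content(content)))
```

The result has the same shape as a prompt written by hand for the Engine API, which is why the two paths can describe the same request.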
Usage
For Engine API usage, include the model's image token (typically <image>) in your prompt text at the position(s) where images should be processed. For OpenAI API usage, use the content array format and let SGLang handle token insertion.
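A minimal sketch of building such a prompt by hand, assuming the loaded model uses `<image>` as its image token. `build_engine_prompt` is a hypothetical helper, not part of SGLang; it places one placeholder line per image ahead of the question.

```python
def build_engine_prompt(question: str, num_images: int = 1, image_token: str = "<image>") -> str:
    """Hypothetical helper: prefix `question` with one placeholder line per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = [image_token] * num_images
    return "\n".join(placeholders + [question])


prompt = build_engine_prompt("What is shown in this image?")
print(repr(prompt))
```

Keeping the token as a parameter means the same helper can be reused if a different model expects a different placeholder string.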
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/multimodal/processors/base_processor.py
- Lines: L77-171 (MultimodalSpecialTokens dataclass)
Signature
@dataclass
class MultimodalSpecialTokens:
    image_token: Optional[str] = None  # e.g., "<image>"
    image_token_id: Optional[int] = None
    video_token: Optional[str] = None  # e.g., "<video>"
    video_token_id: Optional[int] = None
    audio_token: Optional[str] = None
    audio_token_id: Optional[int] = None
    # ... additional token fields
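The field layout above can be exercised with a simplified stand-in. The class below is not the SGLang class: it copies only the six fields listed in the signature, and the `configured_tokens` method and the token ID 32000 are illustrative assumptions.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class SpecialTokensSketch:
    """Simplified stand-in, NOT the SGLang class: only the fields listed above."""

    image_token: Optional[str] = None
    image_token_id: Optional[int] = None
    video_token: Optional[str] = None
    video_token_id: Optional[int] = None
    audio_token: Optional[str] = None
    audio_token_id: Optional[int] = None

    def configured_tokens(self) -> Dict[str, str]:
        """Return {modality: token string} for every modality that has a token set."""
        result = {}
        for f in fields(self):
            value = getattr(self, f.name)
            # Only the "*_token" string fields are collected; "*_token_id" is skipped.
            if f.name.endswith("_token") and value is not None:
                result[f.name[: -len("_token")]] = value
        return result


# 32000 is a made-up ID used only for illustration.
tokens = SpecialTokensSketch(image_token="<image>", image_token_id=32000)
print(tokens.configured_tokens())
```

Because every field defaults to None, a model that supports only images leaves the video and audio fields unset, and they drop out of the mapping.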
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes (Engine API) | Text prompt with image token placeholders |
| messages | List[Dict] | Yes (OpenAI-compatible HTTP API) | OpenAI-format content array with image_url entries |
Outputs
| Name | Type | Description |
|---|---|---|
| formatted_prompt | str | Prompt with correct image tokens for the loaded model |
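For the Engine API path, a quick sanity check before calling generate can catch a prompt whose placeholders do not line up with the supplied images. The sketch below assumes one placeholder per image and `<image>` as the token; `check_placeholders` is a hypothetical helper, and whether a given model strictly requires a one-to-one match is an assumption rather than something this page states.

```python
from typing import List, Optional, Union


def check_placeholders(
    prompt: str,
    image_data: Optional[Union[str, List[str]]],
    image_token: str = "<image>",
) -> int:
    """Hypothetical check: raise if placeholder count differs from image count.

    Returns the number of placeholders found when the counts agree.
    """
    if image_data is None:
        images: List[str] = []
    elif isinstance(image_data, str):
        images = [image_data]
    else:
        images = list(image_data)

    found = prompt.count(image_token)
    if found != len(images):
        raise ValueError(
            f"prompt has {found} '{image_token}' placeholder(s) but {len(images)} image(s) were given"
        )
    return found


check_placeholders("<image>\nWhat is shown in this image?", "https://example.com/photo.jpg")
```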
Usage Examples
Engine API with Image Token
import sglang as sgl

engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

# Use <image> token as placeholder
output = engine.generate(
    prompt="<image>\nWhat is shown in this image?",
    sampling_params={"max_new_tokens": 128, "temperature": 0},
    image_data="https://example.com/photo.jpg",
)
print(output["text"])
OpenAI API with Content Array
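from openai import OpenAI

# Assumption: an SGLang server is already running locally on port 30000;
# adjust base_url to match your deployment. The API key is a placeholder.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# No image token in the text: SGLang inserts it from the content array
response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)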