Implementation:Sgl project Sglang Multimodal Special Tokens

From Leeroopedia
Revision as of 16:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Sgl_project_Sglang_Multimodal_Special_Tokens.md)


Knowledge Sources
Domains: Vision, Multimodal, Prompt_Engineering
Last Updated: 2026-02-10 00:00 GMT

Overview

Concrete tool for managing model-specific image and video placeholder tokens in SGLang multimodal prompts.

Description

The MultimodalSpecialTokens dataclass stores the special placeholder tokens (image_token, image_token_id, video_token, and so on) for each VLM architecture. These tokens are auto-detected from the model configuration when the model is loaded. With the Engine API, users place these tokens in the prompt manually. With the OpenAI-compatible API, SGLang inserts the tokens automatically based on the content array.

Usage

For Engine API usage, include the model's image token (often <image>, though the exact string depends on the model) in your prompt text at each position where an image should be processed. For OpenAI API usage, use the content array format and let SGLang handle token insertion.
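As a minimal sketch of the Engine API convention described above, the helper below prefixes a question with one placeholder per image. The function name build_engine_prompt and the default "<image>" token are assumptions made for illustration; they are not part of SGLang, and the correct token string should be taken from the loaded model.

```python
def build_engine_prompt(question: str, num_images: int, image_token: str = "<image>") -> str:
    """Prefix the question with one placeholder per image, each on its own line.

    Hypothetical helper: the default token is an assumption, not a value
    read from any model configuration.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = [image_token] * num_images
    return "\n".join(placeholders + [question])


prompt = build_engine_prompt("What differs between these images?", num_images=2)
print(prompt)
```

Keeping the placeholder count equal to the number of images supplied makes a mismatch between prompt and inputs easier to spot before the request is sent.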

Code Reference

Source Location

  • Repository: sglang
  • File: python/sglang/srt/multimodal/processors/base_processor.py
  • Lines: L77-171 (MultimodalSpecialTokens dataclass)

Signature

@dataclass
class MultimodalSpecialTokens:
    image_token: Optional[str] = None      # e.g., "<image>"
    image_token_id: Optional[int] = None
    video_token: Optional[str] = None      # e.g., "<video>"
    video_token_id: Optional[int] = None
    audio_token: Optional[str] = None
    audio_token_id: Optional[int] = None
    # ... additional token fields
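To make the role of these fields concrete, here is a small self-contained sketch. PlaceholderTokens is a hypothetical stand-in that mirrors only a few fields from the signature above, and count_in is an illustrative method, not something the SGLang class is known to provide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PlaceholderTokens:
    """Hypothetical stand-in mirroring a few fields of MultimodalSpecialTokens."""

    image_token: Optional[str] = None
    video_token: Optional[str] = None
    audio_token: Optional[str] = None

    def count_in(self, prompt: str) -> Dict[str, int]:
        """Count occurrences of each configured placeholder in the prompt."""
        tokens = {
            "image": self.image_token,
            "video": self.video_token,
            "audio": self.audio_token,
        }
        # Modalities whose token is None are skipped entirely.
        return {
            modality: prompt.count(token)
            for modality, token in tokens.items()
            if token is not None
        }


tokens = PlaceholderTokens(image_token="<image>", video_token="<video>")
counts = tokens.count_in("<image>\n<image>\nCompare these two pictures.")
print(counts)  # {'image': 2, 'video': 0}
```

Leaving every field optional, as the signature does, lets a text-plus-image model simply omit the video and audio entries.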

I/O Contract

Inputs

Name      Type        Required  Description
prompt    str         Yes       Text prompt with image token placeholders (Engine API)
messages  List[Dict]  Yes       OpenAI-format content array with image_url entries (HTTP API)

Outputs

Name              Type  Description
formatted_prompt  str   Prompt with correct image tokens for the loaded model
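The mapping from a content array to a formatted prompt can be pictured roughly as follows. This is an illustrative sketch only: flatten_content is a hypothetical function, the "<image>" default is an assumption, and the real server also applies the model's chat template, which is omitted here.

```python
from typing import Any, Dict, List, Tuple


def flatten_content(
    content: List[Dict[str, Any]], image_token: str = "<image>"
) -> Tuple[str, List[str]]:
    """Turn an OpenAI-style content array into (prompt_text, image_urls).

    Hypothetical illustration; not SGLang's actual conversion logic.
    """
    parts: List[str] = []
    image_urls: List[str] = []
    for item in content:
        if item["type"] == "image_url":
            # Each image becomes one placeholder at its position in the array.
            parts.append(image_token)
            image_urls.append(item["image_url"]["url"])
        elif item["type"] == "text":
            parts.append(item["text"])
        else:
            raise ValueError(f"unsupported content type: {item['type']!r}")
    return "\n".join(parts), image_urls


content = [
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    {"type": "text", "text": "What is shown in this image?"},
]
formatted_prompt, urls = flatten_content(content)
print(formatted_prompt)
```

The placeholder order follows the order of entries in the array, so placing the image entry before the text entry puts the image token first in the prompt.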

Usage Examples

Engine API with Image Token

import sglang as sgl

engine = sgl.Engine(model_path="llava-hf/llava-onevision-qwen2-7b-ov-hf")

# Use <image> token as placeholder
output = engine.generate(
    prompt="<image>\nWhat is shown in this image?",
    sampling_params={"max_new_tokens": 128, "temperature": 0},
    image_data="https://example.com/photo.jpg",
)
print(output["text"])

OpenAI API with Content Array

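from openai import OpenAI

# Assumes an SGLang server is already running locally. The port and the
# placeholder API key are assumptions; adjust them to your deployment.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# No image token appears in the text; the server inserts it from the content array.
response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-7b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)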

Related Pages

Implements Principle
