Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Sgl project Sglang Multimodal Prompt Construction

From Leeroopedia


Knowledge Sources
Domains Vision, Multimodal, Prompt_Engineering
Last Updated 2026-02-10 00:00 GMT

Overview

A prompt formatting pattern that inserts model-specific image/video placeholder tokens into text prompts for vision-language model input.

Description

Different VLMs use different special tokens to mark where visual information should be injected into the text stream (e.g., <image>, <|image|>, <|vision_start|>). Multimodal prompt construction handles inserting these model-specific tokens at the correct positions in the prompt. For the OpenAI-compatible API, prompts use a content array with image_url type entries instead of explicit tokens. SGLang's MultimodalSpecialTokens dataclass tracks the correct tokens for each model architecture.

Usage

Construct multimodal prompts whenever passing images or videos alongside text to a VLM. Use explicit image tokens for the Engine API, or the OpenAI content array format for the HTTP API.

Theoretical Basis

VLMs process multimodal inputs by:

  1. Replacing image tokens in the text with visual feature embeddings
  2. The visual encoder processes the image into a sequence of feature vectors
  3. These vectors are inserted at the image token position in the text embedding sequence
  4. The combined text+visual embeddings are processed by the language model

Prompt format patterns:

  • Engine API: <image>\nDescribe this image.
  • OpenAI API: [{"type": "image_url", ...}, {"type": "text", ...}]

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment