Principle:Deepseek ai Janus Input Tokenization and Batching

Knowledge Sources	Janus, Janus: Decoupling Visual Encoding
Domains	NLP, Multimodal_AI
Last Updated	2026-02-10 09:30 GMT

Overview

A procedure for converting conversation text and images into tokenized, batched tensor inputs with special image token interleaving and mask construction for multimodal model inference.

Description

Input tokenization and batching is the central preprocessing step that transforms raw conversation messages and PIL images into the tensor format required by the multimodal model. This involves:

SFT template application: Format the conversation with the DeepSeek SFT prompt template.
Tokenization: Encode the formatted prompt into token IDs.
Image token insertion: Replace each <image_placeholder> token with the sequence <begin_of_image> + 576 image tokens + <end_of_image>.
Image preprocessing: Resize, rescale, and normalize the images via VLMImageProcessor.
Batching: Left-pad the sequences to a common length and construct the attention masks, image sequence masks, and image embedding masks.

The output is a BatchedVLChatProcessorOutput containing all tensors needed for the model's prepare_inputs_embeds method.
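
The image token insertion step can be illustrated with a minimal sketch in plain Python. The token IDs below are hypothetical (the real IDs come from the tokenizer), the helper name expand_image_placeholders is not part of the Janus API, and the sketch assumes that the 576 image slots reuse the placeholder ID so they can be located later.

```python
from typing import List

# Hypothetical token IDs chosen for illustration; real IDs come from the tokenizer.
IMAGE_PLACEHOLDER_ID = 100
BEGIN_OF_IMAGE_ID = 101
END_OF_IMAGE_ID = 102
NUM_IMAGE_TOKENS = 576


def expand_image_placeholders(
    input_ids: List[int],
    num_image_tokens: int = NUM_IMAGE_TOKENS,
) -> List[int]:
    """Replace each placeholder with <begin_of_image> + N image tokens + <end_of_image>."""
    expanded: List[int] = []
    for token_id in input_ids:
        if token_id == IMAGE_PLACEHOLDER_ID:
            expanded.append(BEGIN_OF_IMAGE_ID)
            # Assumption: image slots reuse the placeholder ID so a mask can find them.
            expanded.extend([IMAGE_PLACEHOLDER_ID] * num_image_tokens)
            expanded.append(END_OF_IMAGE_ID)
        else:
            expanded.append(token_id)
    return expanded


# Four text tokens plus one placeholder.
prompt_ids = [1, 7, IMAGE_PLACEHOLDER_ID, 8, 9]
expanded_ids = expand_image_placeholders(prompt_ids)
print(len(expanded_ids))  # 4 text tokens + 1 + 576 + 1 = 582
```

Because every image expands to the same number of tokens, the sequence length after insertion depends only on the text length and the number of images.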

Usage

Use this principle after loading conversation messages and images, and before calling the model's embedding fusion method. It is the bridge between raw inputs and tensor-ready model inputs.

Theoretical Basis

The key insight is the image token interleaving strategy:

Each <image_placeholder> in the tokenized sequence is replaced by a fixed-length block of 576 image tokens. This matches the vision encoder's output length for a 384x384 image with 16x16 patches: 384 / 16 = 24 patches per side, and 24 x 24 = 576 patches in total
Special boundary tokens (<begin_of_image>, <end_of_image>) mark the image region
Boolean masks track which positions in the token sequence correspond to images (images_seq_mask) and which image embeddings are valid (images_emb_mask)
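
The points above can be sketched in plain Python: the token count follows from the patch arithmetic, and both masks are simple Boolean structures. This is a sketch under assumptions, not the library's implementation: the token ID and helper names are hypothetical, and the nested-list layout of the embedding mask (sample, image slot, token) is an assumed shape chosen for illustration.

```python
from typing import List

# 384 / 16 = 24 patches per side, so 24 * 24 = 576 patches per image.
IMAGE_SIZE, PATCH_SIZE = 384, 16
NUM_IMAGE_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2

IMAGE_TOKEN_ID = 100  # hypothetical ID for an image slot


def build_images_seq_mask(input_ids: List[int]) -> List[bool]:
    """True at every sequence position occupied by an image token."""
    return [token_id == IMAGE_TOKEN_ID for token_id in input_ids]


def build_images_emb_mask(
    images_per_sample: List[int], tokens_per_image: int
) -> List[List[List[bool]]]:
    """True for embedding rows backed by a real image, False for padded image slots."""
    max_images = max(images_per_sample, default=0)
    return [
        [[i < n_images] * tokens_per_image for i in range(max_images)]
        for n_images in images_per_sample
    ]


# Toy sequence with a 4-token image block (instead of 576) to keep it readable.
toy_ids = [1, 101, 100, 100, 100, 100, 102, 8]
seq_mask = build_images_seq_mask(toy_ids)

# Two samples: the first has one image, the second has two.
emb_mask = build_images_emb_mask(images_per_sample=[1, 2], tokens_per_image=4)

print(NUM_IMAGE_TOKENS)  # 576
print(sum(seq_mask))     # 4
```

In this sketch the number of True entries in the sequence mask equals the number of True entries in the embedding mask for the same sample, which is what allows valid image embeddings to be scattered into the image positions one-to-one.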

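Left-padding is used for batching (rather than right-padding) because generation is autoregressive: each new token is appended after the last position of the sequence. Padding on the left keeps the final real token of every sequence at the end of the batch tensor, so the first generated token directly follows real content rather than padding.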
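
A minimal sketch of left-padded batching in plain Python follows. The pad ID and the helper name left_pad_batch are hypothetical; the actual processor works on tensors and pads the image sequence mask in the same way.

```python
from typing import List, Sequence, Tuple

PAD_ID = 0  # hypothetical pad token ID


def left_pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad sequences to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    batch_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes in front, so the last real token stays in the final column.
        batch_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return batch_ids, attention_mask


batch_ids, attention_mask = left_pad_batch([[5, 6, 7], [8]])
print(batch_ids)       # [[5, 6, 7], [0, 0, 8]]
print(attention_mask)  # [[1, 1, 1], [0, 0, 1]]
```

With this layout the final column of the batch holds the last real token of every sequence, so generated tokens can be appended to all rows at the same position.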

Related Pages

Implemented By

Implementation:Deepseek_ai_Janus_VLChatProcessor_Call
