Principle:Deepseek ai Janus Input Tokenization and Batching
| Knowledge Sources | |
|---|---|
| Domains | NLP, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A procedure for converting conversation text and images into tokenized, batched tensor inputs with special image token interleaving and mask construction for multimodal model inference.
Description
Input tokenization and batching is the central preprocessing step that transforms raw conversation messages and PIL images into the tensor format required by the multimodal model. This involves:
- SFT template application: Format conversation into the DeepSeek prompt template
- Tokenization: Encode the formatted prompt into token IDs
- Image token insertion: Replace each <image_placeholder> token with a sequence of <begin_of_image> + 576 image tokens + <end_of_image>
- Image preprocessing: Resize, rescale, and normalize images via VLMImageProcessor
- Batching: Left-pad sequences and construct attention masks, image sequence masks, and image embedding masks
The output is a BatchedVLChatProcessorOutput containing all tensors needed for the model's prepare_inputs_embeds method.
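The rescale and normalize part of the image preprocessing step can be illustrated with a small pure-Python sketch. The function name, the 1/255 rescale factor, and the mean and standard deviation of 0.5 are assumptions chosen for illustration. The values VLMImageProcessor actually uses come from its own configuration and are not stated here. Resizing is omitted.

```python
def rescale_and_normalize(pixels, mean=0.5, std=0.5, rescale_factor=1 / 255):
    """Rescale 8-bit values to floats, then standardize them.

    pixels: one image channel as a nested list [row][col] of ints in 0..255.
    mean, std, rescale_factor: hypothetical defaults, not the library's values.
    """
    return [
        [(p * rescale_factor - mean) / std for p in row]
        for row in pixels
    ]


# A 1x3 single-channel "image": black, mid-gray, white.
channel = [[0, 128, 255]]
normalized = rescale_and_normalize(channel)
print(normalized)
```

With these assumed parameters, 0 maps to roughly -1 and 255 to roughly 1, so the normalized values are centered around zero.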
Usage
Use this principle after loading conversation messages and images, and before calling the model's embedding fusion method. It is the bridge between raw inputs and tensor-ready model inputs.
Theoretical Basis
The key insight is the image token interleaving strategy:
- Each <image_placeholder> in the tokenized sequence is replaced by a fixed-length block of 576 image tokens. This matches the vision encoder's output length for a 384x384 image split into 16x16 patches: 384 / 16 = 24 patches per side, and 24 x 24 = 576.
- Special boundary tokens (<begin_of_image>, <end_of_image>) mark the image region
- Boolean masks track which positions in the token sequence correspond to images (images_seq_mask) and which image embeddings are valid (images_emb_mask)
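The interleaving strategy above can be sketched in plain Python. This is a minimal illustration, not the processor's actual code: the token IDs and the function name are hypothetical, and filling the 576 image slots with the placeholder ID is an assumption made for the sketch.

```python
# Hypothetical token IDs, chosen only for illustration.
BOI_ID = 100          # stands in for <begin_of_image>
EOI_ID = 101          # stands in for <end_of_image>
PLACEHOLDER_ID = 102  # stands in for <image_placeholder>

IMAGE_SIZE = 384
PATCH_SIZE = 16
NUM_IMAGE_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 24 * 24 = 576


def expand_image_placeholders(input_ids):
    """Replace each placeholder with BOI + 576 image slots + EOI.

    Returns the expanded ID list and a parallel boolean mask that is True
    only at image slots; the two boundary tokens stay False.
    """
    expanded, seq_mask = [], []
    for tok in input_ids:
        if tok == PLACEHOLDER_ID:
            expanded.append(BOI_ID)
            seq_mask.append(False)
            # Assumption: image slots reuse the placeholder ID.
            expanded.extend([PLACEHOLDER_ID] * NUM_IMAGE_TOKENS)
            seq_mask.extend([True] * NUM_IMAGE_TOKENS)
            expanded.append(EOI_ID)
            seq_mask.append(False)
        else:
            expanded.append(tok)
            seq_mask.append(False)
    return expanded, seq_mask


# Three text tokens plus one placeholder: 3 + 1 + 576 + 1 = 581 positions.
ids, mask = expand_image_placeholders([1, 2, PLACEHOLDER_ID, 3])
print(len(ids), sum(mask))  # prints: 581 576
```

The mask built here plays the role of images_seq_mask for a single sequence: it marks exactly the positions whose text embeddings would later be overwritten by image embeddings.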
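Left-padding is used for batching (rather than right-padding) because generation is autoregressive: new tokens are appended at the right end of the sequence. With left-padding, every prompt in the batch ends at the same final position, so no pad tokens sit between a prompt and the tokens generated after it.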
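A minimal sketch of left-padded batching, using plain Python lists in place of tensors. The pad ID and the function name are hypothetical, and the real processor returns tensors rather than lists.

```python
PAD_ID = 0  # hypothetical padding token ID


def left_pad_batch(sequences, seq_masks, pad_id=PAD_ID):
    """Left-pad variable-length ID lists to a common length.

    sequences: list of token-ID lists.
    seq_masks: parallel list of boolean image-position masks.
    Returns (input_ids, attention_mask, images_seq_mask) as equal-length rows.
    Padding positions get attention 0 and image mask False.
    """
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask, images_seq_mask = [], [], []
    for ids, img_mask in zip(sequences, seq_masks):
        pad = max_len - len(ids)
        input_ids.append([pad_id] * pad + list(ids))
        attention_mask.append([0] * pad + [1] * len(ids))
        images_seq_mask.append([False] * pad + list(img_mask))
    return input_ids, attention_mask, images_seq_mask


ids, attn, img = left_pad_batch(
    [[5, 6, 7], [8]],
    [[False, True, False], [False]],
)
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(attn)  # [[1, 1, 1], [0, 0, 1]]
```

Every row ends with a real token, so the next generated token for each sample follows its prompt directly.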