Implementation:Deepseek ai Janus VLChatProcessor Call

Knowledge Sources	Janus
Domains	NLP, Multimodal_AI
Last Updated	2026-02-10 09:30 GMT

Overview

Concrete tool, provided by the Janus VLChatProcessor, for tokenizing multimodal conversations, interleaving image tokens into the token sequence, and batching the results.

Description

The VLChatProcessor.__call__ method is the main entry point for preprocessing multimodal inputs. It delegates to two helpers: process_one, which applies SFT formatting, tokenizes the text, and inserts image tokens; and batchify, which left-pads the sequences to a common length and builds the batch tensors together with all required masks.
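The left-padding step attributed to batchify can be illustrated with a minimal sketch. This is not the library's code: the function name left_pad_batch, the pad_id value, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
from typing import Dict, List


def left_pad_batch(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Left-pad token-ID sequences to a common length and build the attention mask.

    Hypothetical helper: it only mirrors the idea of left-padding, using lists
    instead of tensors.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes on the left so real tokens end at the last position.
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = left_pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [0, 8, 9]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 1, 1]]
```

Left-padding keeps the final real token of every sequence in the last column, which is convenient when generation continues from the end of each prompt.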

Usage

Call the VLChatProcessor instance directly, using keyword arguments only, with a list of conversation dicts and the matching PIL images. Because force_batchify defaults to True, a single input is automatically wrapped into a batch of size 1.
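The control flow described above can be sketched as follows. This is a hedged illustration rather than the actual method body: call_sketch, fake_process_one, and fake_batchify are hypothetical stand-ins, and the error type raised for conflicting arguments is an assumption.

```python
from typing import Any, Callable, Dict, List, Optional


def call_sketch(
    process_one: Callable[..., Dict[str, Any]],
    batchify: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
    *,
    prompt: Optional[str] = None,
    conversations: Optional[List[Dict[str, str]]] = None,
    images: Optional[list] = None,
    force_batchify: bool = True,
) -> Dict[str, Any]:
    """Hypothetical outline of the dispatch: process one input, then optionally batch it."""
    # prompt and conversations are documented as mutually exclusive.
    if prompt is not None and conversations is not None:
        raise ValueError("pass either prompt or conversations, not both")
    single = process_one(prompt=prompt, conversations=conversations, images=images or [])
    # With force_batchify=True the single result becomes a batch of size 1.
    return batchify([single]) if force_batchify else single


def fake_process_one(prompt, conversations, images):
    # Stand-in for process_one: no real formatting or tokenization happens here.
    text = prompt if prompt is not None else " ".join(m["content"] for m in conversations)
    return {"sft_format": text, "n_images": len(images)}


def fake_batchify(items):
    # Stand-in for batchify: collects per-item fields into lists.
    return {"sft_format": [it["sft_format"] for it in items], "batch_size": len(items)}


batched = call_sketch(fake_process_one, fake_batchify, prompt="hello")
unbatched = call_sketch(fake_process_one, fake_batchify, prompt="hello", force_batchify=False)
print(batched)    # {'sft_format': ['hello'], 'batch_size': 1}
print(unbatched)  # {'sft_format': 'hello', 'n_images': 0}
```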

Code Reference

Source Location

Repository: Janus
File: janus/models/processing_vlm.py
Lines: L322-355 (__call__), L260-320 (process_one), L357-418 (batchify)

Signature

class VLChatProcessor(ProcessorMixin):
    def __call__(
        self,
        *,
        prompt: str = None,
        conversations: List[Dict[str, str]] = None,
        images: List[Image] = None,
        force_batchify: bool = True,
        **kwargs,
    ) -> BatchedVLChatProcessorOutput:
        """
        Args:
            prompt: Pre-formatted prompt string (mutually exclusive with conversations)
            conversations: List of message dicts with "role", "content", optional "images"
            images: List of PIL images corresponding to <image_placeholder> tokens
            force_batchify: Whether to auto-batch a single input (default True)

        Returns:
            BatchedVLChatProcessorOutput with fields:
                input_ids [b, T], attention_mask [b, T],
                pixel_values [b, n_images, 3, H, W],
                images_seq_mask [b, T], images_emb_mask [b, n_images, n_tokens],
                sft_format [b]
        """

Import

from janus.models import VLChatProcessor

I/O Contract

Inputs

Name	Type	Required	Description
conversations	List[Dict[str, str]]	Yes*	Message dicts with role, content, optional images keys
images	List[PIL.Image.Image]	Yes*	PIL images matching <image_placeholder> tokens in content
prompt	str	No	Pre-formatted prompt (alternative to conversations)
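force_batchify	bool	No	Auto-batch single input (default True)
* conversations and prompt are mutually exclusive; supply one of them. images must match the <image_placeholder> tokens that appear in the content.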

Outputs

Name	Type	Description
input_ids	torch.LongTensor [b, T]	Tokenized IDs with image tokens interleaved
attention_mask	torch.LongTensor [b, T]	1 for real tokens, 0 for padding
pixel_values	torch.FloatTensor [b, n_images, 3, H, W]	Preprocessed image tensors
images_seq_mask	torch.BoolTensor [b, T]	True at image token positions
images_emb_mask	torch.BoolTensor [b, n_images, n_tokens]	True for valid image embeddings
sft_format	List[str]	Formatted prompt strings
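The relationship between input_ids and images_seq_mask can be sketched with plain lists. This is an assumption-laden illustration, not the processor's code: expand_image_placeholders, the token IDs, and the tokens_per_image value are all hypothetical, and the real processor may wrap each run of image tokens in additional boundary tokens that are not shown here.

```python
from typing import List, Tuple


def expand_image_placeholders(
    token_ids: List[int],
    placeholder_id: int,
    image_token_id: int,
    tokens_per_image: int,
) -> Tuple[List[int], List[bool]]:
    """Replace each placeholder with a run of image tokens and mark those positions.

    Hypothetical helper for illustration only.
    """
    expanded: List[int] = []
    seq_mask: List[bool] = []
    for tok in token_ids:
        if tok == placeholder_id:
            # One placeholder becomes tokens_per_image image-token slots.
            expanded.extend([image_token_id] * tokens_per_image)
            seq_mask.extend([True] * tokens_per_image)
        else:
            expanded.append(tok)
            seq_mask.append(False)
    return expanded, seq_mask


ids, mask = expand_image_placeholders(
    [1, 99, 2], placeholder_id=99, image_token_id=7, tokens_per_image=3
)
print(ids)   # [1, 7, 7, 7, 2]
print(mask)  # [False, True, True, True, False]
```

The mask marks exactly the positions where image embeddings replace text embeddings, so its count of True entries per row equals the number of images times the tokens per image.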

Usage Examples

Multimodal Understanding

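import torch
from transformers import AutoModelForCausalLM

from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

# Assumed setup (not part of the original snippet): any Janus checkpoint that
# uses the "User"/"Assistant" role names; the path below is one such example.
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["./examples/image.png"],
    },
    # The empty assistant turn marks where the model's reply will begin.
    {"role": "Assistant", "content": ""},
]

# load_pil_images opens the paths listed under each message's "images" key.
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# prepare_inputs now holds every tensor the model needs:
# input_ids: [1, T], with image tokens interleaved
# pixel_values: [1, 1, 3, 384, 384]
# images_seq_mask: [1, T], True at the 576 image-token positions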
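The figure of 576 image positions for a 384 x 384 input is consistent with a vision encoder that splits the image into square patches. The sketch below shows the arithmetic; the patch size of 16 is an assumption not stated on this page, and tokens_per_image is a hypothetical helper.

```python
def tokens_per_image(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge
    return side * side


# Assumed patch size of 16: 384 / 16 = 24 patches per side, 24 * 24 = 576 tokens.
print(tokens_per_image(384, 16))  # 576
```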

Related Pages

Implements Principle

Principle:Deepseek_ai_Janus_Input_Tokenization_and_Batching
