
Implementation:Deepseek ai Janus VLChatProcessor Call

From Leeroopedia
Revision as of 14:45, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Deepseek_ai_Janus_VLChatProcessor_Call.md)


Knowledge Sources
Domains NLP, Multimodal_AI
Last Updated 2026-02-10 09:30 GMT

Overview

A concrete tool, provided by the Janus VLChatProcessor, for tokenizing multimodal conversations, interleaving image tokens with the text tokens, and batching the result.

Description

The VLChatProcessor.__call__ method is the main entry point for preprocessing multimodal inputs. It delegates to process_one, which applies the SFT format, tokenizes the text, and inserts the image tokens. When force_batchify is true, it then calls batchify, which left-pads the sequences and builds the batch tensors together with their masks (attention_mask, images_seq_mask, images_emb_mask).
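The image token insertion step can be pictured with a small pure-Python sketch. It assumes each placeholder token expands into a fixed number of image tokens (576 in the usage example on this page), bracketed by a start and an end marker. The function name, the bracket markers, and every token ID below are illustrative assumptions, not values taken from the Janus tokenizer.

```python
from typing import List, Tuple

# Hypothetical token IDs for illustration; real values come from the tokenizer.
IMAGE_ID = 100        # id of the <image_placeholder> token
IMAGE_START_ID = 101  # id of an assumed begin-of-image marker
IMAGE_END_ID = 102    # id of an assumed end-of-image marker


def expand_image_tokens(
    input_ids: List[int], num_image_tokens: int = 576
) -> Tuple[List[int], List[int]]:
    """Replace each placeholder id with start + N image ids + end."""
    expanded: List[int] = []
    per_image_counts: List[int] = []
    for token in input_ids:
        if token == IMAGE_ID:
            expanded.append(IMAGE_START_ID)
            expanded.extend([IMAGE_ID] * num_image_tokens)
            expanded.append(IMAGE_END_ID)
            per_image_counts.append(num_image_tokens)
        else:
            # Ordinary text tokens pass through unchanged.
            expanded.append(token)
    return expanded, per_image_counts


# Tiny example with 4 image tokens per image so the result is easy to read.
ids, counts = expand_image_tokens([1, IMAGE_ID, 7, 8], num_image_tokens=4)
print(ids)     # [1, 101, 100, 100, 100, 100, 102, 7, 8]
print(counts)  # [4]
```

The per-image counts are returned alongside the expanded IDs because a later batching step needs them to mark which image embedding slots are valid.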

Usage

Call the VLChatProcessor instance directly, passing the conversation dicts and PIL images as keyword arguments (the signature is keyword-only). Because force_batchify defaults to True, a single input is automatically wrapped into a batch of size 1.

Code Reference

Source Location

  • Repository: Janus
  • File: janus/models/processing_vlm.py
  • Lines: L322-355 (__call__), L260-320 (process_one), L357-418 (batchify)

Signature

class VLChatProcessor(ProcessorMixin):
    def __call__(
        self,
        *,
        prompt: str = None,
        conversations: List[Dict[str, str]] = None,
        images: List[Image] = None,
        force_batchify: bool = True,
        **kwargs,
    ) -> BatchedVLChatProcessorOutput:
        """
        Args:
            prompt: Pre-formatted prompt string (mutually exclusive with conversations)
            conversations: List of message dicts with "role", "content", optional "images"
            images: List of PIL images corresponding to <image_placeholder> tokens
            force_batchify: Whether to auto-batch a single input (default True)

        Returns:
            BatchedVLChatProcessorOutput with fields:
                input_ids [b, T], attention_mask [b, T],
                pixel_values [b, n_images, 3, H, W],
                images_seq_mask [b, T], images_emb_mask [b, n_images, n_tokens],
                sft_format [b]
        """

Import

from janus.models import VLChatProcessor

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| conversations | List[Dict[str, str]] | Yes* | Message dicts with role, content, optional images keys |
| images | List[PIL.Image.Image] | Yes* | PIL images matching <image_placeholder> tokens in content |
| prompt | str | No | Pre-formatted prompt (alternative to conversations) |
| force_batchify | bool | No | Auto-batch single input (default True) |
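Because the images are expected to match the <image_placeholder> tokens in the message content, a quick check before calling the processor can catch mismatches early. The helpers below are hypothetical and not part of the Janus package. They only count placeholder strings and compare that count with the number of images supplied.

```python
from typing import Dict, List, Sequence

PLACEHOLDER = "<image_placeholder>"


def count_placeholders(conversations: List[Dict[str, object]]) -> int:
    """Count placeholder occurrences across all message contents."""
    return sum(
        str(message.get("content", "")).count(PLACEHOLDER)
        for message in conversations
    )


def check_image_count(
    conversations: List[Dict[str, object]], images: Sequence[object]
) -> None:
    """Raise ValueError if placeholders and images do not line up."""
    expected = count_placeholders(conversations)
    if expected != len(images):
        raise ValueError(
            f"{expected} placeholder(s) in content but {len(images)} image(s) given"
        )


conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe this image."},
    {"role": "Assistant", "content": ""},
]
print(count_placeholders(conversation))  # 1
check_image_count(conversation, ["stand-in for one PIL image"])  # passes silently
```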

Outputs

| Name | Type | Description |
|---|---|---|
| input_ids | torch.LongTensor [b, T] | Tokenized IDs with image tokens interleaved |
| attention_mask | torch.LongTensor [b, T] | 1 for real tokens, 0 for padding |
| pixel_values | torch.FloatTensor [b, n_images, 3, H, W] | Preprocessed image tensors |
| images_seq_mask | torch.BoolTensor [b, T] | True at image token positions |
| images_emb_mask | torch.BoolTensor [b, n_images, n_tokens] | True for valid image embeddings |
| sft_format | List[str] | Formatted prompt strings |
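To make the output shapes concrete, here is a minimal sketch of left-padding and mask construction. It uses plain Python lists rather than tensors. The function name, the padding ID, and the image token ID are assumptions chosen for illustration, and the real batchify may differ in its details.

```python
from typing import Dict, List

PAD_ID = 0      # hypothetical padding id
IMAGE_ID = 100  # hypothetical image token id


def left_pad_batch(
    sequences: List[List[int]],
    image_counts: List[List[int]],
    n_tokens: int,
) -> Dict[str, list]:
    """Left-pad sequences to a common length T and build the three masks."""
    max_len = max(len(seq) for seq in sequences)
    # Keep at least one image slot so the mask shape stays [b, n_images, n_tokens].
    max_images = max(1, max(len(counts) for counts in image_counts))
    batch: Dict[str, list] = {
        "input_ids": [],
        "attention_mask": [],
        "images_seq_mask": [],
        "images_emb_mask": [],
    }
    for seq, counts in zip(sequences, image_counts):
        pad = max_len - len(seq)
        batch["input_ids"].append([PAD_ID] * pad + seq)
        batch["attention_mask"].append([0] * pad + [1] * len(seq))
        batch["images_seq_mask"].append(
            [False] * pad + [token == IMAGE_ID for token in seq]
        )
        # One row per image slot; True for the embeddings that image uses.
        emb_mask = [[False] * n_tokens for _ in range(max_images)]
        for j, n in enumerate(counts):
            emb_mask[j][:n] = [True] * n
        batch["images_emb_mask"].append(emb_mask)
    return batch


out = left_pad_batch(
    sequences=[[5, 100, 100, 6], [7, 8]],  # second sample has no image
    image_counts=[[2], []],
    n_tokens=2,
)
print(out["input_ids"])       # [[5, 100, 100, 6], [0, 0, 7, 8]]
print(out["attention_mask"])  # [[1, 1, 1, 1], [0, 0, 1, 1]]
```

Padding on the left keeps the last real token of every sequence in the final column, which is convenient when generation continues from the end of the prompt.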

Usage Examples

Multimodal Understanding

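import torch
from transformers import AutoModelForCausalLM
from janus.models import VLChatProcessor
from janus.utils.io import load_pil_images

# Setup assumed by this example; the checkpoint name is illustrative, substitute your own.
model_path = "deepseek-ai/Janus-1.3B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()  # requires a CUDA device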

conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["./examples/image.png"],
    },
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# prepare_inputs now contains all tensors for model input
# input_ids: [1, T] with image tokens
# pixel_values: [1, 1, 3, 384, 384]
# images_seq_mask: [1, T] with True at 576 image positions

Related Pages

Implements Principle
