Implementation:Deepseek ai Janus VLChatProcessor Call
| Knowledge Sources | |
|---|---|
| Domains | NLP, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
Concrete tool, provided by the Janus VLChatProcessor, for tokenizing multimodal conversations, interleaving image tokens into the token sequence, and batching the result.
Description
The VLChatProcessor.__call__ method is the main entry point for preprocessing multimodal inputs. It delegates to process_one (which handles SFT formatting, tokenization, and image token insertion) and batchify (which left-pads and constructs batch tensors with all required masks).
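The left-padding step attributed to batchify can be illustrated with a minimal pure-Python sketch. The function name `left_pad_batch` and the `pad_id` argument are hypothetical and chosen for illustration only; the real method works on tensors and also builds the image masks, which this sketch omits.

```python
from typing import List, Tuple


def left_pad_batch(
    sequences: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token-ID lists to a common length and build attention masks.

    Hypothetical helper: it only mirrors the padding idea, not the Janus code.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes on the left so real tokens stay right-aligned.
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding positions, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad_batch([[5, 6, 7], [8]], pad_id=0)
print(ids)
print(mask)
```

Right-aligning the real tokens keeps the final position of every row a real token, which is generally convenient for autoregressive generation.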
Usage
Call the VLChatProcessor instance directly with conversation dicts and PIL images, passed as keyword arguments (the signature is keyword-only). Because force_batchify defaults to True, a single input is automatically wrapped into a batch of size 1.
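The call pattern described here can be sketched with a toy stand-in. `ToyProcessor` and the dict fields it returns are hypothetical and exist only to show the control flow (process one input, then optionally wrap it in a batch); the explicit error for passing both prompt and conversations is also an assumption about how the mutual exclusivity might be enforced.

```python
from typing import Any, Dict, List, Optional


class ToyProcessor:
    """Hypothetical stand-in that mimics only the call pattern, not Janus itself."""

    def process_one(
        self,
        prompt: Optional[str],
        conversations: Optional[List[Dict[str, Any]]],
        images: Optional[List[Any]],
    ) -> Dict[str, Any]:
        # prompt and conversations are alternatives; reject both at once.
        if prompt is not None and conversations is not None:
            raise ValueError("pass either prompt or conversations, not both")
        if prompt is not None:
            text = prompt
        else:
            text = "\n".join(m["content"] for m in (conversations or []))
        return {"sft_format": text, "num_images": len(images or [])}

    def batchify(self, items: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
        # Collect per-item fields into per-batch lists.
        return {
            "sft_format": [item["sft_format"] for item in items],
            "num_images": [item["num_images"] for item in items],
        }

    def __call__(
        self,
        *,
        prompt: Optional[str] = None,
        conversations: Optional[List[Dict[str, Any]]] = None,
        images: Optional[List[Any]] = None,
        force_batchify: bool = True,
    ) -> Dict[str, Any]:
        prepared = self.process_one(prompt, conversations, images)
        # With force_batchify=True a single input becomes a batch of one.
        return self.batchify([prepared]) if force_batchify else prepared


toy = ToyProcessor()
batched = toy(prompt="hello")
single = toy(prompt="hello", force_batchify=False)
```

In this toy version, `batched["sft_format"]` is a one-element list while `single["sft_format"]` is a plain string, which is the difference force_batchify controls.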
Code Reference
Source Location
- Repository: Janus
- File: janus/models/processing_vlm.py
- Lines: L322-355 (__call__), L260-320 (process_one), L357-418 (batchify)
Signature
class VLChatProcessor(ProcessorMixin):
    def __call__(
        self,
        *,
        prompt: str = None,
        conversations: List[Dict[str, str]] = None,
        images: List[Image] = None,
        force_batchify: bool = True,
        **kwargs,
    ) -> BatchedVLChatProcessorOutput:
        """
        Args:
            prompt: Pre-formatted prompt string (mutually exclusive with conversations)
            conversations: List of message dicts with "role", "content", optional "images"
            images: List of PIL images corresponding to <image_placeholder> tokens
            force_batchify: Whether to auto-batch a single input (default True)
        Returns:
            BatchedVLChatProcessorOutput with fields:
                input_ids [b, T], attention_mask [b, T],
                pixel_values [b, n_images, 3, H, W],
                images_seq_mask [b, T], images_emb_mask [b, n_images, n_tokens],
                sft_format [b]
        """
Import
from janus.models import VLChatProcessor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| conversations | List[Dict[str, str]] | Yes* | Message dicts with role, content, and optional images keys. Required unless prompt is supplied instead |
| images | List[PIL.Image.Image] | Yes* | PIL images matching <image_placeholder> tokens in content |
| prompt | str | No | Pre-formatted prompt string; an alternative to conversations and mutually exclusive with it |
| force_batchify | bool | No | Auto-batch single input (default True) |
Outputs
| Name | Type | Description |
|---|---|---|
| input_ids | torch.LongTensor [b, T] | Tokenized IDs with image tokens interleaved |
| attention_mask | torch.LongTensor [b, T] | 1 for real tokens, 0 for padding |
| pixel_values | torch.FloatTensor [b, n_images, 3, H, W] | Preprocessed image tensors |
| images_seq_mask | torch.BoolTensor [b, T] | True at image token positions |
| images_emb_mask | torch.BoolTensor [b, n_images, n_tokens] | True for valid image embeddings |
| sft_format | List[str] | Formatted prompt strings |
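The relationship between input_ids and images_seq_mask can be illustrated with a simplified sketch. The helper `expand_image_tokens` and its parameter names are hypothetical; it assumes each image placeholder is expanded into a fixed number of image-token positions, and it omits any begin/end marker tokens the real processor may insert around them.

```python
from typing import List, Tuple


def expand_image_tokens(
    token_ids: List[int], image_token_id: int, num_image_tokens: int
) -> Tuple[List[int], List[bool]]:
    """Expand each image placeholder into a run of image tokens.

    Returns the expanded IDs and a mask that is True at image positions.
    Hypothetical helper for illustration; not the Janus implementation.
    """
    expanded: List[int] = []
    seq_mask: List[bool] = []
    for tok in token_ids:
        if tok == image_token_id:
            # One placeholder becomes num_image_tokens image positions.
            expanded.extend([image_token_id] * num_image_tokens)
            seq_mask.extend([True] * num_image_tokens)
        else:
            expanded.append(tok)
            seq_mask.append(False)
    return expanded, seq_mask


# Token ID 99 stands in for the image placeholder in this toy example.
ids, mask = expand_image_tokens([1, 99, 2], image_token_id=99, num_image_tokens=3)
print(ids)
print(mask)
```

Under these assumptions, the number of True entries in the mask equals the number of images times the per-image token count, so the image embeddings can be scattered into exactly those positions.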
Usage Examples
Multimodal Understanding
from janus.utils.io import load_pil_images

# Assumes vl_chat_processor (a VLChatProcessor instance) and vl_gpt (the loaded
# Janus model) have already been created.
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe this image.",
        "images": ["./examples/image.png"],
    },
    {"role": "Assistant", "content": ""},
]

# Load the PIL images referenced by the "images" keys of the conversation.
pil_images = load_pil_images(conversation)

# Tokenize, insert image tokens, batch, and move the result to the model's device.
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# prepare_inputs now contains all tensors for model input
# input_ids: [1, T] with image tokens
# pixel_values: [1, 1, 3, 384, 384]
# images_seq_mask: [1, T] with True at 576 image positions