Implementation: DeepSeek-AI Janus prepare_inputs_embeds
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A concrete reference for the Janus MultiModalityCausalLM method that encodes images and fuses the resulting vision embeddings with text embeddings before generation.
Description
The MultiModalityCausalLM.prepare_inputs_embeds method takes tokenized input IDs and preprocessed pixel values, runs the vision model (a CLIPVisionTower wrapping a SigLIP ViT) and the MLP aligner, then replaces the image-token positions in the text embeddings with the projected vision features.
Internally, the method uses einops.rearrange to flatten the batch and per-image dimensions, and boolean masks to perform the embedding replacement.
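The masked-replacement step can be illustrated with a toy sketch (not the actual Janus source; the real method flattens with einops.rearrange, approximated here with reshape, and the function name fuse_embeddings is hypothetical):

```python
import torch

def fuse_embeddings(text_embeds, images_embeds, images_seq_mask, images_emb_mask):
    """Toy sketch of the masked replacement inside prepare_inputs_embeds.

    text_embeds:     [b, T, D]              token embeddings from the LLM
    images_embeds:   [b, n_images, t, D]    aligner-projected vision features
    images_seq_mask: [b, T] bool            True at image-token positions
    images_emb_mask: [b, n_images, t] bool  True for valid vision embeddings
    """
    b, n, t, d = images_embeds.shape
    # the real method uses einops.rearrange("b n t d -> b (n t) d") here
    images_embeds = images_embeds.reshape(b, n * t, d)
    images_emb_mask = images_emb_mask.reshape(b, n * t)

    fused = text_embeds.clone()
    # both boolean masks select the same number of positions,
    # so each image-token slot receives one projected vision feature
    fused[images_seq_mask] = images_embeds[images_emb_mask]
    return fused
```

The same pattern generalizes to padded batches: images_emb_mask filters out padding embeddings so the counts on both sides of the assignment still match.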
Usage
Call this method after obtaining the BatchedVLChatProcessorOutput from the VLChatProcessor. The returned inputs_embeds tensor is passed directly to language_model.generate() for text generation.
Code Reference
Source Location
- Repository: Janus
- File: janus/models/modeling_vlm.py
- Lines: L221-260
Signature
class MultiModalityCausalLM(MultiModalityPreTrainedModel):
    def prepare_inputs_embeds(
        self,
        input_ids: torch.LongTensor,        # [b, T]
        pixel_values: torch.FloatTensor,    # [b, n_images, 3, h, w]
        images_seq_mask: torch.LongTensor,  # [b, T]
        images_emb_mask: torch.LongTensor,  # [b, n_images, n_image_tokens]
        **kwargs,
    ) -> torch.Tensor:
        """
        Encode images and fuse with text embeddings.

        Returns:
            inputs_embeds: torch.Tensor [b, T, D], the fused embeddings
        """
Import
from janus.models import MultiModalityCausalLM
# Method is called on the model instance: vl_gpt.prepare_inputs_embeds(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor [b, T] | Yes | Tokenized input IDs with image placeholder tokens |
| pixel_values | torch.FloatTensor [b, n_images, 3, h, w] | Yes | Preprocessed image tensors |
| images_seq_mask | torch.BoolTensor [b, T] | Yes | True at positions corresponding to image tokens |
| images_emb_mask | torch.BoolTensor [b, n_images, n_tokens] | Yes | True for valid image embedding positions |
Outputs
| Name | Type | Description |
|---|---|---|
| inputs_embeds | torch.Tensor [b, T, D] | Fused text + vision embeddings ready for the language model |
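The two masks must be mutually consistent: the number of True entries in images_seq_mask has to equal the number of valid vision embeddings selected by images_emb_mask, since the replacement assigns exactly one projected feature per masked sequence position. A quick sanity check on synthetic masks (illustrative shapes and values only, not taken from the Janus source):

```python
import torch

b, T = 1, 16                # batch size, sequence length
n_images, n_tokens = 1, 9   # images per sample, vision tokens per image

images_seq_mask = torch.zeros(b, T, dtype=torch.bool)
images_seq_mask[0, 3:12] = True   # 9 image-token slots in the sequence
images_emb_mask = torch.ones(b, n_images, n_tokens, dtype=torch.bool)

# one vision embedding per image-token slot, or the masked assignment fails
assert images_seq_mask.sum().item() == images_emb_mask.sum().item()
```

If the counts disagree, the boolean-mask assignment inside the method raises a shape-mismatch error, which is the most common failure mode when placeholder tokens and processor output drift out of sync.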
Usage Examples
Full Understanding Pipeline
from janus.utils.io import load_pretrained_model, load_pil_images
tokenizer, vl_chat_processor, vl_gpt = load_pretrained_model("deepseek-ai/Janus-1.3B")
conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe this.", "images": ["img.png"]},
    {"role": "Assistant", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# Encode images and fuse embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# inputs_embeds shape: [1, T, 2048]
# Image regions now contain SigLIP features projected to LLM dimension
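From here the fused embeddings feed the language model directly. A sketch of the remaining generation step, continuing the example above (mirrors the repository's inference script; requires the loaded model, tokenizer, and prepare_inputs from the pipeline above, so it is not self-contained):

```python
# Generate from the fused embeddings instead of raw input IDs
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
```

Note that when generate() receives inputs_embeds rather than input_ids, the returned sequence contains only newly generated tokens, so the decode call yields just the model's answer.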