
Implementation:Deepseek ai Janus Prepare Inputs Embeds

From Leeroopedia


Knowledge Sources
Domains: Computer_Vision, Multimodal_AI
Last Updated: 2026-02-10 09:30 GMT

Overview

A concrete tool, provided by the Janus MultiModalityCausalLM model, for encoding images and fusing the resulting vision embeddings with text embeddings.

Description

The MultiModalityCausalLM.prepare_inputs_embeds method takes tokenized input IDs and preprocessed pixel values, runs the vision model (CLIPVisionTower wrapping a SigLIP ViT) and the MLP aligner, then replaces the image-token positions in the text embeddings with the projected vision features.

Internally, the method uses einops.rearrange to handle batch and image dimensions and boolean masks to perform the embedding replacement.
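The rearrange-and-mask mechanics can be sketched with toy tensors. This is an illustrative mock, not the actual Janus implementation: the dimensions are arbitrary, the vision tower and aligner are replaced by a random tensor, and plain reshape stands in for einops.rearrange (which performs the same batch/image-dimension collapse):

```python
import torch

# Toy dimensions (hypothetical, for illustration only)
b, n_images, n_image_tokens, T, D = 2, 1, 4, 10, 8

# Stand-in for the output of the SigLIP vision tower + MLP aligner:
# one D-dim feature per image token, already projected to the LLM dimension
vision_out = torch.randn(b * n_images, n_image_tokens, D)

# Stand-in for the text embeddings looked up from input_ids
inputs_embeds = torch.randn(b, T, D)

# images_seq_mask: True where the text sequence holds an image placeholder token
images_seq_mask = torch.zeros(b, T, dtype=torch.bool)
images_seq_mask[:, 1:1 + n_images * n_image_tokens] = True

# images_emb_mask: True for valid image-embedding slots (all valid here)
images_emb_mask = torch.ones(b, n_images, n_image_tokens, dtype=torch.bool)

# Collapse (b, n_images) into one dimension
# (the real method uses einops.rearrange for this step)
images_embeds = vision_out.reshape(b, n_images * n_image_tokens, D)
valid = images_emb_mask.reshape(b, n_images * n_image_tokens)

# Boolean-mask replacement: write vision features into the image positions
inputs_embeds[images_seq_mask] = images_embeds[valid]
```

After the assignment, the masked positions of inputs_embeds hold the vision features in order, while all other positions keep their text embeddings.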

Usage

Call this method after obtaining the BatchedVLChatProcessorOutput from the VLChatProcessor. The returned inputs_embeds tensor is passed directly to language_model.generate() for text generation.

Code Reference

Source Location

  • Repository: Janus
  • File: janus/models/modeling_vlm.py
  • Lines: L221-L260

Signature

class MultiModalityCausalLM(MultiModalityPreTrainedModel):
    def prepare_inputs_embeds(
        self,
        input_ids: torch.LongTensor,       # [b, T]
        pixel_values: torch.FloatTensor,    # [b, n_images, 3, h, w]
        images_seq_mask: torch.LongTensor,  # [b, T]
        images_emb_mask: torch.LongTensor,  # [b, n_images, n_image_tokens]
        **kwargs,
    ) -> torch.Tensor:
        """
        Encode images and fuse with text embeddings.

        Returns:
            inputs_embeds: torch.Tensor [b, T, D] — fused embeddings
        """

Import

from janus.models import MultiModalityCausalLM
# Method is called on the model instance: vl_gpt.prepare_inputs_embeds(...)

I/O Contract

Inputs

  • input_ids (torch.LongTensor [b, T], required): Tokenized input IDs with image placeholder tokens
  • pixel_values (torch.FloatTensor [b, n_images, 3, h, w], required): Preprocessed image tensors
  • images_seq_mask (torch.BoolTensor [b, T], required): True at positions corresponding to image tokens
  • images_emb_mask (torch.BoolTensor [b, n_images, n_image_tokens], required): True for valid image embedding positions

Outputs

  • inputs_embeds (torch.Tensor [b, T, D]): Fused text + vision embeddings ready for the language model

Usage Examples

Full Understanding Pipeline

from janus.utils.io import load_pretrained_model, load_pil_images

tokenizer, vl_chat_processor, vl_gpt = load_pretrained_model("deepseek-ai/Janus-1.3B")

conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe this.", "images": ["img.png"]},
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# Encode images and fuse embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# inputs_embeds shape: [1, T, 2048]
# Image regions now contain SigLIP features projected to LLM dimension
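The fused embeddings are then fed to the language model. A hedged continuation of the pipeline above, following the generation pattern shown in the official Janus README (the exact sampling arguments here are illustrative defaults, not prescribed values):

```python
# Generate text conditioned on the fused embeddings
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
```

Note that because generation starts from inputs_embeds rather than input_ids, the returned sequence contains only newly generated tokens.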

