Implementation: DeepSeek-AI Janus prepare_inputs_embeds
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A concrete reference for the Janus MultiModalityCausalLM method that encodes images and fuses the resulting vision embeddings with text embeddings before generation.
Description
The MultiModalityCausalLM.prepare_inputs_embeds method takes tokenized input IDs and preprocessed pixel values, runs the vision model (a CLIPVisionTower wrapping a SigLIP ViT) and the MLP aligner, then replaces the image-token positions in the text embeddings with the projected vision features.
Internally, the method uses einops.rearrange to flatten the batch and per-image dimensions, and boolean masks to perform the embedding replacement.
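The masked-replacement step can be illustrated with a toy sketch (not the actual Janus source; the real method flattens with einops.rearrange, approximated here with reshape, and the function name fuse_embeddings is hypothetical):

```python
import torch

def fuse_embeddings(text_embeds, images_embeds, images_seq_mask, images_emb_mask):
    """Toy sketch of the masked replacement inside prepare_inputs_embeds.

    text_embeds:     [b, T, D]              token embeddings from the LLM
    images_embeds:   [b, n_images, t, D]    aligner-projected vision features
    images_seq_mask: [b, T] bool            True at image-token positions
    images_emb_mask: [b, n_images, t] bool  True for valid vision embeddings
    """
    b, n, t, d = images_embeds.shape
    # the real method uses einops.rearrange("b n t d -> b (n t) d") here
    images_embeds = images_embeds.reshape(b, n * t, d)
    images_emb_mask = images_emb_mask.reshape(b, n * t)

    fused = text_embeds.clone()
    # both boolean masks select the same number of positions,
    # so each image-token slot receives one projected vision feature
    fused[images_seq_mask] = images_embeds[images_emb_mask]
    return fused
```

The same pattern generalizes to padded batches: images_emb_mask filters out padding embeddings so the counts on both sides of the assignment still match.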
Usage
Call this method after obtaining the BatchedVLChatProcessorOutput from the VLChatProcessor. The returned inputs_embeds tensor is passed directly to language_model.generate() for text generation.
Code Reference
Source Location
- Repository: Janus
- File: janus/models/modeling_vlm.py
- Lines: L221-260
Signature
class MultiModalityCausalLM(MultiModalityPreTrainedModel):
    def prepare_inputs_embeds(
        self,
        input_ids: torch.LongTensor,        # [b, T]
        pixel_values: torch.FloatTensor,    # [b, n_images, 3, h, w]
        images_seq_mask: torch.LongTensor,  # [b, T]
        images_emb_mask: torch.LongTensor,  # [b, n_images, n_image_tokens]
        **kwargs,
    ) -> torch.Tensor:
        """
        Encode images and fuse with text embeddings.

        Returns:
            inputs_embeds: torch.Tensor [b, T, D], the fused embeddings
        """
Import
from janus.models import MultiModalityCausalLM
# Method is called on the model instance: vl_gpt.prepare_inputs_embeds(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor [b, T] | Yes | Tokenized input IDs with image placeholder tokens |
| pixel_values | torch.FloatTensor [b, n_images, 3, h, w] | Yes | Preprocessed image tensors |
| images_seq_mask | torch.BoolTensor [b, T] | Yes | True at positions corresponding to image tokens |
| images_emb_mask | torch.BoolTensor [b, n_images, n_tokens] | Yes | True for valid image embedding positions |
Outputs
| Name | Type | Description |
|---|---|---|
| inputs_embeds | torch.Tensor [b, T, D] | Fused text + vision embeddings ready for the language model |
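The two masks must be mutually consistent: the number of True entries in images_seq_mask has to equal the number of valid vision embeddings selected by images_emb_mask, since the replacement assigns exactly one projected feature per masked sequence position. A quick sanity check on synthetic masks (illustrative shapes and values only, not taken from the Janus source):

```python
import torch

b, T = 1, 16                # batch size, sequence length
n_images, n_tokens = 1, 9   # images per sample, vision tokens per image

images_seq_mask = torch.zeros(b, T, dtype=torch.bool)
images_seq_mask[0, 3:12] = True   # 9 image-token slots in the sequence
images_emb_mask = torch.ones(b, n_images, n_tokens, dtype=torch.bool)

# one vision embedding per image-token slot, or the masked assignment fails
assert images_seq_mask.sum().item() == images_emb_mask.sum().item()
```

If the counts disagree, the boolean-mask assignment inside the method raises a shape-mismatch error, which is the most common failure mode when placeholder tokens and processor output drift out of sync.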
Usage Examples
Full Understanding Pipeline
from janus.utils.io import load_pretrained_model, load_pil_images
tokenizer, vl_chat_processor, vl_gpt = load_pretrained_model("deepseek-ai/Janus-1.3B")
conversation = [
    {"role": "User", "content": "<image_placeholder>\nDescribe this.", "images": ["img.png"]},
    {"role": "Assistant", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# Encode images and fuse embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# inputs_embeds shape: [1, T, 2048]
# Image regions now contain SigLIP features projected to LLM dimension
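From here the fused embeddings feed the language model directly. A sketch of the remaining generation step, continuing the example above (mirrors the repository's inference script; requires the loaded model, tokenizer, and prepare_inputs from the pipeline above, so it is not self-contained):

```python
# Generate from the fused embeddings instead of raw input IDs
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
```

Note that when generate() receives inputs_embeds rather than input_ids, the returned sequence contains only newly generated tokens, so the decode call yields just the model's answer.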