Workflow: DeepSeek AI Janus Multimodal Understanding
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_AI, Vision_Language_Models, Inference |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
End-to-end process for performing visual question answering and image understanding using Janus-series unified multimodal models.
Description
This workflow covers the standard procedure for running multimodal understanding inference with any Janus-series model (Janus, Janus-Pro, or JanusFlow). Given an image and a natural language question, the pipeline produces a text answer by encoding the image through a dedicated understanding vision encoder (SigLIP ViT), projecting those embeddings into the language model's space, interleaving them with text token embeddings, and running autoregressive text generation via the shared LlamaForCausalLM backbone.
Key characteristics:
- Uses a decoupled visual encoding pathway specifically optimized for understanding tasks
- Supports single or multi-image inputs with flexible prompt formatting
- Works identically across Janus-1.3B, Janus-Pro-1B, Janus-Pro-7B, and JanusFlow-1.3B
Usage
Execute this workflow when you have one or more images and need the model to answer questions about them, describe visual content, extract information (e.g., OCR, formula recognition), or perform any vision-language understanding task. This is the primary inference path for consuming visual inputs and producing text outputs.
Execution Steps
Step 1: Model and Processor Loading
Load the pretrained multimodal model and its associated chat processor from a HuggingFace model path. The chat processor bundles the tokenizer and image processor. The model is cast to bfloat16 precision, moved to GPU, and set to evaluation mode to disable dropout and gradient tracking.
Key considerations:
- Use the matching import path for the model variant: `janus.models` for Janus/Janus-Pro, `janus.janusflow.models` for JanusFlow
- The `trust_remote_code=True` flag is required when loading via `AutoModelForCausalLM`
- For JanusFlow, the model can be loaded directly via its own `MultiModalityCausalLM.from_pretrained` without the Auto class
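The loading step can be sketched as a small helper, following the style of the official Janus README. The model path `deepseek-ai/Janus-Pro-7B` is one published checkpoint; substitute any Janus-series path, and for JanusFlow swap the import to `janus.janusflow.models`.

```python
def load_janus(model_path: str = "deepseek-ai/Janus-Pro-7B"):
    """Load a Janus-series model and its chat processor (sketch).

    Assumes the official `janus` package and a CUDA GPU are available.
    For JanusFlow, import MultiModalityCausalLM from
    `janus.janusflow.models` and call its from_pretrained directly.
    """
    import torch
    from transformers import AutoModelForCausalLM
    from janus.models import VLChatProcessor

    # The chat processor bundles the tokenizer and the image processor.
    processor = VLChatProcessor.from_pretrained(model_path)

    # trust_remote_code is required when loading through the Auto class.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True
    )
    # bfloat16 precision, GPU placement, eval mode (no dropout/grads).
    model = model.to(torch.bfloat16).cuda().eval()
    return model, processor
```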
Step 2: Conversation Formatting
Structure the input as a conversation list with User and Assistant roles. The user message includes an `<image_placeholder>` token at the position where image embeddings will be injected, followed by the natural language question. The assistant message is left empty to signal the model should generate a response.
Key considerations:
- Each image in the input must have a corresponding `<image_placeholder>` in the content string
- Multiple images can be referenced in a single turn using multiple placeholders
- The conversation format follows DeepSeek's SFT template style with role tags
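A conversation for a single-image question looks like the following sketch. The image path is hypothetical; note that Janus-Pro checkpoints use the `<|User|>`/`<|Assistant|>` role tags shown here, while earlier Janus checkpoints use plain `User`/`Assistant`.

```python
question = "What is shown in this image?"

conversation = [
    {
        "role": "<|User|>",
        # One <image_placeholder> per image, followed by the question.
        "content": f"<image_placeholder>\n{question}",
        "images": ["./images/example.png"],  # hypothetical path
    },
    # Empty assistant turn signals the model to generate the answer.
    {"role": "<|Assistant|>", "content": ""},
]

# Invariant: placeholder count matches the number of attached images.
assert (
    conversation[0]["content"].count("<image_placeholder>")
    == len(conversation[0]["images"])
)
```

For multi-image turns, repeat the placeholder once per image and extend the `images` list to match.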
Step 3: Image Loading and Preprocessing
Load images from file paths or base64-encoded strings into PIL Image objects, then pass them through the VLMImageProcessor. The processor resizes images to the model's expected resolution, applies square padding, rescaling, and normalization to produce pixel value tensors.
Key considerations:
- Images are converted to RGB format regardless of input format
- The image processor handles resize, square padding, rescaling, and normalization in a single pass
- Both file paths and base64 data URIs are supported as image sources
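The repository ships a `load_pil_images` helper for this step; a simplified stand-alone version that accepts either source type might look like this (the data-URI handling is a sketch, not the library's exact implementation):

```python
import base64
import io

from PIL import Image


def load_image(source: str) -> Image.Image:
    """Load a PIL image from a file path or a base64 data URI (sketch)."""
    if source.startswith("data:image"):
        # "data:image/png;base64,<payload>" -> decode the payload.
        _, b64_payload = source.split(",", 1)
        img = Image.open(io.BytesIO(base64.b64decode(b64_payload)))
    else:
        img = Image.open(source)
    # Convert to RGB regardless of the input mode (RGBA, L, P, ...).
    return img.convert("RGB")
```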
Step 4: Input Tokenization and Batching
The VLChatProcessor applies the SFT conversation template, tokenizes the text, inserts special image boundary tokens (begin_of_image, image_placeholder tokens, end_of_image) at each placeholder position, and creates attention and embedding masks. The `force_batchify` option pads sequences and creates proper batch tensors for the model.
Key considerations:
- Image token sequences replace each `<image_placeholder>` with 576 image tokens (default)
- Left-padding is used for batch alignment
- The processor generates `images_seq_mask` and `images_emb_mask` that control where visual embeddings are injected
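The placeholder expansion and left-padding logic can be illustrated with a pure-Python toy; the real implementation lives inside VLChatProcessor and operates on token IDs, and the boundary token names below mirror the specials described above.

```python
NUM_IMAGE_TOKENS = 576  # default per-image token count


def expand_placeholders(tokens, num_image_tokens=NUM_IMAGE_TOKENS):
    """Replace each <image_placeholder> with the full image token span."""
    out = []
    for tok in tokens:
        if tok == "<image_placeholder>":
            out += (
                ["<begin_of_image>"]
                + ["<image_placeholder>"] * num_image_tokens
                + ["<end_of_image>"]
            )
        else:
            out.append(tok)
    return out


def left_pad(batch, pad_token="<pad>"):
    """Left-pad all sequences in a batch to the longest length."""
    width = max(len(seq) for seq in batch)
    return [[pad_token] * (width - len(seq)) + seq for seq in batch]
```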
Step 5: Vision Encoding and Embedding Fusion
Run the model's `prepare_inputs_embeds` method which:
- Passes pixel values through the understanding vision encoder (SigLIP ViT wrapped in CLIPVisionTower)
- Projects vision features through an alignment layer (MLP projector for Janus/Pro, linear aligner for JanusFlow) to match the language model's embedding dimension
- Retrieves text token embeddings from the language model's embedding layer
- Replaces the image placeholder positions in the text embedding sequence with the projected visual embeddings
What happens:
- The vision encoder produces a sequence of patch embeddings from each image
- These embeddings are aligned to the LLM's hidden dimension via a learned projector
- The result is a unified embedding sequence where visual and textual information coexist
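The final replacement step behaves like a masked scatter: positions flagged by `images_seq_mask` are overwritten, in order, with the projected visual embeddings. A toy sketch of that logic (the real method works on batched tensors):

```python
def fuse_embeddings(text_embeds, image_embeds, images_seq_mask):
    """Overwrite masked positions with projected visual embeddings (toy)."""
    fused = list(text_embeds)
    img_iter = iter(image_embeds)
    for i, is_image_pos in enumerate(images_seq_mask):
        if is_image_pos:
            fused[i] = next(img_iter)
    return fused
```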
Step 6: Autoregressive Text Generation
Feed the fused input embeddings into the LlamaForCausalLM backbone's `generate` method to produce output text tokens autoregressively. The generation uses the attention mask from the batching step and standard generation parameters (max_new_tokens, sampling strategy, KV-cache).
Key considerations:
- Greedy decoding (`do_sample=False`) is the default for deterministic answers
- For more creative responses, enable sampling with temperature and top_p
- KV-cache is used to accelerate sequential token generation
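Steps 5 and 6 together can be sketched as one helper in the style of the official README, taking the loaded model, its tokenizer, and the batched `prepare_inputs` from the processor:

```python
def answer_question(vl_gpt, tokenizer, prepare_inputs, max_new_tokens=512):
    """Fuse embeddings and generate answer tokens (sketch)."""
    # Step 5: vision encoding + projection + embedding fusion.
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    # Step 6: autoregressive generation via the Llama backbone.
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy; set True + temperature/top_p to sample
        use_cache=True,   # KV-cache for fast sequential decoding
    )
    return outputs
```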
Step 7: Response Decoding
Decode the generated token IDs back into a human-readable text string using the tokenizer, with special tokens stripped. The decoded text is the model's answer to the visual question.
Key considerations:
- Use `skip_special_tokens=True` to remove control tokens from the output
- The SFT format prefix can be printed alongside the answer for context
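The decoding step is a single tokenizer call; a minimal sketch for a batch-of-one output:

```python
def decode_answer(tokenizer, outputs):
    """Turn generated token IDs into the answer string (sketch)."""
    # skip_special_tokens strips control tokens (eos, image boundaries,
    # padding) so only the human-readable answer remains.
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```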