Workflow: DeepSeek AI Janus Multimodal Understanding
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_AI, Vision_Language_Models, Inference |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
End-to-end process for performing visual question answering and image understanding using Janus-series unified multimodal models.
Description
This workflow covers the standard procedure for running multimodal understanding inference with any Janus-series model (Janus, Janus-Pro, or JanusFlow). Given an image and a natural language question, the pipeline produces a text answer by encoding the image through a dedicated understanding vision encoder (SigLIP ViT), projecting those embeddings into the language model's space, interleaving them with text token embeddings, and running autoregressive text generation via the shared LlamaForCausalLM backbone.
Key characteristics:
- Uses a decoupled visual encoding pathway specifically optimized for understanding tasks
- Supports single or multi-image inputs with flexible prompt formatting
- Works identically across Janus-1.3B, Janus-Pro-1B, Janus-Pro-7B, and JanusFlow-1.3B
Usage
Execute this workflow when you have one or more images and need the model to answer questions about them, describe visual content, extract information (e.g., OCR, formula recognition), or perform any vision-language understanding task. This is the primary inference path for consuming visual inputs and producing text outputs.
Execution Steps
Step 1: Model and Processor Loading
Load the pretrained multimodal model and its associated chat processor from a HuggingFace model path. The chat processor bundles the tokenizer and image processor. The model is cast to bfloat16 precision, moved to GPU, and set to evaluation mode to disable dropout and gradient tracking.
Key considerations:
- Use the matching import path for the model variant: `janus.models` for Janus/Janus-Pro, `janus.janusflow.models` for JanusFlow
- The `trust_remote_code=True` flag is required when loading via `AutoModelForCausalLM`
- For JanusFlow, the model can be loaded directly via its own `MultiModalityCausalLM.from_pretrained` without the Auto class
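The loading step can be sketched as a small helper, following the style of the official Janus README. The model path `deepseek-ai/Janus-Pro-7B` is one published checkpoint; substitute any Janus-series path, and for JanusFlow swap the import to `janus.janusflow.models`.

```python
def load_janus(model_path: str = "deepseek-ai/Janus-Pro-7B"):
    """Load a Janus-series model and its chat processor (sketch).

    Assumes the official `janus` package and a CUDA GPU are available.
    For JanusFlow, import MultiModalityCausalLM from
    `janus.janusflow.models` and call its from_pretrained directly.
    """
    import torch
    from transformers import AutoModelForCausalLM
    from janus.models import VLChatProcessor

    # The chat processor bundles the tokenizer and the image processor.
    processor = VLChatProcessor.from_pretrained(model_path)

    # trust_remote_code is required when loading through the Auto class.
    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True
    )
    # bfloat16 precision, GPU placement, eval mode (no dropout/grads).
    model = model.to(torch.bfloat16).cuda().eval()
    return model, processor
```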
Step 2: Conversation Formatting
Structure the input as a conversation list with User and Assistant roles. The user message includes an `<image_placeholder>` token at the position where image embeddings will be injected, followed by the natural language question. The assistant message is left empty to signal the model should generate a response.
Key considerations:
- Each image in the input must have a corresponding `<image_placeholder>` in the content string
- Multiple images can be referenced in a single turn using multiple placeholders
- The conversation format follows DeepSeek's SFT template style with role tags
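A conversation for a single-image question looks like the following sketch. The image path is hypothetical; note that Janus-Pro checkpoints use the `<|User|>`/`<|Assistant|>` role tags shown here, while earlier Janus checkpoints use plain `User`/`Assistant`.

```python
question = "What is shown in this image?"

conversation = [
    {
        "role": "<|User|>",
        # One <image_placeholder> per image, followed by the question.
        "content": f"<image_placeholder>\n{question}",
        "images": ["./images/example.png"],  # hypothetical path
    },
    # Empty assistant turn signals the model to generate the answer.
    {"role": "<|Assistant|>", "content": ""},
]

# Invariant: placeholder count matches the number of attached images.
assert (
    conversation[0]["content"].count("<image_placeholder>")
    == len(conversation[0]["images"])
)
```

For multi-image turns, repeat the placeholder once per image and extend the `images` list to match.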
Step 3: Image Loading and Preprocessing
Load images from file paths or base64-encoded strings into PIL Image objects, then pass them through the VLMImageProcessor. The processor resizes images to the model's expected resolution, applies square padding, rescaling, and normalization to produce pixel value tensors.
Key considerations:
- Images are converted to RGB format regardless of input format
- The image processor handles resize, square padding, rescaling, and normalization in a single pass
- Both file paths and base64 data URIs are supported as image sources
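The repository ships a `load_pil_images` helper for this step; a simplified stand-alone version that accepts either source type might look like this (the data-URI handling is a sketch, not the library's exact implementation):

```python
import base64
import io

from PIL import Image


def load_image(source: str) -> Image.Image:
    """Load a PIL image from a file path or a base64 data URI (sketch)."""
    if source.startswith("data:image"):
        # "data:image/png;base64,<payload>" -> decode the payload.
        _, b64_payload = source.split(",", 1)
        img = Image.open(io.BytesIO(base64.b64decode(b64_payload)))
    else:
        img = Image.open(source)
    # Convert to RGB regardless of the input mode (RGBA, L, P, ...).
    return img.convert("RGB")
```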
Step 4: Input Tokenization and Batching
The VLChatProcessor applies the SFT conversation template, tokenizes the text, inserts special image boundary tokens (begin_of_image, image_placeholder tokens, end_of_image) at each placeholder position, and creates attention and embedding masks. The `force_batchify` option pads sequences and creates proper batch tensors for the model.
Key considerations:
- Image token sequences replace each `<image_placeholder>` with 576 image tokens (default)
- Left-padding is used for batch alignment
- The processor generates `images_seq_mask` and `images_emb_mask` that control where visual embeddings are injected
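The placeholder expansion and left-padding logic can be illustrated with a pure-Python toy; the real implementation lives inside VLChatProcessor and operates on token IDs, and the boundary token names below mirror the specials described above.

```python
NUM_IMAGE_TOKENS = 576  # default per-image token count


def expand_placeholders(tokens, num_image_tokens=NUM_IMAGE_TOKENS):
    """Replace each <image_placeholder> with the full image token span."""
    out = []
    for tok in tokens:
        if tok == "<image_placeholder>":
            out += (
                ["<begin_of_image>"]
                + ["<image_placeholder>"] * num_image_tokens
                + ["<end_of_image>"]
            )
        else:
            out.append(tok)
    return out


def left_pad(batch, pad_token="<pad>"):
    """Left-pad all sequences in a batch to the longest length."""
    width = max(len(seq) for seq in batch)
    return [[pad_token] * (width - len(seq)) + seq for seq in batch]
```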
Step 5: Vision Encoding and Embedding Fusion
Run the model's `prepare_inputs_embeds` method which:
- Passes pixel values through the understanding vision encoder (SigLIP ViT wrapped in CLIPVisionTower)
- Projects vision features through an alignment layer (MLP projector for Janus/Pro, linear aligner for JanusFlow) to match the language model's embedding dimension
- Retrieves text token embeddings from the language model's embedding layer
- Replaces the image placeholder positions in the text embedding sequence with the projected visual embeddings
What happens:
- The vision encoder produces a sequence of patch embeddings from each image
- These embeddings are aligned to the LLM's hidden dimension via a learned projector
- The result is a unified embedding sequence where visual and textual information coexist
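The final replacement step behaves like a masked scatter: positions flagged by `images_seq_mask` are overwritten, in order, with the projected visual embeddings. A toy sketch of that logic (the real method works on batched tensors):

```python
def fuse_embeddings(text_embeds, image_embeds, images_seq_mask):
    """Overwrite masked positions with projected visual embeddings (toy)."""
    fused = list(text_embeds)
    img_iter = iter(image_embeds)
    for i, is_image_pos in enumerate(images_seq_mask):
        if is_image_pos:
            fused[i] = next(img_iter)
    return fused
```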
Step 6: Autoregressive Text Generation
Feed the fused input embeddings into the LlamaForCausalLM backbone's `generate` method to produce output text tokens autoregressively. The generation uses the attention mask from the batching step and standard generation parameters (max_new_tokens, sampling strategy, KV-cache).
Key considerations:
- Greedy decoding (`do_sample=False`) is the default for deterministic answers
- For more creative responses, enable sampling with temperature and top_p
- KV-cache is used to accelerate sequential token generation
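Steps 5 and 6 together can be sketched as one helper in the style of the official README, taking the loaded model, its tokenizer, and the batched `prepare_inputs` from the processor:

```python
def answer_question(vl_gpt, tokenizer, prepare_inputs, max_new_tokens=512):
    """Fuse embeddings and generate answer tokens (sketch)."""
    # Step 5: vision encoding + projection + embedding fusion.
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    # Step 6: autoregressive generation via the Llama backbone.
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy; set True + temperature/top_p to sample
        use_cache=True,   # KV-cache for fast sequential decoding
    )
    return outputs
```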
Step 7: Response Decoding
Decode the generated token IDs back into a human-readable text string using the tokenizer, with special tokens stripped. The decoded text is the model's answer to the visual question.
Key considerations:
- Use `skip_special_tokens=True` to remove control tokens from the output
- The SFT format prefix can be printed alongside the answer for context
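The decoding step is a single tokenizer call; a minimal sketch for a batch-of-one output:

```python
def decode_answer(tokenizer, outputs):
    """Turn generated token IDs into the answer string (sketch)."""
    # skip_special_tokens strips control tokens (eos, image boundaries,
    # padding) so only the human-readable answer remains.
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```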