Principle: DeepSeek-AI Janus Vision Encoding and Embedding Fusion
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A mechanism for encoding images through a vision transformer, projecting the resulting features into the language model's embedding space, and fusing them with text token embeddings at designated positions.
Description
Vision encoding and embedding fusion is the core step that enables the language model to "see" images. It processes pixel values through a dedicated vision encoder (SigLIP ViT in Janus), projects the vision features to match the language model's hidden dimension via an MLP aligner, and then replaces the placeholder image tokens in the text embedding sequence with the actual vision embeddings.
This follows the decoupled visual encoding principle of Janus: the understanding path uses a SigLIP ViT encoder specialized for visual perception, which is separate from the generation encoder. The aligner (MLP projector) bridges the dimension gap between the vision encoder output and the language model's hidden size.
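The aligner described above can be sketched as a small two-layer MLP that maps vision-encoder features to the language model's hidden size. This is a minimal numpy sketch, not the Janus implementation: the dimensions (1024 → 2048), the random weights, and the ReLU activation are illustrative assumptions (the real projector is a learned module, typically with a GELU activation).

```python
import numpy as np

# Illustrative dimensions (assumptions, not Janus's actual sizes):
# D_VIS  - SigLIP patch feature width
# D_LANG - language model hidden size
D_VIS, D_LANG = 1024, 2048

rng = np.random.default_rng(0)

# Random weights stand in for the learned aligner parameters.
W1 = rng.standard_normal((D_VIS, D_LANG)) * 0.02
b1 = np.zeros(D_LANG)
W2 = rng.standard_normal((D_LANG, D_LANG)) * 0.02
b2 = np.zeros(D_LANG)

def aligner(patch_features: np.ndarray) -> np.ndarray:
    """Project (N, D_vis) vision features to (N, D_lang)."""
    h = np.maximum(patch_features @ W1 + b1, 0.0)  # ReLU here; GELU in practice
    return h @ W2 + b2

patches = rng.standard_normal((576, D_VIS))  # e.g. a 24x24 patch grid
projected = aligner(patches)
print(projected.shape)  # (576, 2048)
```

After this projection, each image patch embedding has the same width as a text token embedding, which is what makes the replacement step possible.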
Usage
Use this principle in the multimodal understanding pipeline, after tokenization/batching and before autoregressive text generation. It converts the raw pixel values and token IDs into a single unified embedding sequence that the language model can process.
Theoretical Basis
The fusion pipeline operates in four steps:
- Vision encoding: Images are passed through the SigLIP ViT to produce patch-level features
- $E_{patch} = \text{SigLIP}(\text{pixel\_values}) \in \mathbb{R}^{b \times N \times D_{vis}}$
- Alignment projection: An MLP projector maps vision features to the language model's hidden dimension
- $E_{vis} = \text{MLP}(E_{patch}) \in \mathbb{R}^{b \times N \times D_{lang}}$
- Text embedding: Token IDs are converted to text embeddings via the LLM's embedding layer
- $E_{text} = \text{Embed}(\text{input\_ids}) \in \mathbb{R}^{b \times T \times D_{lang}}$
- Fusion: Vision embeddings replace text embeddings at image token positions using boolean masks
- $E_{text}[\text{images\_seq\_mask}] = E_{vis}[\text{images\_emb\_mask}]$
The result is a single embedding tensor where image regions contain vision features and text regions contain token embeddings.
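The mask-based fusion step above can be demonstrated with toy tensors. This is a hedged numpy sketch with invented sizes (batch 1, sequence length 10, 4 image tokens, hidden dimension 8); the mask names mirror the equation, but the values are placeholders, not Janus's real shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
b, T, N, D = 1, 10, 4, 8  # toy batch, seq len, image tokens, hidden dim

# Text embeddings for the full sequence, including placeholder image tokens.
E_text = rng.standard_normal((b, T, D))

# Vision embeddings already projected to the language dimension by the aligner.
E_vis = rng.standard_normal((b, N, D))

# Boolean masks: where the image placeholders sit in the text sequence,
# and which vision embeddings are valid (all of them in this toy case).
images_seq_mask = np.zeros((b, T), dtype=bool)
images_seq_mask[0, 3:7] = True           # 4 placeholder positions
images_emb_mask = np.ones((b, N), dtype=bool)

# Fusion: scatter vision embeddings into the placeholder slots.
E_text[images_seq_mask] = E_vis[images_emb_mask]

# Positions 3..6 now hold vision features; all other positions are unchanged text embeddings.
```

In a real pipeline the same operation is done on framework tensors (e.g. an in-place masked assignment), and the fused `E_text` is what gets passed to the language model in place of an ordinary token-embedding lookup.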