Principle: DeepSeek-AI Janus Vision Encoding and Embedding Fusion
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Multimodal_AI |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A mechanism for encoding images through a vision transformer, projecting the resulting features into the language model's embedding space, and fusing them with text token embeddings at designated positions.
Description
Vision encoding and embedding fusion is the core step that enables the language model to "see" images. It processes pixel values through a dedicated vision encoder (SigLIP ViT in Janus), projects the vision features to match the language model's hidden dimension via an MLP aligner, and then replaces the placeholder image tokens in the text embedding sequence with the actual vision embeddings.
This follows the decoupled visual encoding principle of Janus: the understanding path uses a SigLIP ViT encoder specialized for visual perception, which is separate from the generation encoder. The aligner (MLP projector) bridges the dimension gap between the vision encoder output and the language model's hidden size.
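The aligner described above can be sketched as a small two-layer MLP that maps vision-encoder features to the language model's hidden size. This is a minimal numpy sketch, not the Janus implementation: the dimensions (1024 → 2048), the random weights, and the ReLU activation are illustrative assumptions (the real projector is a learned module, typically with a GELU activation).

```python
import numpy as np

# Illustrative dimensions (assumptions, not Janus's actual sizes):
# D_VIS  - SigLIP patch feature width
# D_LANG - language model hidden size
D_VIS, D_LANG = 1024, 2048

rng = np.random.default_rng(0)

# Random weights stand in for the learned aligner parameters.
W1 = rng.standard_normal((D_VIS, D_LANG)) * 0.02
b1 = np.zeros(D_LANG)
W2 = rng.standard_normal((D_LANG, D_LANG)) * 0.02
b2 = np.zeros(D_LANG)

def aligner(patch_features: np.ndarray) -> np.ndarray:
    """Project (N, D_vis) vision features to (N, D_lang)."""
    h = np.maximum(patch_features @ W1 + b1, 0.0)  # ReLU here; GELU in practice
    return h @ W2 + b2

patches = rng.standard_normal((576, D_VIS))  # e.g. a 24x24 patch grid
projected = aligner(patches)
print(projected.shape)  # (576, 2048)
```

After this projection, each image patch embedding has the same width as a text token embedding, which is what makes the replacement step possible.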
Usage
Use this principle in the multimodal understanding pipeline, after tokenization/batching and before autoregressive text generation. It converts the raw pixel values and token IDs into a single unified embedding sequence that the language model can process.
Theoretical Basis
The fusion pipeline operates in four steps:
- Vision encoding: Images are passed through the SigLIP ViT to produce patch-level features
- $E_{patch} = \text{SigLIP}(\text{pixel\_values}) \in \mathbb{R}^{b \times N \times D_{vis}}$
- Alignment projection: An MLP projector maps vision features to the language model's hidden dimension
- $E_{vis} = \text{MLP}(E_{patch}) \in \mathbb{R}^{b \times N \times D_{lang}}$
- Text embedding: Token IDs are converted to text embeddings via the LLM's embedding layer
- $E_{text} = \text{Embed}(\text{input\_ids}) \in \mathbb{R}^{b \times T \times D_{lang}}$
- Fusion: Vision embeddings replace text embeddings at image token positions using boolean masks
- $E_{text}[\text{images\_seq\_mask}] = E_{vis}[\text{images\_emb\_mask}]$
The result is a single embedding tensor where image regions contain vision features and text regions contain token embeddings.
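The mask-based fusion step above can be demonstrated with toy tensors. This is a hedged numpy sketch with invented sizes (batch 1, sequence length 10, 4 image tokens, hidden dimension 8); the mask names mirror the equation, but the values are placeholders, not Janus's real shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
b, T, N, D = 1, 10, 4, 8  # toy batch, seq len, image tokens, hidden dim

# Text embeddings for the full sequence, including placeholder image tokens.
E_text = rng.standard_normal((b, T, D))

# Vision embeddings already projected to the language dimension by the aligner.
E_vis = rng.standard_normal((b, N, D))

# Boolean masks: where the image placeholders sit in the text sequence,
# and which vision embeddings are valid (all of them in this toy case).
images_seq_mask = np.zeros((b, T), dtype=bool)
images_seq_mask[0, 3:7] = True           # 4 placeholder positions
images_emb_mask = np.ones((b, N), dtype=bool)

# Fusion: scatter vision embeddings into the placeholder slots.
E_text[images_seq_mask] = E_vis[images_emb_mask]

# Positions 3..6 now hold vision features; all other positions are unchanged text embeddings.
```

In a real pipeline the same operation is done on framework tensors (e.g. an in-place masked assignment), and the fused `E_text` is what gets passed to the language model in place of an ordinary token-embedding lookup.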