
Principle:DeepSeek-AI Janus Vision Encoding and Embedding Fusion

From Leeroopedia


Knowledge Sources
Domains: Computer_Vision, Multimodal_AI
Last Updated: 2026-02-10 09:30 GMT

Overview

A mechanism for encoding images through a vision transformer, projecting the resulting features into the language model's embedding space, and fusing them with text token embeddings at designated positions.

Description

Vision encoding and embedding fusion is the core step that enables the language model to "see" images. It processes pixel values through a dedicated vision encoder (SigLIP ViT in Janus), projects the vision features to match the language model's hidden dimension via an MLP aligner, and then replaces the placeholder image tokens in the text embedding sequence with the actual vision embeddings.

This follows the decoupled visual encoding principle of Janus: the understanding path uses a SigLIP ViT encoder specialized for visual perception, which is separate from the generation encoder. The aligner (MLP projector) bridges the dimension gap between the vision encoder output and the language model's hidden size.
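The aligner described above can be sketched as a small MLP that maps vision-encoder features into the language model's hidden space. This is a minimal illustration in NumPy; the layer sizes, depth, and ReLU activation are stand-ins, not Janus's actual configuration.

```python
import numpy as np

def mlp_aligner(f_vis, w1, b1, w2, b2):
    """Project vision features (..., D_vis) into the LLM space (..., D_lang).

    Janus uses an MLP projector for this step; the two-layer shape and
    activation here are illustrative assumptions.
    """
    h = np.maximum(f_vis @ w1 + b1, 0.0)  # hidden layer, ReLU as stand-in activation
    return h @ w2 + b2

# Illustrative dimensions, not the real Janus sizes.
D_vis, D_lang = 6, 12
rng = np.random.default_rng(1)
w1 = rng.normal(size=(D_vis, D_lang)); b1 = np.zeros(D_lang)
w2 = rng.normal(size=(D_lang, D_lang)); b2 = np.zeros(D_lang)

patches = rng.normal(size=(4, D_vis))  # four patch features from the vision encoder
aligned = mlp_aligner(patches, w1, b1, w2, b2)
assert aligned.shape == (4, D_lang)
```

Whatever the projector's internal shape, its only contract is that the trailing dimension of its output equals the language model's hidden size, so the projected features can be written directly into the text embedding sequence.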

Usage

Use this principle in the multimodal understanding pipeline, after tokenization/batching and before autoregressive text generation. It converts the raw pixel values and token IDs into a single unified embedding sequence that the language model can process.
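At this stage the tokenized prompt already contains placeholder tokens at the image positions, and a boolean mask marking those positions is what the fusion step consumes. A minimal sketch of deriving that mask, where the placeholder token ID is a hypothetical value (the real ID comes from the model's processor):

```python
import numpy as np

# Hypothetical placeholder ID for image tokens; in practice this comes
# from the tokenizer/processor configuration.
IMAGE_TOKEN_ID = 100015

# A tokenized prompt where four positions are image-token placeholders.
input_ids = np.array([[1, 5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID,
                       IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 9, 2]])

# Boolean mask marking where vision embeddings will replace text embeddings.
images_seq_mask = input_ids == IMAGE_TOKEN_ID
assert images_seq_mask.sum() == 4
```

The number of `True` entries must equal the number of vision embeddings produced for the batch; otherwise the masked assignment in the fusion step fails with a shape mismatch.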

Theoretical Basis

The fusion pipeline operates in four steps:

  1. Vision encoding: Images are passed through the SigLIP ViT to produce patch-level features
    F_vis = ViT(I) ∈ ℝ^{n × T² × D_vis}
  2. Alignment projection: An MLP projector maps vision features to the language dimension
    E_vis = Aligner(F_vis) ∈ ℝ^{n × T² × D_lang}
  3. Text embedding: Token IDs are converted to text embeddings via the LLM's embedding layer
    E_text = Embed(input_ids) ∈ ℝ^{b × T × D_lang}
  4. Fusion: Vision embeddings replace text embeddings at image token positions using boolean masks
    E_text[images_seq_mask] = E_vis[images_emb_mask]

The result is a single embedding tensor where image regions contain vision features and text regions contain token embeddings.
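The four steps above can be sketched end to end in NumPy. All dimensions, weights, and the ReLU activation are illustrative stand-ins (a random projection in place of SigLIP, a toy embedding table in place of the LLM's); only the shapes and the masked-assignment fusion mirror the pipeline described here.

```python
import numpy as np

rng = np.random.default_rng(0)

b, n = 1, 1          # batch size, number of images
T_img = 4            # patches per image (T² in the formulas above)
T = 10               # text sequence length
D_vis, D_lang = 8, 16  # toy dimensions, not Janus's real sizes

# 1. Vision encoding (stand-in for the SigLIP ViT): patch-level features.
F_vis = rng.normal(size=(n, T_img, D_vis))

# 2. Alignment projection: a two-layer MLP mapping D_vis -> D_lang
#    (ReLU used here as a stand-in activation).
w1 = rng.normal(size=(D_vis, D_lang)); b1 = np.zeros(D_lang)
w2 = rng.normal(size=(D_lang, D_lang)); b2 = np.zeros(D_lang)
E_vis = np.maximum(F_vis @ w1 + b1, 0.0) @ w2 + b2  # (n, T_img, D_lang)

# 3. Text embedding: look token IDs up in an embedding table.
vocab = 32
input_ids = rng.integers(0, vocab, size=(b, T))
embed_table = rng.normal(size=(vocab, D_lang))
E_text = embed_table[input_ids]                     # (b, T, D_lang)

# 4. Fusion: scatter vision embeddings into the image-token positions.
images_seq_mask = np.zeros((b, T), dtype=bool)
images_seq_mask[0, 2:2 + T_img] = True              # positions 2..5 hold the image
images_emb_mask = np.ones((n, T_img), dtype=bool)   # all patch embeddings are valid
E_text[images_seq_mask] = E_vis[images_emb_mask]

assert E_text.shape == (b, T, D_lang)
```

After the masked assignment, `E_text` is the unified embedding sequence: positions 2 through 5 carry projected vision features while every other position still carries its original token embedding, and this single tensor is what the language model consumes.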

Related Pages

Implemented By
