Principle: DeepSeek Janus Autoregressive VQ Token Generation
| Knowledge Sources | |
|---|---|
| Domains | Image_Generation, Autoregressive_Models |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A loop-based procedure for generating discrete VQ codebook indices one token at a time using an LLM backbone with a generation head, steered by classifier-free guidance (CFG).
Description
Autoregressive VQ token generation is the core image generation mechanism in Janus. Rather than generating continuous pixel values, the model generates a sequence of discrete codebook indices from a VQ-VAE vocabulary. Each token represents a spatial patch of the output image.
The generation loop runs for a fixed number of steps (576 tokens for a 384×384 image with 16×16 patches). At each step:
- The LLM backbone produces hidden states from the current embeddings
- The gen_head (a 2-layer MLP) projects hidden states to VQ codebook logits
- CFG combines conditional and unconditional logits
- A token is sampled from the resulting distribution
- The sampled token is converted to an embedding via prepare_gen_img_embeds for the next step
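The steps above can be sketched end to end with toy stand-ins for the backbone, head, and embedding table. Everything here (dimensions, `llm_forward`, the single-matrix head and embedding table) is illustrative, not the real Janus API; the real model uses N=576 tokens and the LLM's actual hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the real model uses N=576 tokens and a much
# larger codebook and hidden dimension.
N_TOKENS, HIDDEN, CODEBOOK = 8, 16, 32
CFG_SCALE = 5.0

W_head = rng.normal(size=(HIDDEN, CODEBOOK)) * 0.1   # stand-in for gen_head
E_code = rng.normal(size=(CODEBOOK, HIDDEN)) * 0.1   # stand-in codebook embedding table

def llm_forward(embeds):
    """Placeholder for the LLM backbone: one hidden state per sequence."""
    return np.tanh(embeds.sum(axis=1))  # (batch, HIDDEN)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# CFG runs the conditional and unconditional sequences as a parallel batch of 2.
embeds = rng.normal(size=(2, 1, HIDDEN))
tokens = []
for _ in range(N_TOKENS):
    h = llm_forward(embeds)                          # hidden states, both branches
    logits = h @ W_head                              # project to codebook logits
    cond, uncond = logits[0], logits[1]
    guided = uncond + CFG_SCALE * (cond - uncond)    # classifier-free guidance
    probs = softmax(guided)
    tok = int(rng.choice(CODEBOOK, p=probs))         # sample the next VQ index
    tokens.append(tok)
    # prepare_gen_img_embeds analogue: look up the token embedding and
    # append it to both branches for the next step.
    nxt = np.broadcast_to(E_code[tok], (2, 1, HIDDEN))
    embeds = np.concatenate([embeds, nxt], axis=1)
```

After the loop, `tokens` holds one discrete index per generated patch; in the real pipeline these go to the VQ-VAE decoder.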
Usage
Use this principle after CFG input preparation. The output is a tensor of VQ codebook indices that must be decoded by the VQ-VAE to produce pixel images.
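Before decoding, the flat token sequence is typically reshaped into the spatial grid the VQ-VAE decoder expects. A minimal sketch, assuming a 384×384 output (576 indices as a 24×24 grid); the `vq_model.decode_code` call named in the comment is a hypothetical placeholder, not a confirmed API:

```python
import numpy as np

batch, n_tokens, grid = 2, 576, 24
# Stand-in for sampled VQ codebook indices (codebook size assumed 16384 here).
indices = np.random.default_rng(1).integers(0, 16384, size=(batch, n_tokens))

# One index per 16x16 patch -> 24x24 spatial grid for a 384x384 image.
grid_indices = indices.reshape(batch, grid, grid)
# images = vq_model.decode_code(grid_indices)   # hypothetical decoder call
```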
Theoretical Basis
The autoregressive factorization for image tokens:

p(z_1, …, z_N | c) = ∏_{i=1}^{N} p(z_i | z_{<i}, c)

where z_i are VQ codebook indices, c is the text condition, and N = 576 (a 24×24 spatial grid).
At each step, classifier-free guidance adjusts the logits:

l_guided = l_uncond + s · (l_cond − l_uncond)

where s is the guidance scale.
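As a quick numeric check of the guidance rule (with an assumed scale s = 5), note that s = 1 would recover the conditional logits exactly, while s > 1 pushes further in the conditional direction:

```python
import numpy as np

scale = 5.0
logits_cond   = np.array([2.0, 0.5, -1.0])
logits_uncond = np.array([1.0, 0.5,  0.0])

# guided = uncond + s * (cond - uncond)
guided = logits_uncond + scale * (logits_cond - logits_uncond)
# -> [6.0, 0.5, -5.0]: where the two branches agree, the logit is unchanged;
#    where they disagree, the gap is amplified by the scale.
```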
The gen_head architecture is: Linear(D → D_img) → GELU → Linear(D_img → codebook_size)
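That two-layer shape can be written out directly. A minimal numpy sketch with toy sizes (the real D, D_img, and codebook_size come from the model config; the tanh GELU approximation is used here for self-containedness):

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_img, codebook_size = 64, 32, 128   # toy sizes, not the real config values

W1, b1 = rng.normal(size=(D, D_img)) * 0.1, np.zeros(D_img)
W2, b2 = rng.normal(size=(D_img, codebook_size)) * 0.1, np.zeros(codebook_size)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gen_head(h):
    """Linear(D -> D_img) -> GELU -> Linear(D_img -> codebook_size)."""
    return gelu(h @ W1 + b1) @ W2 + b2

logits = gen_head(rng.normal(size=(4, D)))   # batch of 4 hidden states
```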
prepare_gen_img_embeds maps each sampled token back into the LLM embedding space via gen_aligner(gen_embed(token)): a codebook embedding lookup (gen_embed) followed by an aligner MLP (gen_aligner).
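The lookup-then-align composition can be sketched as follows; the sizes are illustrative and the aligner is collapsed to a single linear map (the real gen_aligner is an MLP):

```python
import numpy as np

rng = np.random.default_rng(2)
codebook_size, d_img, d_llm = 128, 32, 64   # toy sizes

gen_embed = rng.normal(size=(codebook_size, d_img)) * 0.1   # codebook embedding table
W_align = rng.normal(size=(d_img, d_llm)) * 0.1             # stand-in for gen_aligner

def prepare_gen_img_embeds(token_ids):
    """gen_aligner(gen_embed(token)): lookup, then project into the LLM space."""
    return gen_embed[token_ids] @ W_align

embeds = prepare_gen_img_embeds(np.array([3, 17, 42]))  # three sampled tokens
```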