Principle: DeepSeek Janus Autoregressive VQ Token Generation
| Knowledge Sources | |
|---|---|
| Domains | Image_Generation, Autoregressive_Models |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
A loop-based procedure for generating discrete VQ codebook indices one token at a time using an LLM backbone with a generation head, steered by classifier-free guidance (CFG).
Description
Autoregressive VQ token generation is the core image generation mechanism in Janus. Rather than generating continuous pixel values, the model generates a sequence of discrete codebook indices from a VQ-VAE vocabulary. Each token represents a spatial patch of the output image.
The generation loop runs for a fixed number of steps (576 tokens for a 384×384 image with 16×16 patches). At each step:
- The LLM backbone produces hidden states from the current embeddings
- The gen_head (a 2-layer MLP) projects hidden states to VQ codebook logits
- CFG combines conditional and unconditional logits
- A token is sampled from the resulting distribution
- The sampled token is converted to an embedding via prepare_gen_img_embeds for the next step
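The steps above can be sketched end to end with toy stand-ins for the backbone, head, and embedding table. Everything here (dimensions, `llm_forward`, the single-matrix head and embedding table) is illustrative, not the real Janus API; the real model uses N=576 tokens and the LLM's actual hidden size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the real model uses N=576 tokens and a much
# larger codebook and hidden dimension.
N_TOKENS, HIDDEN, CODEBOOK = 8, 16, 32
CFG_SCALE = 5.0

W_head = rng.normal(size=(HIDDEN, CODEBOOK)) * 0.1   # stand-in for gen_head
E_code = rng.normal(size=(CODEBOOK, HIDDEN)) * 0.1   # stand-in codebook embedding table

def llm_forward(embeds):
    """Placeholder for the LLM backbone: one hidden state per sequence."""
    return np.tanh(embeds.sum(axis=1))  # (batch, HIDDEN)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# CFG runs the conditional and unconditional sequences as a parallel batch of 2.
embeds = rng.normal(size=(2, 1, HIDDEN))
tokens = []
for _ in range(N_TOKENS):
    h = llm_forward(embeds)                          # hidden states, both branches
    logits = h @ W_head                              # project to codebook logits
    cond, uncond = logits[0], logits[1]
    guided = uncond + CFG_SCALE * (cond - uncond)    # classifier-free guidance
    probs = softmax(guided)
    tok = int(rng.choice(CODEBOOK, p=probs))         # sample the next VQ index
    tokens.append(tok)
    # prepare_gen_img_embeds analogue: look up the token embedding and
    # append it to both branches for the next step.
    nxt = np.broadcast_to(E_code[tok], (2, 1, HIDDEN))
    embeds = np.concatenate([embeds, nxt], axis=1)
```

After the loop, `tokens` holds one discrete index per generated patch; in the real pipeline these go to the VQ-VAE decoder.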
Usage
Use this principle after CFG input preparation. The output is a tensor of VQ codebook indices that must be decoded by the VQ-VAE to produce pixel images.
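Before decoding, the flat token sequence is typically reshaped into the spatial grid the VQ-VAE decoder expects. A minimal sketch, assuming a 384×384 output (576 indices as a 24×24 grid); the `vq_model.decode_code` call named in the comment is a hypothetical placeholder, not a confirmed API:

```python
import numpy as np

batch, n_tokens, grid = 2, 576, 24
# Stand-in for sampled VQ codebook indices (codebook size assumed 16384 here).
indices = np.random.default_rng(1).integers(0, 16384, size=(batch, n_tokens))

# One index per 16x16 patch -> 24x24 spatial grid for a 384x384 image.
grid_indices = indices.reshape(batch, grid, grid)
# images = vq_model.decode_code(grid_indices)   # hypothetical decoder call
```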
Theoretical Basis
The autoregressive factorization for image tokens:

p(z_1, …, z_N | c) = ∏_{i=1}^{N} p(z_i | z_{<i}, c)

where z_i are VQ codebook indices, c is the text condition, and N = 576 (a 24×24 spatial grid).
At each step, classifier-free guidance adjusts the logits:

l_guided = l_uncond + s · (l_cond − l_uncond)

where s is the guidance scale.
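As a quick numeric check of the guidance rule (with an assumed scale s = 5), note that s = 1 would recover the conditional logits exactly, while s > 1 pushes further in the conditional direction:

```python
import numpy as np

scale = 5.0
logits_cond   = np.array([2.0, 0.5, -1.0])
logits_uncond = np.array([1.0, 0.5,  0.0])

# guided = uncond + s * (cond - uncond)
guided = logits_uncond + scale * (logits_cond - logits_uncond)
# -> [6.0, 0.5, -5.0]: where the two branches agree, the logit is unchanged;
#    where they disagree, the gap is amplified by the scale.
```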
The gen_head architecture is: Linear(D → D_img) → GELU → Linear(D_img → codebook_size)
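That two-layer shape can be written out directly. A minimal numpy sketch with toy sizes (the real D, D_img, and codebook_size come from the model config; the tanh GELU approximation is used here for self-containedness):

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_img, codebook_size = 64, 32, 128   # toy sizes, not the real config values

W1, b1 = rng.normal(size=(D, D_img)) * 0.1, np.zeros(D_img)
W2, b2 = rng.normal(size=(D_img, codebook_size)) * 0.1, np.zeros(codebook_size)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gen_head(h):
    """Linear(D -> D_img) -> GELU -> Linear(D_img -> codebook_size)."""
    return gelu(h @ W1 + b1) @ W2 + b2

logits = gen_head(rng.normal(size=(4, D)))   # batch of 4 hidden states
```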
prepare_gen_img_embeds maps each sampled token back into the LLM embedding space via gen_aligner(gen_embed(token)): a codebook embedding lookup (gen_embed) followed by an aligner MLP (gen_aligner).
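The lookup-then-align composition can be sketched as follows; the sizes are illustrative and the aligner is collapsed to a single linear map (the real gen_aligner is an MLP):

```python
import numpy as np

rng = np.random.default_rng(2)
codebook_size, d_img, d_llm = 128, 32, 64   # toy sizes

gen_embed = rng.normal(size=(codebook_size, d_img)) * 0.1   # codebook embedding table
W_align = rng.normal(size=(d_img, d_llm)) * 0.1             # stand-in for gen_aligner

def prepare_gen_img_embeds(token_ids):
    """gen_aligner(gen_embed(token)): lookup, then project into the LLM space."""
    return gen_embed[token_ids] @ W_align

embeds = prepare_gen_img_embeds(np.array([3, 17, 42]))  # three sampled tokens
```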