Workflow: DeepSeek-AI Janus Autoregressive Image Generation
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_AI, Image_Generation, VQ_VAE, Inference |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
End-to-end process for generating images from text prompts using the Janus/Janus-Pro autoregressive image generation pipeline with VQ-VAE decoding and classifier-free guidance.
Description
This workflow covers text-to-image generation using the Janus and Janus-Pro model variants. The generation pathway uses a dedicated visual generation head and VQ-VAE tokenizer, which is fully decoupled from the understanding encoder. A text prompt is formatted, tokenized, and fed through the LLM backbone which autoregressively predicts discrete image tokens one at a time. Classifier-free guidance (CFG) is applied by running paired conditional and unconditional forward passes. The generated token sequence is then decoded by the VQ-VAE into pixel space.
Key characteristics:
- Autoregressive token-by-token image generation through the shared LLM backbone
- VQ-VAE maps between discrete codebook indices and continuous pixel values
- Classifier-free guidance improves prompt adherence by combining conditional and unconditional logits
- Supports parallel generation of multiple images in a single batch
Usage
Execute this workflow when you need to generate images from text descriptions using Janus-1.3B, Janus-Pro-1B, or Janus-Pro-7B. This is the appropriate pipeline when using the autoregressive (non-flow) model variants and you want to produce one or more images matching a natural language prompt.
Execution Steps
Step 1: Model and Processor Loading
Load the pretrained Janus or Janus-Pro model and its VLChatProcessor from HuggingFace. The model includes the generation-specific components: a generation embedding layer (`gen_embed`), a generation head (`gen_head`) that maps LLM hidden states to VQ codebook logits, a generation aligner (`gen_aligner`), and the VQ-VAE decoder (`gen_vision_model`). Cast to bfloat16, move to GPU, and set to evaluation mode.
Key considerations:
- This workflow uses the `janus.models` import path (not `janus.janusflow.models`)
- The VQ-VAE model is initialized from the config's `gen_vision_config` section
- Both model and processor must come from the same model checkpoint
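A minimal loading sketch, following the pattern in the public Janus repository examples (`VLChatProcessor` from `janus.models`, `AutoModelForCausalLM` with `trust_remote_code`). The checkpoint path is one of the released variants; treat this as illustrative rather than the project's exact script. Imports are deferred inside the function so the sketch can be read without torch/janus installed.

```python
def load_janus(model_path: str = "deepseek-ai/Janus-Pro-7B"):
    """Load a Janus/Janus-Pro checkpoint and its processor (sketch)."""
    import torch
    from transformers import AutoModelForCausalLM
    # Note: janus.models, NOT janus.janusflow.models (that path is the flow variant).
    from janus.models import VLChatProcessor

    # Processor and model must come from the same checkpoint.
    processor = VLChatProcessor.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    # bfloat16 on GPU, eval mode. gen_embed / gen_head / gen_aligner /
    # gen_vision_model are all components of this single checkpoint.
    model = model.to(torch.bfloat16).cuda().eval()
    return model, processor
```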
Step 2: Prompt Formatting
Structure the text prompt as a conversation with a User message containing the image description and an empty Assistant message. Apply the SFT template to format the conversation, then append the `image_start_tag` token to signal the beginning of image generation.
Key considerations:
- The system prompt is set to empty string for generation tasks
- The `image_start_tag` (e.g., `<begin_of_image>`) must be appended after the formatted prompt
- This differs from understanding where image placeholders go inside the user message
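The real pipeline builds this string via the processor's SFT template helper (`apply_sft_template_for_multi_turn_prompts`); the sketch below is a pure-string stand-in that assumes a simple `User:` / `Assistant:` turn layout, with the `<begin_of_image>` tag value taken from this document. Only the structure matters: empty system prompt, user description, empty assistant turn, then the image start tag.

```python
IMAGE_START_TAG = "<begin_of_image>"  # example tag value from this document

def format_generation_prompt(description: str, system_prompt: str = "") -> str:
    # Hypothetical stand-in for the processor's SFT template:
    # a User turn holding the image description and an empty Assistant turn.
    sft = f"{system_prompt}\n\nUser: {description}\n\nAssistant:".lstrip()
    # Appending image_start_tag switches the model into image-token generation.
    return sft + IMAGE_START_TAG
```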
Step 3: Input Preparation for Classifier-Free Guidance
Encode the prompt into token IDs and create a batch with paired conditional and unconditional sequences for CFG. The conditional sequences contain the full prompt tokens, while the unconditional sequences have all interior tokens replaced with the pad token (retaining only the BOS and the image_start tokens). This interleaved batch (conditional at even indices, unconditional at odd indices) is embedded through the language model's input embedding layer.
Key considerations:
- The batch size is `parallel_size * 2` to accommodate both conditional and unconditional paths
- Unconditional sequences mask out the prompt content by replacing with pad_id
- The `parallel_size` parameter controls how many images are generated simultaneously (default: 16)
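The interleaving and masking can be sketched in NumPy (the real pipeline uses torch tensors, with `pad_id` taken from the processor; the embedding lookup that follows is omitted here):

```python
import numpy as np

def build_cfg_batch(input_ids, parallel_size, pad_id):
    """Build the interleaved CFG batch of shape [parallel_size * 2, seq_len].

    Even rows are conditional (full prompt). Odd rows are unconditional:
    every interior token is replaced with pad_id, keeping only the first
    token (BOS) and the last token (image_start_tag).
    """
    tokens = np.tile(np.asarray(input_ids, dtype=np.int64), (parallel_size * 2, 1))
    tokens[1::2, 1:-1] = pad_id  # mask prompt content on odd (unconditional) rows
    return tokens
```

Each conditional/unconditional pair shares one image slot, which is why the batch dimension is `parallel_size * 2`.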
Step 4: Autoregressive Token Generation Loop
Iterate for `image_token_num_per_image` steps (default: 576, representing a 24x24 grid of tokens). At each step:
- Run the LLM backbone forward pass on the current embeddings using KV-cache for efficiency
- Extract the last hidden state and pass it through the generation head to get logits over the VQ codebook
- Apply classifier-free guidance by combining conditional and unconditional logits: `logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)`
- Sample the next token from the softmax distribution (with temperature scaling)
- Map the sampled token through `prepare_gen_img_embeds` to get the embedding for the next iteration
What happens:
- Each iteration produces one discrete VQ token per image in the batch
- The generation head projects LLM hidden states to the VQ-VAE codebook vocabulary
- CFG weight (default: 5.0) controls fidelity vs. diversity tradeoff
- Temperature controls sampling randomness
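The per-step CFG combination and temperature sampling can be demonstrated in isolation (a NumPy sketch over a toy 3-entry codebook; the real loop operates on torch logits over the full VQ vocabulary):

```python
import numpy as np

def cfg_sample(logit_cond, logit_uncond, cfg_weight=5.0, temperature=1.0, seed=0):
    """One classifier-free-guidance sampling step over the VQ codebook."""
    logit_cond = np.asarray(logit_cond, dtype=np.float64)
    logit_uncond = np.asarray(logit_uncond, dtype=np.float64)
    # CFG: extrapolate from the unconditional logits toward the conditional ones.
    logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(np.random.default_rng(seed).choice(len(probs), p=probs))
```

In the real loop, the sampled token index is then passed through `prepare_gen_img_embeds` to produce the input embedding for the next iteration.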
Step 5: VQ-VAE Decoding
Decode the full sequence of generated VQ tokens (576 per image) back into pixel space using the VQ-VAE model's `decode_code` method. The tokens are reshaped to the spatial grid layout and passed through the VQ-VAE decoder which maps codebook indices to continuous feature maps and reconstructs the image.
Key considerations:
- The decode shape is `[batch, 8, img_size//patch_size, img_size//patch_size]`, where 8 is the number of channels in the codebook embedding space
- Default output is 384x384 pixels with 16x16 patches (24x24 spatial grid)
- The decoded values are in [-1, 1] range and must be rescaled to [0, 255]
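The shape arithmetic above can be made explicit. In the upstream code the decode call looks roughly like `gen_vision_model.decode_code(tokens, shape=[parallel_size, 8, 24, 24])`; this sketch only derives that shape from the document's defaults:

```python
IMG_SIZE = 384
PATCH_SIZE = 16
GRID = IMG_SIZE // PATCH_SIZE       # 24 tokens per side
TOKENS_PER_IMAGE = GRID * GRID      # 576 VQ tokens per image

def decode_shape(parallel_size, codebook_channels=8):
    """Shape argument passed to the VQ-VAE's decode_code (per this document)."""
    return [parallel_size, codebook_channels, GRID, GRID]
```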
Step 6: Post-Processing and Saving
Convert the decoded tensor from the [-1, 1] range to [0, 255] uint8 pixel values. Transpose from channel-first to channel-last format (HWC). Create PIL Image objects from the numpy arrays and save to disk as JPEG files.
Key considerations:
- Values are clipped to valid range before casting to uint8
- Each image in the parallel batch is saved as a separate file
- Images can optionally be upscaled from 384x384 to higher resolutions (e.g., 1024x1024 via Lanczos resampling in the Gradio demo)
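The rescale/clip/transpose step is plain array math and can be sketched directly (NumPy; the decoder output is assumed to be `[batch, 3, H, W]` floats in [-1, 1]):

```python
import numpy as np

def to_uint8_images(dec):
    """Convert decoder output [batch, 3, H, W] in [-1, 1] to uint8 HWC images."""
    dec = np.transpose(np.asarray(dec, dtype=np.float32), (0, 2, 3, 1))  # CHW -> HWC
    dec = np.clip((dec + 1.0) / 2.0 * 255.0, 0.0, 255.0)  # rescale, clip to valid range
    return dec.astype(np.uint8)
```

Each array in the batch can then be saved with Pillow, e.g. `PIL.Image.fromarray(img).save(path)`, optionally resized first with Lanczos resampling.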