Workflow: DeepSeek-AI Janus Autoregressive Image Generation
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_AI, Image_Generation, VQ_VAE, Inference |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
End-to-end process for generating images from text prompts using the Janus/Janus-Pro autoregressive image generation pipeline with VQ-VAE decoding and classifier-free guidance.
Description
This workflow covers text-to-image generation using the Janus and Janus-Pro model variants. The generation pathway uses a dedicated visual generation head and VQ-VAE tokenizer, which is fully decoupled from the understanding encoder. A text prompt is formatted, tokenized, and fed through the LLM backbone which autoregressively predicts discrete image tokens one at a time. Classifier-free guidance (CFG) is applied by running paired conditional and unconditional forward passes. The generated token sequence is then decoded by the VQ-VAE into pixel space.
Key characteristics:
- Autoregressive token-by-token image generation through the shared LLM backbone
- VQ-VAE maps between discrete codebook indices and continuous pixel values
- Classifier-free guidance improves prompt adherence by combining conditional and unconditional logits
- Supports parallel generation of multiple images in a single batch
Usage
Execute this workflow when you need to generate images from text descriptions using Janus-1.3B, Janus-Pro-1B, or Janus-Pro-7B. This is the appropriate pipeline when using the autoregressive (non-flow) model variants and you want to produce one or more images matching a natural language prompt.
Execution Steps
Step 1: Model and Processor Loading
Load the pretrained Janus or Janus-Pro model and its VLChatProcessor from HuggingFace. The model includes the generation-specific components: a generation embedding layer (`gen_embed`), a generation head (`gen_head`) that maps LLM hidden states to VQ codebook logits, a generation aligner (`gen_aligner`), and the VQ-VAE decoder (`gen_vision_model`). Cast to bfloat16, move to GPU, and set to evaluation mode.
Key considerations:
- This workflow uses the `janus.models` import path (not `janus.janusflow.models`)
- The VQ-VAE model is initialized from the config's `gen_vision_config` section
- Both model and processor must come from the same model checkpoint
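A minimal loading sketch, following the pattern in the public Janus repository examples (`VLChatProcessor` from `janus.models`, `AutoModelForCausalLM` with `trust_remote_code`). The checkpoint path is one of the released variants; treat this as illustrative rather than the project's exact script. Imports are deferred inside the function so the sketch can be read without torch/janus installed.

```python
def load_janus(model_path: str = "deepseek-ai/Janus-Pro-7B"):
    """Load a Janus/Janus-Pro checkpoint and its processor (sketch)."""
    import torch
    from transformers import AutoModelForCausalLM
    # Note: janus.models, NOT janus.janusflow.models (that path is the flow variant).
    from janus.models import VLChatProcessor

    # Processor and model must come from the same checkpoint.
    processor = VLChatProcessor.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    # bfloat16 on GPU, eval mode. gen_embed / gen_head / gen_aligner /
    # gen_vision_model are all components of this single checkpoint.
    model = model.to(torch.bfloat16).cuda().eval()
    return model, processor
```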
Step 2: Prompt Formatting
Structure the text prompt as a conversation with a User message containing the image description and an empty Assistant message. Apply the SFT template to format the conversation, then append the `image_start_tag` token to signal the beginning of image generation.
Key considerations:
- The system prompt is set to empty string for generation tasks
- The `image_start_tag` (e.g., `<begin_of_image>`) must be appended after the formatted prompt
- This differs from understanding where image placeholders go inside the user message
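The real pipeline builds this string via the processor's SFT template helper (`apply_sft_template_for_multi_turn_prompts`); the sketch below is a pure-string stand-in that assumes a simple `User:` / `Assistant:` turn layout, with the `<begin_of_image>` tag value taken from this document. Only the structure matters: empty system prompt, user description, empty assistant turn, then the image start tag.

```python
IMAGE_START_TAG = "<begin_of_image>"  # example tag value from this document

def format_generation_prompt(description: str, system_prompt: str = "") -> str:
    # Hypothetical stand-in for the processor's SFT template:
    # a User turn holding the image description and an empty Assistant turn.
    sft = f"{system_prompt}\n\nUser: {description}\n\nAssistant:".lstrip()
    # Appending image_start_tag switches the model into image-token generation.
    return sft + IMAGE_START_TAG
```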
Step 3: Input Preparation for Classifier-Free Guidance
Encode the prompt into token IDs and create a batch with paired conditional and unconditional sequences for CFG. The conditional sequences contain the full prompt tokens, while the unconditional sequences have all interior tokens replaced with the pad token (retaining only the BOS and the image_start tokens). This interleaved batch (conditional at even indices, unconditional at odd indices) is embedded through the language model's input embedding layer.
Key considerations:
- The batch size is `parallel_size * 2` to accommodate both conditional and unconditional paths
- Unconditional sequences mask out the prompt content by replacing with pad_id
- The `parallel_size` parameter controls how many images are generated simultaneously (default: 16)
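The interleaving and masking can be sketched in NumPy (the real pipeline uses torch tensors, with `pad_id` taken from the processor; the embedding lookup that follows is omitted here):

```python
import numpy as np

def build_cfg_batch(input_ids, parallel_size, pad_id):
    """Build the interleaved CFG batch of shape [parallel_size * 2, seq_len].

    Even rows are conditional (full prompt). Odd rows are unconditional:
    every interior token is replaced with pad_id, keeping only the first
    token (BOS) and the last token (image_start_tag).
    """
    tokens = np.tile(np.asarray(input_ids, dtype=np.int64), (parallel_size * 2, 1))
    tokens[1::2, 1:-1] = pad_id  # mask prompt content on odd (unconditional) rows
    return tokens
```

Each conditional/unconditional pair shares one image slot, which is why the batch dimension is `parallel_size * 2`.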
Step 4: Autoregressive Token Generation Loop
Iterate for `image_token_num_per_image` steps (default: 576, representing a 24x24 grid of tokens). At each step:
- Run the LLM backbone forward pass on the current embeddings using KV-cache for efficiency
- Extract the last hidden state and pass it through the generation head to get logits over the VQ codebook
- Apply classifier-free guidance by combining conditional and unconditional logits: `logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)`
- Sample the next token from the softmax distribution (with temperature scaling)
- Map the sampled token through `prepare_gen_img_embeds` to get the embedding for the next iteration
What happens:
- Each iteration produces one discrete VQ token per image in the batch
- The generation head projects LLM hidden states to the VQ-VAE codebook vocabulary
- CFG weight (default: 5.0) controls fidelity vs. diversity tradeoff
- Temperature controls sampling randomness
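The per-step CFG combination and temperature sampling can be demonstrated in isolation (a NumPy sketch over a toy 3-entry codebook; the real loop operates on torch logits over the full VQ vocabulary):

```python
import numpy as np

def cfg_sample(logit_cond, logit_uncond, cfg_weight=5.0, temperature=1.0, seed=0):
    """One classifier-free-guidance sampling step over the VQ codebook."""
    logit_cond = np.asarray(logit_cond, dtype=np.float64)
    logit_uncond = np.asarray(logit_uncond, dtype=np.float64)
    # CFG: extrapolate from the unconditional logits toward the conditional ones.
    logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(np.random.default_rng(seed).choice(len(probs), p=probs))
```

In the real loop, the sampled token index is then passed through `prepare_gen_img_embeds` to produce the input embedding for the next iteration.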
Step 5: VQ-VAE Decoding
Decode the full sequence of generated VQ tokens (576 per image) back into pixel space using the VQ-VAE model's `decode_code` method. The tokens are reshaped to the spatial grid layout and passed through the VQ-VAE decoder which maps codebook indices to continuous feature maps and reconstructs the image.
Key considerations:
- The decode shape is `[batch, 8, img_size//patch_size, img_size//patch_size]`, where 8 is the number of channels in the codebook embedding space
- Default output is 384x384 pixels with 16x16 patches (24x24 spatial grid)
- The decoded values are in [-1, 1] range and must be rescaled to [0, 255]
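The shape arithmetic above can be made explicit. In the upstream code the decode call looks roughly like `gen_vision_model.decode_code(tokens, shape=[parallel_size, 8, 24, 24])`; this sketch only derives that shape from the document's defaults:

```python
IMG_SIZE = 384
PATCH_SIZE = 16
GRID = IMG_SIZE // PATCH_SIZE       # 24 tokens per side
TOKENS_PER_IMAGE = GRID * GRID      # 576 VQ tokens per image

def decode_shape(parallel_size, codebook_channels=8):
    """Shape argument passed to the VQ-VAE's decode_code (per this document)."""
    return [parallel_size, codebook_channels, GRID, GRID]
```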
Step 6: Post-Processing and Saving
Convert the decoded tensor from the [-1, 1] range to [0, 255] uint8 pixel values. Transpose from channel-first to channel-last format (HWC). Create PIL Image objects from the numpy arrays and save to disk as JPEG files.
Key considerations:
- Values are clipped to valid range before casting to uint8
- Each image in the parallel batch is saved as a separate file
- Images can optionally be upscaled from 384x384 to higher resolutions (e.g., 1024x1024 via Lanczos resampling in the Gradio demo)
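The rescale/clip/transpose step is plain array math and can be sketched directly (NumPy; the decoder output is assumed to be `[batch, 3, H, W]` floats in [-1, 1]):

```python
import numpy as np

def to_uint8_images(dec):
    """Convert decoder output [batch, 3, H, W] in [-1, 1] to uint8 HWC images."""
    dec = np.transpose(np.asarray(dec, dtype=np.float32), (0, 2, 3, 1))  # CHW -> HWC
    dec = np.clip((dec + 1.0) / 2.0 * 255.0, 0.0, 255.0)  # rescale, clip to valid range
    return dec.astype(np.uint8)
```

Each array in the batch can then be saved with Pillow, e.g. `PIL.Image.fromarray(img).save(path)`, optionally resized first with Lanczos resampling.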