Workflow:Deepseek ai Janus Rectified Flow Image Generation
| Knowledge Sources | |
|---|---|
| Domains | Multimodal_AI, Image_Generation, Rectified_Flow, Inference |
| Last Updated | 2026-02-10 09:30 GMT |
Overview
End-to-end process for generating images from text prompts using the JanusFlow rectified flow pipeline with ODE-based denoising and SDXL VAE decoding.
Description
This workflow covers text-to-image generation using the JanusFlow model variant. Unlike the autoregressive VQ-VAE approach in Janus/Janus-Pro, JanusFlow uses rectified flow (a continuous generative modeling method) integrated directly into the LLM framework. Starting from Gaussian noise in a latent space, the process iteratively denoises through an ODE solver, where each step encodes the current latent through a ShallowUViT encoder, passes it through the LLM backbone for guidance, and decodes the velocity prediction through a ShallowUViT decoder. The final denoised latent is decoded to pixels via an external SDXL VAE.
Key characteristics:
- Continuous flow-based generation rather than discrete token sampling
- Uses ShallowUViT encoder/decoder pair for latent-to-embedding and embedding-to-velocity transformations
- Requires an external SDXL VAE (from Stability AI) for final pixel decoding
- Classifier-free guidance operates in the continuous velocity space
- Generates an image in a fixed number of ODE steps (30 by default), rather than sampling image tokens one at a time as in the autoregressive approach
Usage
Execute this workflow when you need to generate images from text descriptions using the JanusFlow-1.3B model. This pipeline is preferred when you want flow-based generation, a controllable number of inference steps (typically 30), and a single model that handles both understanding and generation tasks.
Execution Steps
Step 1: Model and Processor Loading
Load the pretrained JanusFlow model and its VLChatProcessor. The JanusFlow model includes flow-specific components: a ShallowUViT encoder (`vision_gen_enc_model`), a ShallowUViT decoder (`vision_gen_dec_model`), linear aligners that bridge the UViT and LLM embedding spaces, and an RMS normalization layer. Additionally, load the external SDXL VAE from Stability AI for final pixel decoding. All models are cast to bfloat16 and moved to the GPU.
Key considerations:
- Use the `janus.janusflow.models` import path for JanusFlow
- The SDXL VAE must be loaded separately from `stabilityai/sdxl-vae` via `diffusers.models.AutoencoderKL`
- The VAE specifically requires bfloat16 precision (fp16 does not work correctly)
- Both the JanusFlow model and the VAE must be on the same device
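The loading step can be sketched as below. This is a minimal sketch, not the repository's exact script: the repo id `deepseek-ai/JanusFlow-1.3B`, the class name `MultiModalityCausalLM`, the `trust_remote_code=True` argument, and the helper name `load_janusflow` are assumptions that do not appear on this page. Heavy imports sit inside the function so the module can be imported on a machine without a GPU.

```python
MODEL_ID = "deepseek-ai/JanusFlow-1.3B"  # assumed Hugging Face repo id
VAE_ID = "stabilityai/sdxl-vae"          # external SDXL VAE named in this step


def load_janusflow(model_id=MODEL_ID, vae_id=VAE_ID, device="cuda"):
    """Load the processor, JanusFlow model, and SDXL VAE in bfloat16 on one device."""
    # Imported lazily: these packages are large and need not be present
    # just to import this module.
    import torch
    from diffusers.models import AutoencoderKL
    from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor

    processor = VLChatProcessor.from_pretrained(model_id)

    model = MultiModalityCausalLM.from_pretrained(model_id, trust_remote_code=True)
    model = model.to(torch.bfloat16).to(device).eval()

    # The VAE is cast to bfloat16 as well; this page notes that fp16 does not work.
    vae = AutoencoderKL.from_pretrained(vae_id)
    vae = vae.to(torch.bfloat16).to(device).eval()

    return processor, model, vae
```

Calling `load_janusflow()` once and reusing the returned objects keeps the model and the VAE on the same device, as required above.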
Step 2: Prompt Formatting
Structure the text prompt as a conversation and apply the SFT template. Append the `image_gen_tag` to signal the start of image generation, then tokenize the formatted prompt into input IDs.
Key considerations:
- JanusFlow uses `image_gen_tag` property instead of `image_start_tag`
- The system prompt is set to empty string for generation
- The formatted prompt ends with a special begin-of-generation token
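A possible shape for this step is sketched below. The helper name `build_generation_prompt`, the role labels `"User"` and `"Assistant"`, and the method `apply_sft_template_for_multi_turn_prompts` are assumptions based on how the Janus processors are commonly used; only `image_gen_tag` and the empty system prompt come from this page.

```python
def build_generation_prompt(processor, text):
    """Format `text` for image generation and return (prompt, input_ids)."""
    conversation = [
        {"role": "User", "content": text},     # role labels are assumed
        {"role": "Assistant", "content": ""},  # empty turn to be "filled" by the image
    ]
    sft_prompt = processor.apply_sft_template_for_multi_turn_prompts(
        conversations=conversation,
        sft_format=processor.sft_format,
        system_prompt="",  # empty system prompt for generation
    )
    # JanusFlow appends image_gen_tag (not image_start_tag).
    prompt = sft_prompt + processor.image_gen_tag
    input_ids = processor.tokenizer.encode(prompt)
    return prompt, input_ids
```

The function only relies on attributes of the processor object, so it can be exercised with a small stand-in processor before the real model is downloaded.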
Step 3: Input Preparation for Classifier-Free Guidance
Create a batch of duplicated token sequences for the conditional and unconditional CFG paths. The first half contains the full prompt tokens (conditional) and the second half has the prompt tokens replaced with `pad_id` (unconditional). Embed the batch through the language model's input embedding layer and remove the last token position (the begin-of-generation token), since it will be replaced by a timestep embedding in the ODE loop.
Key considerations:
- Batch size is `batchsize * 2` for CFG (default batchsize: 5)
- The last token is trimmed because the timestep embedding replaces the begin-of-generation token
- An attention mask is constructed where unconditional sequences have zeros for prompt positions
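The batch layout and mask can be illustrated with plain Python lists standing in for tensors, so the sketch runs without PyTorch. Two details are assumptions rather than statements from this page: the first token of each unconditional row is left unmasked, and the mask is extended by 577 positions (1 timestep embedding plus 576 latent positions). The helper name `build_cfg_inputs` is hypothetical.

```python
def build_cfg_inputs(input_ids, batchsize, pad_id, num_generated_positions=577):
    """Return (tokens, attention_mask) as nested lists for a 2*batchsize CFG batch."""
    # Rows [0, batchsize) are conditional; rows [batchsize, 2*batchsize) are unconditional.
    tokens = [list(input_ids) for _ in range(2 * batchsize)]
    for row in tokens[batchsize:]:
        row[1:] = [pad_id] * (len(row) - 1)  # keep the first token, pad the rest

    # The trailing begin-of-generation token is dropped after embedding,
    # so the prompt occupies one position fewer than len(input_ids).
    prompt_len = len(input_ids) - 1
    total_len = prompt_len + num_generated_positions

    attention_mask = [[1] * total_len for _ in range(2 * batchsize)]
    for row in attention_mask[batchsize:]:
        row[1:prompt_len] = [0] * (prompt_len - 1)  # hide the padded prompt positions
    return tokens, attention_mask


tokens, mask = build_cfg_inputs([100, 7, 8, 9, 200], batchsize=2, pad_id=0)
```

In a tensor implementation the same effect would come from slice assignment on a stacked `[2 * batchsize, seq_len]` tensor.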
Step 4: Noise Initialization
Sample Gaussian noise in the latent space with shape `[batchsize, 4, 48, 48]` matching the SDXL VAE's latent resolution. Compute the fixed timestep increment `dt = 1.0 / num_inference_steps` as a tensor for the Euler ODE solver.
Key considerations:
- The latent shape `[B, 4, 48, 48]` corresponds to the VAE's 8x downsampling from 384x384 pixel space
- The 4 channels match the SDXL VAE's latent dimension
- The noise is sampled in bfloat16 precision
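The shape arithmetic and step size can be checked with a short standard-library sketch. The helper names are hypothetical, and a flat list of floats stands in for the bfloat16 tensor that a PyTorch implementation would sample directly in the shape `[batchsize, 4, 48, 48]`.

```python
import math
import random


def latent_shape(batchsize, image_size=384, vae_downsample=8, latent_channels=4):
    """Latent shape implied by the pixel resolution and the VAE's 8x downsampling."""
    side = image_size // vae_downsample  # 384 // 8 == 48
    return (batchsize, latent_channels, side, side)


def sample_noise(shape, seed=None):
    """Draw standard Gaussian noise as a flat list with prod(shape) entries."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(math.prod(shape))]


num_inference_steps = 30
dt = 1.0 / num_inference_steps      # fixed Euler step size
shape = latent_shape(batchsize=5)   # (5, 4, 48, 48)
noise = sample_noise(shape, seed=0)
```

With 30 steps of size `dt`, the integration covers the whole interval from t = 0 (pure noise) to t = 1 (data).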
Step 5: ODE Denoising Loop
Iterate for `num_inference_steps` (default: 30) to progressively denoise the latent from pure noise to a clean sample. At each step:
- Encode: Pass the current latent (duplicated for CFG) and the current timestep through the ShallowUViT encoder, producing spatial embeddings, a timestep embedding, and hidden states
- Align: Reshape and project the spatial embeddings through the generation encoder aligner to match the LLM's hidden dimension
- LLM Forward: Concatenate the prompt embeddings, timestep embedding, and aligned latent embeddings, then run through the LLM backbone with KV-cache (text prompt portion cached after the first step)
- Decode: Extract the last 576 positions from the LLM hidden states, normalize with RMSNorm, project through the decoder aligner, reshape to spatial format, and pass through the ShallowUViT decoder along with the encoder hidden states and timestep embedding to predict the velocity field
- CFG: Combine conditional and unconditional velocity predictions: `v = cfg_weight * v_cond - (cfg_weight - 1) * v_uncond`
- Euler Step: Update the latent: `z = z + dt * v`
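The CFG and Euler update rules above can be tried on a toy problem. In this sketch `velocity_fn` is a hypothetical stand-in for the whole encode, align, LLM, decode chain, and scalars stand in for latent tensors. The toy velocity field is the exact field of a straight-line path toward a known target, which is an illustrative assumption, not the learned model.

```python
def cfg_combine(v_cond, v_uncond, cfg_weight):
    """Classifier-free guidance in velocity space, as written in the CFG step."""
    return cfg_weight * v_cond - (cfg_weight - 1.0) * v_uncond


def euler_sample(velocity_fn, z0, num_inference_steps=30):
    """Integrate dz/dt = velocity_fn(z, t) from t = 0 to t = 1 with fixed Euler steps."""
    dt = 1.0 / num_inference_steps
    z = z0
    for step in range(num_inference_steps):
        t = step / num_inference_steps  # t runs over 0, 1/N, ..., (N - 1)/N
        z = z + dt * velocity_fn(z, t)
    return z


# Toy check: for a straight path from z0 to `target`, the velocity at (z, t)
# is (target - z) / (1 - t), so Euler integration should land on the target.
target = 3.0


def toy_velocity(z, t):
    return (target - z) / (1.0 - t)


sample = euler_sample(toy_velocity, z0=-1.25)
```

Note that `cfg_combine` reduces to the conditional velocity when `cfg_weight` is 1, and that the formula is algebraically the same as `v_uncond + cfg_weight * (v_cond - v_uncond)`.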
What happens:
- The LLM backbone serves as the core denoising network, guided by text context
- The KV-cache avoids recomputing the prompt's attention keys and values at every step
- The ShallowUViT encoder/decoder handle the latent-to-LLM and LLM-to-velocity transformations
- Each step moves the latent along the learned flow trajectory from noise toward the data distribution
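To see what the KV-cache saves, the number of positions the LLM processes per step can be counted. This is a rough accounting sketch under the assumption that only the text-prompt keys and values are reused and that the timestep and latent positions are recomputed every step; the function name is hypothetical.

```python
NUM_LATENT_TOKENS = 576  # positions extracted from the LLM output at each step


def llm_positions_per_step(prompt_len, num_inference_steps=30):
    """Positions fed through the LLM at each ODE step when the prompt is cached."""
    new_positions = 1 + NUM_LATENT_TOKENS  # timestep embedding + latent embeddings
    first_step = prompt_len + new_positions  # the prompt is processed once
    return [first_step] + [new_positions] * (num_inference_steps - 1)


counts = llm_positions_per_step(prompt_len=20)
saved = 30 * (20 + 1 + NUM_LATENT_TOKENS) - sum(counts)  # positions not recomputed
```

For short prompts the saving is small relative to the 577 positions recomputed at each step, so most of the per-step cost remains in the latent positions.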
Step 6: VAE Decoding
Pass the final denoised latent through the SDXL VAE decoder to produce pixel-space images. The latent is first divided by the VAE's scaling factor before decoding.
Key considerations:
- The VAE scaling factor is read from `vae.config.scaling_factor`
- The output is nominally in the [-1, 1] range and must be rescaled for display
- The VAE operates on the full batch simultaneously
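The decoding call can be sketched as a one-line helper. It assumes `vae` behaves like a `diffusers` `AutoencoderKL`, whose `decode` returns an object with a `.sample` attribute; the helper name `decode_latents` is hypothetical.

```python
def decode_latents(vae, z):
    """Undo the VAE latent scaling and decode the whole batch to pixel space."""
    # In real use this would normally run with gradients disabled,
    # for example under torch.inference_mode().
    return vae.decode(z / vae.config.scaling_factor).sample
```

Because the helper only touches `vae.config.scaling_factor` and `vae.decode`, it can be unit-tested with a small stub object in place of the real VAE.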
Step 7: Post-Processing and Saving
Clip the decoded pixel values to [-1, 1] and rescale them to [0, 1]. Save the result with `torchvision.utils.save_image`, which accepts float tensors in [0, 1], or convert to uint8 and build PIL Images for display.
Key considerations:
- Images are produced at 384x384 by default, the resolution implied by the 48x48 latent and the VAE's 8x upsampling
- The Gradio demo upscales to 1024x1024 via Lanczos resampling for display
- Multiple images from the batch can be saved individually
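The per-pixel arithmetic of this step is shown below for a single value. This is a minimal sketch: `to_uint8` is a hypothetical helper, and rounding half up is an assumed quantization choice. A tensor implementation would apply the same clip, scale, and shift to the whole batch at once.

```python
def to_uint8(value):
    """Map a decoded pixel value from [-1, 1] to an integer in [0, 255]."""
    clipped = max(-1.0, min(1.0, value))  # clip to [-1, 1]
    unit = clipped * 0.5 + 0.5            # rescale to [0, 1]
    return int(unit * 255.0 + 0.5)        # quantize, rounding half up


pixels = [to_uint8(v) for v in (-7.0, -1.0, 0.0, 1.0, 5.0)]
```

Out-of-range values are clipped before rescaling, so a slight overshoot from the VAE maps to pure black or pure white instead of wrapping around.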