Workflow:Deepseek ai Janus Rectified Flow Image Generation

Knowledge Sources	Janus JanusFlow HuggingFace Models Diffusers AutoencoderKL
Domains	Multimodal_AI, Image_Generation, Rectified_Flow, Inference
Last Updated	2026-02-10 09:30 GMT

Overview

End-to-end process for generating images from text prompts using the JanusFlow rectified flow pipeline with ODE-based denoising and SDXL VAE decoding.

Description

This workflow covers text-to-image generation with the JanusFlow model variant. Unlike the autoregressive VQ-VAE approach in Janus/Janus-Pro, JanusFlow integrates rectified flow (a continuous generative modeling method) directly into the LLM framework. Generation starts from Gaussian noise in a latent space and denoises it iteratively with an ODE solver. Each step encodes the current latent through a ShallowUViT encoder, passes the result through the LLM backbone together with the prompt context, and decodes the velocity prediction through a ShallowUViT decoder. The final denoised latent is decoded to pixels by an external SDXL VAE.

Key characteristics:

Continuous flow-based generation rather than discrete token sampling
Uses ShallowUViT encoder/decoder pair for latent-to-embedding and embedding-to-velocity transformations
Requires an external SDXL VAE (from Stability AI) for final pixel decoding
Classifier-free guidance operates in the continuous velocity space
Aims for higher image quality with fewer sampling steps (30 ODE steps by default) than the autoregressive approach

Usage

Use this workflow to generate images from text descriptions with the JanusFlow-1.3B model. It is the preferred pipeline when you want flow-based generation quality, a configurable number of inference steps (30 by default), and a single model that handles both understanding and generation.

Execution Steps

Step 1: Model and Processor Loading

Load the pretrained JanusFlow model and its VLChatProcessor. The JanusFlow model includes flow-specific components: a ShallowUViT encoder (`vision_gen_enc_model`), a ShallowUViT decoder (`vision_gen_dec_model`), linear aligners that bridge the UViT and LLM embedding spaces, and an RMS normalization layer. Additionally, load the external SDXL VAE from Stability AI for final pixel decoding. All models are cast to bfloat16 and moved to the GPU.

Key considerations:

Use the `janus.janusflow.models` import path for JanusFlow
The SDXL VAE must be loaded separately from `stabilityai/sdxl-vae` via `diffusers.models.AutoencoderKL`
The VAE specifically requires bfloat16 precision (fp16 does not work correctly)
Both the JanusFlow model and the VAE must be on the same device
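The loading step might look like the sketch below. It assumes the `janus` package from the Janus repository is installed, that the model class is exported as `MultiModalityCausalLM`, and that the checkpoint is published under the ID `deepseek-ai/JanusFlow-1.3B`. None of these names are stated above, so check them against the repository. The helper name `load_janusflow` is hypothetical. Imports are deferred into the function so that the definition can be imported on a machine without the GPU stack.

```python
def load_janusflow(model_path="deepseek-ai/JanusFlow-1.3B",
                   vae_id="stabilityai/sdxl-vae",
                   device="cuda"):
    """Load the processor, the JanusFlow model, and the SDXL VAE (hypothetical helper)."""
    # Deferred imports: these need torch, diffusers, and the janus package.
    import torch
    from diffusers.models import AutoencoderKL
    # Assumption: class names as exported by janus.janusflow.models.
    from janus.janusflow.models import MultiModalityCausalLM, VLChatProcessor

    processor = VLChatProcessor.from_pretrained(model_path)

    model = MultiModalityCausalLM.from_pretrained(model_path)
    model = model.to(torch.bfloat16).to(device).eval()

    # The VAE is cast to bfloat16 too, since fp16 does not work correctly for it.
    vae = AutoencoderKL.from_pretrained(vae_id)
    vae = vae.to(torch.bfloat16).to(device).eval()
    return processor, model, vae
```

Returning all three objects from one place makes it easier to keep the model and the VAE on the same device and dtype.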

Step 2: Prompt Formatting

Structure the text prompt as a conversation and apply the SFT template. Append the `image_gen_tag` to signal the start of image generation, then tokenize the formatted prompt into input IDs.

Key considerations:

JanusFlow uses the `image_gen_tag` property instead of `image_start_tag`
The system prompt is set to an empty string for generation
The formatted prompt ends with a special begin-of-generation token
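A minimal sketch of the prompt formatting follows. The helper name `build_generation_prompt` is hypothetical. The method name `apply_sft_template_for_multi_turn_prompts`, the `sft_format` attribute, and the role names `User` and `Assistant` are assumptions based on the Janus repository's examples and should be checked against the installed version.

```python
def build_generation_prompt(processor, text):
    """Format `text` as a one-turn conversation and append the image-generation tag."""
    # Assumption: role names follow the repository's examples.
    conversation = [
        {"role": "User", "content": text},
        {"role": "Assistant", "content": ""},
    ]
    # Assumption: the processor exposes this templating method and `sft_format`.
    sft_prompt = processor.apply_sft_template_for_multi_turn_prompts(
        conversations=conversation,
        sft_format=processor.sft_format,
        system_prompt="",  # empty system prompt for generation
    )
    # The tag marks where image generation begins.
    prompt = sft_prompt + processor.image_gen_tag
    input_ids = processor.tokenizer.encode(prompt)
    return prompt, input_ids
```

The function accepts any object with the same attributes, so it can be tried with a small stand-in processor before the real model is loaded.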

Step 3: Input Preparation for Classifier Free Guidance

Create a batch that duplicates the token sequence for the conditional and unconditional CFG paths. The first half of the batch keeps the full prompt tokens (conditional); in the second half the prompt tokens are replaced with `pad_id` (unconditional). Embed the batch through the language model's input embedding layer, then drop the last token position (the begin-of-generation token), because a timestep embedding takes its place in the ODE loop.

Key considerations:

Batch size is `batchsize * 2` for CFG (default batchsize: 5)
The last token is trimmed because the timestep embedding replaces the begin-of-generation token
An attention mask is constructed where unconditional sequences have zeros for prompt positions
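To make the indexing explicit, the batch and mask construction can be sketched with plain Python lists; a real pipeline would use tensors. The helper name `build_cfg_batch` is hypothetical. Two details are assumptions rather than statements from the text above: the unconditional rows keep their first token (typically BOS), and the mask spans the full LLM input (kept prompt positions, one timestep position, then 576 latent positions).

```python
def build_cfg_batch(input_ids, pad_id, batchsize=5, num_latent_tokens=576):
    """Return (tokens, attention_mask) as nested lists for a CFG batch."""
    prompt_len = len(input_ids)

    # First half: conditional rows carrying the full prompt.
    cond_rows = [list(input_ids) for _ in range(batchsize)]
    # Second half: unconditional rows. Assumption: the first token is kept
    # and every later prompt token is replaced by pad_id.
    uncond_rows = [[input_ids[0]] + [pad_id] * (prompt_len - 1)
                   for _ in range(batchsize)]
    tokens = cond_rows + uncond_rows

    # After embedding, the last prompt position is dropped; the timestep
    # embedding takes its place, followed by the latent embeddings.
    kept_prompt = prompt_len - 1
    total_len = kept_prompt + 1 + num_latent_tokens

    cond_mask = [[1] * total_len for _ in range(batchsize)]
    # Unconditional rows attend to the first token, the timestep position,
    # and the latent positions, but not to the masked prompt positions.
    uncond_mask = [[1] + [0] * (kept_prompt - 1) + [1] * (1 + num_latent_tokens)
                   for _ in range(batchsize)]
    return tokens, cond_mask + uncond_mask
```

With the default `batchsize` of 5, both returned lists have 10 rows: rows 0 to 4 are conditional and rows 5 to 9 are unconditional.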

Step 4: Noise Initialization

Sample Gaussian noise in the latent space with shape `[batchsize, 4, 48, 48]` matching the SDXL VAE's latent resolution. Compute the fixed timestep increment `dt = 1.0 / num_inference_steps` as a tensor for the Euler ODE solver.

Key considerations:

The latent shape `[B, 4, 48, 48]` corresponds to the VAE's 8x downsampling from 384x384 pixel space
The 4 channels match the SDXL VAE's latent dimension
The noise is sampled in bfloat16 precision
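The shape and step-size arithmetic can be checked with a few lines of plain Python. The function names are illustrative, and the commented-out `torch.randn` call shows how the noise might then be drawn, assuming PyTorch and a CUDA device.

```python
def latent_shape(batchsize=5, image_size=384, vae_downsample=8, latent_channels=4):
    """Latent tensor shape for a square image of side `image_size`."""
    if image_size % vae_downsample != 0:
        raise ValueError("image_size must be divisible by the VAE downsampling factor")
    side = image_size // vae_downsample
    return (batchsize, latent_channels, side, side)


def euler_step_size(num_inference_steps=30):
    """Fixed timestep increment dt for a unit-length integration interval."""
    return 1.0 / num_inference_steps


shape = latent_shape()      # (5, 4, 48, 48)
dt = euler_step_size()      # 1/30
# With PyTorch, the noise itself could then be drawn with something like:
#   z = torch.randn(shape, dtype=torch.bfloat16, device="cuda")
```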

Step 5: ODE Denoising Loop

Iterate for `num_inference_steps` (default: 30) to progressively denoise the latent from pure noise to a clean sample. At each step:

Encode: Pass the current latent (duplicated for CFG) and the current timestep through the ShallowUViT encoder, producing spatial embeddings, a timestep embedding, and hidden states
Align: Reshape and project the spatial embeddings through the generation encoder aligner to match the LLM's hidden dimension
LLM Forward: Concatenate the prompt embeddings, timestep embedding, and aligned latent embeddings, then run through the LLM backbone with KV-cache (text prompt portion cached after the first step)
Decode: Extract the last 576 positions from the LLM hidden states, normalize with RMSNorm, project through the decoder aligner, reshape to spatial format, and pass through the ShallowUViT decoder along with the encoder hidden states and timestep embedding to predict the velocity field
CFG: Combine conditional and unconditional velocity predictions: `v = cfg_weight * v_cond - (cfg_weight - 1) * v_uncond`
Euler Step: Update the latent: `z = z + dt * v`
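The CFG and Euler steps can be isolated from the model in a small sketch. Here `predict_velocities` is a hypothetical callable standing in for the encode, align, LLM forward, and decode stages; it returns the conditional and unconditional velocities. The arithmetic works on plain floats as well as on tensors, so the toy usage below needs no GPU. The time convention (t running from 0 at pure noise toward 1) is an assumption.

```python
def cfg_combine(v_cond, v_uncond, cfg_weight):
    """Classifier-free guidance in velocity space."""
    return cfg_weight * v_cond - (cfg_weight - 1.0) * v_uncond


def euler_sample(predict_velocities, z, cfg_weight, num_inference_steps=30):
    """Integrate the flow ODE with fixed-size Euler steps.

    `predict_velocities(z, t)` must return the pair (v_cond, v_uncond).
    """
    dt = 1.0 / num_inference_steps
    for step in range(num_inference_steps):
        t = step / num_inference_steps  # current time in [0, 1)
        v_cond, v_uncond = predict_velocities(z, t)
        v = cfg_combine(v_cond, v_uncond, cfg_weight)
        z = z + dt * v
    return z


# Toy usage: a constant velocity field of 3.0 moves z from 0.0 to about 3.0.
z_final = euler_sample(lambda z, t: (3.0, 3.0), 0.0, cfg_weight=5.0)
```

When the conditional and unconditional predictions agree, guidance leaves the velocity unchanged; a `cfg_weight` of 1 returns the conditional prediction alone.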

What happens:

The LLM backbone serves as the core denoising network, guided by text context
The KV-cache avoids recomputing attention keys and values for the prompt positions at every step
The ShallowUViT encoder/decoder handle the latent-to-LLM and LLM-to-velocity transformations
Each step moves the latent along the learned flow trajectory from noise toward the data distribution
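In equation form, the loop is a forward Euler discretization of the flow ODE. The notation is introduced here for illustration and is not taken from the JanusFlow code: $z$ is the latent, $v_\theta$ the predicted velocity, $c$ the prompt condition, $\varnothing$ the padded (unconditional) prompt, $w$ the `cfg_weight`, and $N$ the value of `num_inference_steps`. The time grid $t_k = k/N$ assumes time runs from 0 at pure noise to 1 at the clean sample.

```latex
\begin{aligned}
\hat{v}(z, t) &= w\, v_\theta(z, t \mid c) - (w - 1)\, v_\theta(z, t \mid \varnothing)
              = v_\theta(z, t \mid \varnothing) + w \big( v_\theta(z, t \mid c) - v_\theta(z, t \mid \varnothing) \big), \\
\frac{dz_t}{dt} &= \hat{v}(z_t, t), \qquad z_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1], \\
z_{k+1} &= z_k + \Delta t\, \hat{v}(z_k, t_k), \qquad \Delta t = \frac{1}{N}, \quad t_k = \frac{k}{N}, \quad k = 0, \dots, N - 1.
\end{aligned}
```

The second form of $\hat{v}$ shows that guidance extrapolates from the unconditional velocity in the direction of the conditional one.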

Step 6: VAE Decoding

Pass the final denoised latent through the SDXL VAE decoder to produce pixel-space images. The latent is first divided by the VAE's scaling factor before decoding.

Key considerations:

The VAE scaling factor is read from `vae.config.scaling_factor`
The output is in [-1, 1] range and needs rescaling for display
The VAE operates on the full batch simultaneously
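The decoding call can be sketched as below. The helper name `decode_latents` is hypothetical; it relies on the `diffusers` `AutoencoderKL` interface, where `decode(...)` returns an output object whose `.sample` attribute holds the decoded images.

```python
def decode_latents(vae, z):
    """Undo the VAE latent scaling and decode to pixel space.

    `vae` is expected to behave like diffusers' AutoencoderKL: it exposes
    `config.scaling_factor`, and `decode(...)` returns an object with `.sample`.
    """
    scaled = z / vae.config.scaling_factor
    return vae.decode(scaled).sample
```

In a real pipeline this call would normally run without gradient tracking, for example under `torch.inference_mode()`.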

Step 7: Post Processing and Saving

Clip the decoded pixel values to [-1, 1], rescale to [0, 1] for saving, and convert to uint8 format. Save the generated images using `torchvision.utils.save_image` or convert to PIL Images for display.

Key considerations:

Images come out at 384x384 by default, which is 8x the 48x48 latent resolution
The Gradio demo upscales to 1024x1024 via Lanczos resampling for display
Multiple images from the batch can be saved individually
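The clip, rescale, and quantize arithmetic is shown below for plain Python numbers; the function names are illustrative. On a tensor, the first two stages would be something like `decoded.clip(-1.0, 1.0) * 0.5 + 0.5` (an assumption about how the pipeline writes it), after which `torchvision.utils.save_image` accepts the [0, 1] result directly.

```python
def to_uint8(value):
    """Map one decoded pixel value from [-1, 1] to an integer in [0, 255]."""
    clipped = max(-1.0, min(1.0, value))   # clip to [-1, 1]
    unit = clipped * 0.5 + 0.5             # rescale to [0, 1]
    return int(unit * 255.0 + 0.5)         # round to the nearest integer level


def image_to_uint8(pixels):
    """Apply to_uint8 to a nested list shaped [channels][height][width]."""
    return [[[to_uint8(v) for v in row] for row in channel] for channel in pixels]
```

Values outside [-1, 1] saturate at 0 or 255 instead of wrapping around, which is why the clip comes before the rescale.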
