Heuristic:Huggingface Diffusers VAE Scaling Factors
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Diffusion_Models |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Model-specific constants for VAE latent space normalization: Stable Diffusion 1.x/2.x uses 0.18215, while newer architectures use their own scaling factors, ranging from 0.13235 to 1.03682 among the models listed here.
Description
The VAE scaling factor rescales the latent distribution to approximately unit variance, which the denoising process depends on because the noise schedule assumes latents on that scale. The default Stable Diffusion v1/v2 scaling factor of 0.18215 was computed from training data statistics and is hardcoded in `single_file_utils.py`. When converting checkpoints from non-Diffusers formats, the correct factor must be identified and set in the model config. An incorrect scaling factor produces garbled, washed-out, or oversaturated images.
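Additionally, image dimensions must be divisible by 8 (the VAE spatial downsampling factor); otherwise the latent grid does not map back to the requested size, which shows up as shape mismatches or artifacts. VAE tiling uses a default 25% overlap between adjacent tiles and blends across that overlap to reduce visible seams.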
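As a rough illustration of how an overlap region hides seams, the sketch below cross-fades two 1-D rows of values over a blend region sized at 25% of the tile. The names `blend_extent` and `blend_rows` are hypothetical, and this is a simplified stand-in for blending along one axis, not the library's implementation.

```python
TILE_OVERLAP_FACTOR = 0.25  # default overlap stated on this page


def blend_extent(tile_size, overlap=TILE_OVERLAP_FACTOR):
    """Number of samples over which adjacent tiles are cross-faded."""
    return int(tile_size * overlap)


def blend_rows(prev_tile, next_tile, extent):
    """Linearly fade from the tail of prev_tile into the head of next_tile."""
    extent = min(extent, len(prev_tile), len(next_tile))
    blended = list(next_tile)
    for i in range(extent):
        weight = i / extent  # 0.0 at the start of the overlap, rising toward 1.0
        blended[i] = prev_tile[-extent + i] * (1 - weight) + next_tile[i] * weight
    return blended


extent = blend_extent(64)   # 16 samples for a 64-wide tile
left = [1.0] * 64           # stand-in for the previous tile's values
right = [0.0] * 64          # stand-in for the next tile's values
out = blend_rows(left, right, extent)
print(extent, out[0], out[8], out[16])  # 16 1.0 0.5 0.0
```

The first sample of the blended row matches the previous tile, the midpoint of the overlap is an even mix, and samples past the overlap are left untouched.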
Usage
Relevant when converting checkpoints from external formats (CivitAI, original CompVis, etc.), debugging image quality issues (washed out or oversaturated images), or implementing custom pipelines that need to manually scale latents.
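For the custom-pipeline case, the usual convention is to multiply by the scaling factor after encoding and divide by it before decoding. The sketch below shows that round trip on plain Python lists; `scale_latents` and `unscale_latents` are hypothetical helper names, and the lists stand in for latent tensors. In Diffusers the value is normally read from the VAE config (`scaling_factor`) rather than hardcoded.

```python
SD_SCALING_FACTOR = 0.18215  # SD 1.x/2.x value from this page


def scale_latents(raw_latents, scaling_factor=SD_SCALING_FACTOR):
    """Apply after VAE encoding, before noise is added or denoising starts."""
    return [z * scaling_factor for z in raw_latents]


def unscale_latents(latents, scaling_factor=SD_SCALING_FACTOR):
    """Apply after denoising, before the latents go to the VAE decoder."""
    return [z / scaling_factor for z in latents]


raw = [5.49, -10.98, 0.0]            # stand-in for encoder output
scaled = scale_latents(raw)          # roughly unit scale
restored = unscale_latents(scaled)   # back on the encoder's scale
print([round(z, 3) for z in scaled])  # [1.0, -2.0, 0.0]
```

Forgetting either step, or using a factor from a different model family, leaves the latents on the wrong scale for the denoiser or the decoder.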
The Insight (Rule of Thumb)
- Standard SD 1.x/2.x: `vae_scaling_factor = 0.18215`
- Playground: `vae_scaling_factor = 0.5`
- HunyuanVideo: `vae_scaling_factor = 0.476986`
- Allegro: `vae_scaling_factor = 0.13235`
- HunyuanImage Refiner: `vae_scaling_factor = 1.03682`
- MagVIT: `vae_scaling_factor = 0.7125`
- DC (Deep Compression): `vae_scaling_factor = 1.0`
- Image dimensions: Must be divisible by 8 (VAE downsampling factor)
- Tiling overlap: Default 25% overlap factor between tiles (e.g., `blend_height = tile_height * 0.25`)
- Trade-off: A wrong scaling factor produces garbled or discolored images; this is a common source of bugs in checkpoint conversions.
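One way to guard a conversion script against the wrong constant is a small lookup keyed by model family. The sketch below collects the values listed above; the dictionary keys and the `check_scaling_factor` helper are hypothetical names chosen for illustration, not identifiers from the library.

```python
# Values copied from the list above; keys are illustrative labels only.
VAE_SCALING_FACTORS = {
    "sd-1.x/2.x": 0.18215,
    "playground": 0.5,
    "hunyuan-video": 0.476986,
    "allegro": 0.13235,
    "hunyuan-image-refiner": 1.03682,
    "magvit": 0.7125,
    "dc-ae": 1.0,
}


def check_scaling_factor(model_family, configured, tol=1e-6):
    """Return True if the configured factor matches the expected one."""
    expected = VAE_SCALING_FACTORS[model_family]
    return abs(configured - expected) <= tol


# A converted Allegro checkpoint left at the SD default would be flagged.
print(check_scaling_factor("allegro", 0.18215))  # False
print(check_scaling_factor("allegro", 0.13235))  # True
```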
Reasoning
The VAE encoder maps pixel space to a latent space with a learned distribution. Without normalization, the latent variance can be far from 1, so the noise added by the schedule is mis-scaled relative to the signal. The scaling factor is `1 / std(latents)`, computed empirically on training data. For SD 1.x/2.x, this value is 0.18215 (approximately `1 / 5.49`, i.e. a raw latent standard deviation of about 5.49). Newer models have different latent statistics because their VAEs differ: HunyuanVideo and Allegro use 3D VAEs for video, and DC-AE targets much higher compression, hence the different scaling factors.
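The `1 / std(latents)` relationship can be checked numerically. The sketch below uses synthetic Gaussian samples with a standard deviation of 5.49 as a stand-in for real encoder outputs (an assumption made purely for illustration; real latents come from encoding training images).

```python
import random
import statistics

random.seed(0)
# Synthetic stand-in for flattened encoder outputs with std close to 5.49.
raw_latents = [random.gauss(0.0, 5.49) for _ in range(50_000)]

# Empirical scaling factor: the reciprocal of the observed standard deviation.
scaling_factor = 1.0 / statistics.pstdev(raw_latents)
scaled = [z * scaling_factor for z in raw_latents]

print(round(scaling_factor, 3))             # close to 0.182
print(round(statistics.pstdev(scaled), 3))  # 1.0
```

After scaling, the sample standard deviation is 1 by construction, which is the property the denoiser relies on.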
The divisible-by-8 requirement comes from the VAE's downsampling architecture: three downsampling stages each halve the spatial dimensions, for a total factor of 2^3 = 8. Dimensions that are not multiples of 8 cannot be halved cleanly three times, which leads to shape mismatches in the decoder.
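A minimal sketch of the dimension arithmetic, assuming three halving stages as described above. The helpers `latent_size` and `round_down` are hypothetical names, not library functions.

```python
VAE_DOWNSAMPLE_STAGES = 3
VAE_SCALE = 2 ** VAE_DOWNSAMPLE_STAGES  # 8


def latent_size(height, width, scale=VAE_SCALE):
    """Return the latent grid size, rejecting dimensions that do not divide evenly."""
    if height % scale or width % scale:
        raise ValueError(
            f"height and width must be divisible by {scale}, got {height}x{width}"
        )
    return height // scale, width // scale


def round_down(value, scale=VAE_SCALE):
    """Snap a pixel dimension down to the nearest multiple of the VAE scale."""
    return value - value % scale


print(latent_size(512, 768))  # (64, 96)
print(round_down(515))        # 512
```

Validating or snapping the requested size before encoding avoids discovering the mismatch later in the decoder.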