Heuristic: Hugging Face Diffusers Dtype Precision Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Precision selection rules for Diffusers inference and training: float16 is the default for inference, bfloat16 for Ampere+ training, with mandatory float32 casting for FFT operations and non-power-of-2 images.
Description
Diffusers uses half precision (float16 or bfloat16) for most operations but has specific situations that require float32 casting. The codebase contains explicit dtype guards: FFT operations (used in FreeU) do not support bfloat16 and require float32 casting; non-power-of-2 image dimensions also require float32 for FFT. Flash Attention and SageAttention backends require bfloat16 or float16 inputs — float32 will fail. Random tensor generation has a device interaction: a CPU generator used for GPU tensors triggers a performance warning and a CPU-to-GPU transfer on every draw. MPS (Apple Silicon) has known issues with GELU activation at float16 precision in PyTorch < 2.0, requiring a float32 workaround.
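The FFT dtype guard described above can be sketched as a FreeU-style Fourier filter that upcasts before the transform. This is a hypothetical helper, not the Diffusers implementation; the upcast conditions (half-precision dtype or non-power-of-2 spatial dims) follow the rules in this card.

```python
import torch


def fourier_filter_safe(x: torch.Tensor, scale: float = 0.9) -> torch.Tensor:
    """Scale low frequencies of an image batch, upcasting when needed.

    torch.fft does not support bfloat16, so half-precision inputs are cast
    to float32 before the FFT and cast back afterwards. Non-power-of-2
    spatial dimensions also take the float32 path.
    """
    orig_dtype = x.dtype
    H, W = x.shape[-2:]
    needs_upcast = (
        orig_dtype in (torch.float16, torch.bfloat16)
        or (H & (H - 1)) != 0   # H is not a power of 2
        or (W & (W - 1)) != 0   # W is not a power of 2
    )
    if needs_upcast:
        x = x.to(torch.float32)

    x_freq = torch.fft.fftn(x, dim=(-2, -1))
    x_freq = torch.fft.fftshift(x_freq, dim=(-2, -1))

    # Attenuate a small central (low-frequency) window.
    mask = torch.ones_like(x_freq.real)
    ch, cw = H // 2, W // 2
    mask[..., ch - 1:ch + 1, cw - 1:cw + 1] = scale
    x_freq = x_freq * mask

    x_freq = torch.fft.ifftshift(x_freq, dim=(-2, -1))
    x = torch.fft.ifftn(x_freq, dim=(-2, -1)).real
    return x.to(orig_dtype)
```

Calling this with a bfloat16 tensor returns a bfloat16 result, with the transform itself computed in float32.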
Usage
Relevant when choosing pipeline precision (`pipe.to(torch.float16)` vs `torch.bfloat16`), debugging visual artifacts (NaN or color issues from dtype mismatches), or optimizing training precision (mixed precision with Accelerate).
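The precision choice above can be wrapped in a small capability check. A minimal sketch, assuming `pick_inference_dtype` is a hypothetical helper (the model name in the comment is only an example):

```python
import torch


def pick_inference_dtype() -> torch.dtype:
    """Choose an inference dtype based on hardware capability."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16  # Ampere+: wider dynamic range than fp16
    if torch.cuda.is_available():
        return torch.float16   # older CUDA GPUs
    return torch.float32       # CPU/MPS fallback: stay in full precision


# Usage with a Diffusers pipeline:
# from diffusers import DiffusionPipeline
# pipe = DiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-1", torch_dtype=pick_inference_dtype()
# )
```

Passing `torch_dtype` at load time avoids materializing float32 weights first, which `pipe.to(torch.float16)` after loading would require.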
The Insight (Rule of Thumb)
- Inference: Use `torch.float16` for most pipelines. Use `torch.bfloat16` on Ampere+ GPUs for wider dynamic range.
- Training: Use `bf16` mixed precision via Accelerate on Ampere+ GPUs. Fall back to `fp16` on older GPUs.
- FFT operations (FreeU): Always cast to `float32` — bfloat16 is not supported by `torch.fft.fftn`.
- Non-power-of-2 images: Cast to `float32` before FFT (Fourier filter).
- Flash/Sage Attention: Inputs must be bfloat16 or float16. Float32 will raise errors.
- CPU generators for GPU tensors: The tensor is created on CPU first, then moved to the GPU (performance warning).
- MPS + GELU: In PyTorch < 2.0, cast to float32 for GELU activation on Apple Silicon.
- TF32: Enable `torch.backends.cuda.matmul.allow_tf32 = True` for faster Ampere training (set in example scripts).
- Deterministic mode: Disables TF32 via `torch.backends.cuda.matmul.allow_tf32 = False`.
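The TF32 and deterministic-mode bullets above reduce to a pair of backend flags; a minimal sketch:

```python
import torch

# Enable TF32 matmuls for faster float32 training on Ampere+ GPUs.
# TF32 keeps float32 dynamic range but rounds the mantissa to 10 bits,
# trading a small amount of precision for large tensor-core speedups.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# For bit-exact reproducible runs, turn TF32 back off:
# torch.backends.cuda.matmul.allow_tf32 = False
# torch.backends.cudnn.allow_tf32 = False
```

The flags are process-global, so set them once at startup before any matmuls run.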
Reasoning
The FFT casting guard exists because `torch.fft.fftn` has limited dtype support — bfloat16 inputs cause runtime failures rather than correct results. The non-power-of-2 guard exists because half-precision FFT kernels only handle power-of-2 transform sizes; other sizes fall back to Bluestein's algorithm, which has different precision requirements and needs float32.
Flash Attention's bf16/fp16 requirement comes from its hardware-optimized kernels that operate on half-precision tensor cores. The performance benefit comes precisely from avoiding float32 computation paths.
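Because the kernels reject float32 outright, it helps to validate inputs before dispatch. A hypothetical guard (not the Diffusers or flash-attn API) that surfaces a clear error instead of a kernel failure:

```python
import torch

HALF_DTYPES = (torch.float16, torch.bfloat16)


def assert_fast_attention_dtype(*tensors: torch.Tensor) -> None:
    """Reject float32 inputs before dispatching to Flash/Sage attention.

    These backends run on half-precision tensor cores, so float32 tensors
    must be cast down first or routed to the math/SDPA fallback.
    """
    for t in tensors:
        if t.dtype not in HALF_DTYPES:
            raise ValueError(
                f"Flash/Sage attention requires fp16 or bf16, got {t.dtype}; "
                "cast with .to(torch.float16) or use the math backend."
            )
```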
The CPU generator warning is a common performance pitfall: when users create a `torch.Generator()` (defaults to CPU) but target GPU tensors, every random generation involves a CPU-to-GPU transfer. Creating the generator on the target device avoids this overhead.
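The fix is to create the generator on the target device so random values are produced where the tensor lives; a minimal sketch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Device-local generator: no per-draw CPU-to-GPU copy.
generator = torch.Generator(device=device).manual_seed(42)
latents = torch.randn(1, 4, 64, 64, generator=generator, device=device)
```

`torch.Generator()` with no argument defaults to CPU, which is what triggers the transfer warning when the target tensors are on GPU.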