Heuristic: Hugging Face Diffusers Dtype Precision Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-13 21:00 GMT |
Overview
Precision selection rules for Diffusers inference and training: float16 is the default for inference, bfloat16 for Ampere+ training, with mandatory float32 casting for FFT operations and non-power-of-2 images.
Description
Diffusers uses half precision (float16 or bfloat16) for most operations but has specific situations that require float32 casting. The codebase contains explicit dtype guards: FFT operations (used in FreeU) do not support bfloat16 and require float32 casting; non-power-of-2 image dimensions also require float32 for FFT. Flash Attention and SageAttention backends require bfloat16 or float16 inputs — float32 will fail. Random tensor generation has a device interaction: a CPU generator used for GPU tensors triggers a performance warning and a CPU-to-GPU transfer on every draw. MPS (Apple Silicon) has known issues with GELU activation at float16 precision in PyTorch < 2.0, requiring a float32 workaround.
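The FFT dtype guard described above can be sketched as a FreeU-style Fourier filter that upcasts before the transform. This is a hypothetical helper, not the Diffusers implementation; the upcast conditions (half-precision dtype or non-power-of-2 spatial dims) follow the rules in this card.

```python
import torch


def fourier_filter_safe(x: torch.Tensor, scale: float = 0.9) -> torch.Tensor:
    """Scale low frequencies of an image batch, upcasting when needed.

    torch.fft does not support bfloat16, so half-precision inputs are cast
    to float32 before the FFT and cast back afterwards. Non-power-of-2
    spatial dimensions also take the float32 path.
    """
    orig_dtype = x.dtype
    H, W = x.shape[-2:]
    needs_upcast = (
        orig_dtype in (torch.float16, torch.bfloat16)
        or (H & (H - 1)) != 0   # H is not a power of 2
        or (W & (W - 1)) != 0   # W is not a power of 2
    )
    if needs_upcast:
        x = x.to(torch.float32)

    x_freq = torch.fft.fftn(x, dim=(-2, -1))
    x_freq = torch.fft.fftshift(x_freq, dim=(-2, -1))

    # Attenuate a small central (low-frequency) window.
    mask = torch.ones_like(x_freq.real)
    ch, cw = H // 2, W // 2
    mask[..., ch - 1:ch + 1, cw - 1:cw + 1] = scale
    x_freq = x_freq * mask

    x_freq = torch.fft.ifftshift(x_freq, dim=(-2, -1))
    x = torch.fft.ifftn(x_freq, dim=(-2, -1)).real
    return x.to(orig_dtype)
```

Calling this with a bfloat16 tensor returns a bfloat16 result, with the transform itself computed in float32.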
Usage
Relevant when choosing pipeline precision (`pipe.to(torch.float16)` vs `torch.bfloat16`), debugging visual artifacts (NaN or color issues from dtype mismatches), or optimizing training precision (mixed precision with Accelerate).
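The precision choice above can be wrapped in a small capability check. A minimal sketch, assuming `pick_inference_dtype` is a hypothetical helper (the model name in the comment is only an example):

```python
import torch


def pick_inference_dtype() -> torch.dtype:
    """Choose an inference dtype based on hardware capability."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16  # Ampere+: wider dynamic range than fp16
    if torch.cuda.is_available():
        return torch.float16   # older CUDA GPUs
    return torch.float32       # CPU/MPS fallback: stay in full precision


# Usage with a Diffusers pipeline:
# from diffusers import DiffusionPipeline
# pipe = DiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-1", torch_dtype=pick_inference_dtype()
# )
```

Passing `torch_dtype` at load time avoids materializing float32 weights first, which `pipe.to(torch.float16)` after loading would require.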
The Insight (Rule of Thumb)
- Inference: Use `torch.float16` for most pipelines. Use `torch.bfloat16` on Ampere+ GPUs for wider dynamic range.
- Training: Use `bf16` mixed precision via Accelerate on Ampere+ GPUs. Fall back to `fp16` on older GPUs.
- FFT operations (FreeU): Always cast to `float32` — bfloat16 is not supported by `torch.fft.fftn`.
- Non-power-of-2 images: Cast to `float32` before FFT (Fourier filter).
- Flash/Sage Attention: Inputs must be bfloat16 or float16. Float32 will raise errors.
- CPU generators for GPU tensors: The tensor is created on CPU first, then moved to the GPU (performance warning).
- MPS + GELU: In PyTorch < 2.0, cast to float32 for GELU activation on Apple Silicon.
- TF32: Enable `torch.backends.cuda.matmul.allow_tf32 = True` for faster Ampere training (set in example scripts).
- Deterministic mode: Disables TF32 via `torch.backends.cuda.matmul.allow_tf32 = False`.
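The TF32 and deterministic-mode bullets above reduce to a pair of backend flags; a minimal sketch:

```python
import torch

# Enable TF32 matmuls for faster float32 training on Ampere+ GPUs.
# TF32 keeps float32 dynamic range but rounds the mantissa to 10 bits,
# trading a small amount of precision for large tensor-core speedups.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# For bit-exact reproducible runs, turn TF32 back off:
# torch.backends.cuda.matmul.allow_tf32 = False
# torch.backends.cudnn.allow_tf32 = False
```

The flags are process-global, so set them once at startup before any matmuls run.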
Reasoning
The FFT casting guard exists because `torch.fft.fftn` has limited dtype support — bfloat16 inputs cause runtime failures rather than correct results. The non-power-of-2 guard exists because half-precision FFT kernels only handle power-of-2 transform sizes; other sizes fall back to Bluestein's algorithm, which has different precision requirements and needs float32.
Flash Attention's bf16/fp16 requirement comes from its hardware-optimized kernels that operate on half-precision tensor cores. The performance benefit comes precisely from avoiding float32 computation paths.
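Because the kernels reject float32 outright, it helps to validate inputs before dispatch. A hypothetical guard (not the Diffusers or flash-attn API) that surfaces a clear error instead of a kernel failure:

```python
import torch

HALF_DTYPES = (torch.float16, torch.bfloat16)


def assert_fast_attention_dtype(*tensors: torch.Tensor) -> None:
    """Reject float32 inputs before dispatching to Flash/Sage attention.

    These backends run on half-precision tensor cores, so float32 tensors
    must be cast down first or routed to the math/SDPA fallback.
    """
    for t in tensors:
        if t.dtype not in HALF_DTYPES:
            raise ValueError(
                f"Flash/Sage attention requires fp16 or bf16, got {t.dtype}; "
                "cast with .to(torch.float16) or use the math backend."
            )
```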
The CPU generator warning is a common performance pitfall: when users create a `torch.Generator()` (defaults to CPU) but target GPU tensors, every random generation involves a CPU-to-GPU transfer. Creating the generator on the target device avoids this overhead.
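The fix is to create the generator on the target device so random values are produced where the tensor lives; a minimal sketch:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Device-local generator: no per-draw CPU-to-GPU copy.
generator = torch.Generator(device=device).manual_seed(42)
latents = torch.randn(1, 4, 64, 64, generator=generator, device=device)
```

`torch.Generator()` with no argument defaults to CPU, which is what triggers the transfer warning when the target tensors are on GPU.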