Heuristic: AUTOMATIC1111 Stable Diffusion WebUI GTX 16 Series FP16 Workaround
| Knowledge Sources | |
|---|---|
| Domains | Hardware, Debugging |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
Workaround enabling fp16 inference on NVIDIA GTX 16 series GPUs (Turing, compute capability 7.5) by enabling cuDNN benchmark mode and disabling CUDA autocast in favor of manual dtype casting.
Description
NVIDIA GTX 16 series cards (GTX 1650, 1660, etc.) have CUDA compute capability 7.5 but lack the full fp16 Tensor Core support found in RTX cards. This causes CUDA autocast to produce incorrect results or NaN values. The WebUI detects these cards by checking both the compute capability (7, 5) and the device name prefix "NVIDIA GeForce GTX 16", then applies two compensating measures: enabling `torch.backends.cudnn.benchmark = True` and routing all dtype casting through a manual casting path instead of `torch.autocast`.
Usage
This workaround is applied automatically when a GTX 16 series GPU is detected; users do not need to take any action. If NaN outputs still occur, escalate with the `--upcast-sampling` or `--no-half` command-line flags.
The Insight (Rule of Thumb)
- Action: Enable `torch.backends.cudnn.benchmark = True` and use `manual_cast()` instead of `torch.autocast("cuda")` for GTX 16 series cards.
- Value: Detected by compute capability == (7, 5) AND device name starting with "NVIDIA GeForce GTX 16".
- Trade-off: cuDNN benchmark mode adds a one-time overhead for each unique tensor shape as cuDNN searches for the optimal algorithm. Subsequent calls with the same shape are faster.
- Scope: Affects all inference precision handling. The manual cast path casts inputs to target dtype, runs the operation, then casts results back.
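The manual cast path described above (cast inputs to the target dtype, run the operation, cast results back) can be sketched as follows. This is an illustrative stand-in using NumPy, not the WebUI's actual `manual_cast` implementation; the function name `manual_cast_op` is hypothetical.

```python
import numpy as np

def manual_cast_op(op, *tensors, target_dtype=np.float16, out_dtype=np.float32):
    # Illustrative sketch (not the WebUI's manual_cast): cast every input
    # to the low-precision target dtype, run the operation there, then
    # cast the result back to the caller's working dtype.
    cast_inputs = [t.astype(target_dtype) for t in tensors]
    result = op(*cast_inputs)
    return result.astype(out_dtype)

a = np.ones((2, 2), dtype=np.float32)
b = np.full((2, 2), 2.0, dtype=np.float32)

# The matmul itself runs in float16, but the caller sees float32 again.
out = manual_cast_op(np.matmul, a, b)
```

The point of casting back after each operation is that the rest of the pipeline never observes the reduced precision, which is what gives this path finer-grained control than a blanket `torch.autocast` region.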
Reasoning
GTX 16 series GPUs use the Turing architecture without dedicated fp16 Tensor Cores (unlike RTX 20/30/40 series). PyTorch's `torch.autocast("cuda")` assumes full fp16 support and can produce NaN outputs on these cards. The cuDNN benchmark mode enables a broader range of algorithm choices that happen to work correctly with fp16 on these cards. The manual casting approach provides fine-grained dtype control that avoids the problematic autocast code paths.
This is a detection heuristic rather than a capability check: the specific combination of compute capability 7.5 AND "GTX 16" name prefix is needed because other Turing cards (like the T4 data center GPU) have the same compute capability but full fp16 support.
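The two-part check can be illustrated with a small standalone function (hypothetical name, mirroring the logic of `cuda_no_autocast` without requiring a GPU): capability alone would misclassify the T4, so the name prefix is also required.

```python
def is_gtx16_no_autocast(capability: tuple, name: str) -> bool:
    # Both conditions must hold: compute capability 7.5 AND a GTX 16
    # series device name. Either one alone is insufficient.
    return capability == (7, 5) and name.startswith("NVIDIA GeForce GTX 16")

is_gtx16_no_autocast((7, 5), "NVIDIA GeForce GTX 1660 SUPER")  # True
is_gtx16_no_autocast((7, 5), "Tesla T4")                       # False: same capability, full fp16
is_gtx16_no_autocast((8, 6), "NVIDIA GeForce RTX 3060")        # False: different capability
```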
Code Evidence
Detection logic from `modules/devices.py:26-32`:
```python
def cuda_no_autocast(device_id=None) -> bool:
    if device_id is None:
        device_id = get_cuda_device_id()
    return (
        torch.cuda.get_device_capability(device_id) == (7, 5)
        and torch.cuda.get_device_name(device_id).startswith("NVIDIA GeForce GTX 16")
    )
```
Benchmark mode activation from `modules/devices.py:101-107`:
```python
def enable_tf32():
    if torch.cuda.is_available():
        # enabling benchmark option seems to enable a range of cards to do fp16
        # when they otherwise can't
        # see https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4407
        if cuda_no_autocast():
            torch.backends.cudnn.benchmark = True
```
Manual cast routing from `modules/devices.py:228-229`:
```python
if has_xpu() or has_mps() or cuda_no_autocast():
    return manual_cast(dtype)
```