
Heuristic:Huggingface Transformers Mixed Precision Training Selection

From Leeroopedia
Domains: Optimization, Training, Precision
Last Updated: 2026-02-13 20:00 GMT

Overview

Decision framework for choosing between bf16 and fp16 mixed precision training based on GPU architecture and model requirements.

Description

Mixed precision training uses lower-precision floating point formats (bf16 or fp16) for most operations while keeping a master copy of the weights in fp32 for numerical stability. The choice between bf16 (Brain Floating Point 16) and fp16 (IEEE half precision) depends on hardware support and model architecture. bf16 keeps fp32's full 8-bit exponent range (avoiding overflow/underflow issues) but has only 7 mantissa bits; fp16 has a narrower 5-bit exponent range, which requires loss scaling, but 10 mantissa bits of precision. The Trainer also supports tf32 mode, which uses TensorFloat-32 for matrix multiplications on Ampere and newer GPUs.
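The range-versus-precision trade-off can be inspected directly with `torch.finfo`; a minimal sketch:

```python
import torch

# Compare the numeric properties of the three formats discussed above.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  eps={info.eps:.3e}")

# fp32 and bf16 share the same ~3.4e38 dynamic range (8 exponent bits),
# while fp16 tops out at 65504 (5 exponent bits). fp16's smaller eps
# reflects its extra mantissa bits (10 vs. bf16's 7).
```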

Usage

Apply this decision framework when configuring TrainingArguments for any training or fine-tuning run. The wrong choice can cause NaN losses, training instability, or suboptimal performance.

The Insight (Rule of Thumb)

  • Action 1 (Preferred): Set bf16=True if your GPU supports it (Ampere A100, Hopper H100, or newer).
  • Action 2 (Fallback): Set fp16=True if bf16 is not available (older V100, T4, RTX 20xx GPUs).
  • Action 3 (Bonus): Set tf32=True on Ampere+ GPUs for additional speedup with near-fp32 accuracy in matmuls.
  • Trade-off: Both bf16 and fp16 cut memory usage roughly in half and speed up training. bf16 is more stable but only available on newer GPUs. fp16 requires loss scaling (handled automatically by Trainer).
  • Warning: The Trainer warns when torch_compile is enabled but the GPU is not Ampere or higher, as torchdynamo speedups mostly require Ampere+ architecture.
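The actions above can be condensed into a small helper. `choose_precision_flags` is a hypothetical name for illustration; the returned dict is meant to be splatted into `TrainingArguments(**flags, ...)`:

```python
def choose_precision_flags(cuda_available: bool, bf16_supported: bool) -> dict:
    """Map GPU capability to TrainingArguments precision flags.

    bf16_supported corresponds to torch.cuda.is_bf16_supported(), which
    is True on Ampere (A100), Hopper (H100), and newer GPUs.
    """
    if cuda_available and bf16_supported:
        # Preferred path: bf16 for stability, tf32 for extra matmul speed.
        return {"bf16": True, "fp16": False, "tf32": True}
    if cuda_available:
        # Fallback for pre-Ampere GPUs (V100, T4, RTX 20xx): fp16 with
        # loss scaling handled automatically by the Trainer.
        return {"bf16": False, "fp16": True, "tf32": None}
    # CPU-only: plain fp32 training.
    return {"bf16": False, "fp16": False, "tf32": None}
```

On a CUDA machine the inputs would typically come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`.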

Reasoning

bf16 avoids the overflow/underflow problems of fp16 because it maintains the full fp32 exponent range (8 bits), making it inherently more stable for training. fp16 has only 5 exponent bits, so gradients can easily overflow during training, requiring a loss scaler. On GPUs that support both, bf16 is strictly preferred because it eliminates an entire class of numerical issues. TF32 mode is orthogonal and can be combined with either; it uses Tensor Cores to accelerate fp32 matrix multiplications with TF32 precision (10-bit mantissa) at no memory cost.
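The overflow asymmetry is easy to demonstrate: a minimal sketch casting a value that exceeds fp16's 65504 ceiling but sits comfortably inside the bf16/fp32 range:

```python
import torch

# 70000 fits in fp32 and bf16 (range up to ~3.4e38) but exceeds fp16's
# maximum of 65504, so the fp16 cast overflows to inf. This is the class
# of failure that loss scaling exists to work around.
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf (overflow)
print(x.to(torch.bfloat16))  # finite, but coarsely rounded (7 mantissa bits)
```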

Code Evidence

torch_compile GPU architecture warning from src/transformers/training_args.py:1585-1588:

else:
    logger.warning(
        "The speedups for torchdynamo mostly come with GPU Ampere or higher and which is not detected here."
    )

tf32 enable logic from src/transformers/training_args.py:1582-1584:

if is_torch_cuda_available() and torch.cuda.get_device_capability()[0] >= 8:
    if self.tf32 is not None:
        enable_tf32(True)
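Outside the Trainer, the same effect can be obtained through PyTorch's public backend flags (a sketch of the standard PyTorch API, not a claim about the Trainer's exact internals):

```python
import torch

# TF32 only applies on Ampere+ (compute capability >= 8); the flags are
# harmless no-ops on other hardware.
if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
    torch.backends.cuda.matmul.allow_tf32 = True  # fp32 matmuls via TF32 Tensor Cores
    torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions likewise
```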
