Principle:Hiyouga LLaMA Factory FP8 Mixed Precision

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Deep Learning, Hardware Optimization
Last Updated	2026-02-06 19:00 GMT

Overview

FP8 mixed precision is a training efficiency technique that uses 8-bit floating-point arithmetic for the matrix multiplications in linear layers. On supported hardware (NVIDIA Hopper, Ada Lovelace) this offers roughly double the peak matrix-multiply throughput of BF16, while selective layer conversion and scaling strategies help maintain model quality.

Description

Traditional mixed-precision training uses FP16 or BF16 for forward and backward passes with FP32 master weights. FP8 takes this further by quantizing the inputs and weights of linear layers to 8-bit floating point, enabling the use of specialized FP8 tensor cores that deliver approximately 2x the throughput of BF16 tensor cores.

However, FP8 has a significantly narrower dynamic range than FP16/BF16, requiring careful management of scaling factors to prevent overflow or underflow. LLaMA-Factory's FP8 implementation supports two backends:

TorchAO backend (default): Uses PyTorch's native FP8 support via torchao.float8.Float8LinearConfig with rowwise scaling for optimal kernel performance. This backend computes the scaling factors automatically and is integrated through HuggingFace Accelerate's AORecipeKwargs.
Transformer Engine (TE) backend: Uses NVIDIA's Transformer Engine library with HYBRID FP8 format (combining E4M3 for forward and E5M2 for backward), AMAX-based dynamic scaling with a configurable history length, and integration through Accelerate's FP8RecipeKwargs.

The implementation applies FP8 conversion selectively using a module filter function that:

Skips embedding layers, language model heads, and classifier layers for numerical stability.
Only converts nn.Linear layers with 2D weight matrices.
Verifies that both input and output feature dimensions are divisible by 16, a requirement of FP8 GEMM kernels.

Usage

FP8 mixed precision should be used when:

Training on NVIDIA H100/H200 (Hopper) or L40/RTX 4090 (Ada Lovelace) GPUs that have dedicated FP8 tensor cores.
Training throughput is a bottleneck and model quality degradation from reduced precision is acceptable or manageable.
Training uses FSDP, where the FP8 all-gather optimization can further reduce communication bandwidth by transmitting FP8 tensors instead of BF16.

Enable via the fp8 flag in training arguments, with optional fp8_backend selection ("auto", "torchao", or "te").
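As a rough illustration, the FP8-related arguments could be collected as below. Only the three keys are taken from this page; the variable name and the use of a plain dict are assumptions made for the sketch, and a real run also needs the usual model and dataset arguments.

```python
# Hypothetical fragment: FP8-related training arguments only.
fp8_args = {
    "fp8": True,                                  # enable FP8 mixed precision
    "fp8_backend": "auto",                        # or "torchao" / "te"
    "fp8_enable_fsdp_float8_all_gather": False,   # only meaningful when training with FSDP
}
```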

Theoretical Basis

FP8 uses 8 bits to represent floating-point numbers in two formats:

Format	Exponent Bits	Mantissa Bits	Representable Range	Precision	Typical Use
E4M3	4	3	$\pm 448$	Higher	Forward pass
E5M2	5	2	$\pm 57344$	Lower	Backward pass (gradients)

The HYBRID format uses E4M3 for forward activations and weights (where precision matters more) and E5M2 for gradients (where dynamic range matters more).
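The range limits in the table can be derived from the bit layouts. The sketch below assumes the common FP8 encoding conventions, which this page does not spell out: E5M2 reserves the all-ones exponent for infinities and NaN as IEEE formats do, while E4M3 reserves only the single all-ones exponent and all-ones mantissa pattern for NaN. The function name is hypothetical.

```python
def fp8_max_finite(exp_bits, man_bits, ieee_like):
    """Largest finite value of a small floating-point format.

    ieee_like=True : the all-ones exponent field is reserved (inf/NaN), as assumed for E5M2.
    ieee_like=False: only all-ones exponent + all-ones mantissa is reserved (NaN),
                     as assumed for E4M3.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp_field = 2 ** exp_bits - 2   # top exponent code is unusable
        max_man_field = 2 ** man_bits - 1   # every mantissa code is usable
    else:
        max_exp_field = 2 ** exp_bits - 1   # top exponent code is usable
        max_man_field = 2 ** man_bits - 2   # top mantissa code at that exponent is NaN
    return 2.0 ** (max_exp_field - bias) * (1 + max_man_field / 2 ** man_bits)

e4m3_max = fp8_max_finite(4, 3, ieee_like=False)
e5m2_max = fp8_max_finite(5, 2, ieee_like=True)
print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

The extra exponent bit in E5M2 buys a far larger range at the cost of one mantissa bit, which is the trade-off the HYBRID format exploits.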

Per-tensor scaling is essential for FP8 to work effectively. Each tensor is associated with a scaling factor $s$ that maps the tensor's value range into the FP8 representable range:

$$x_{\text{fp8}} = \text{quantize}_{\text{fp8}}\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{\text{fp8\_max}}$$

The dequantized value is recovered as $\hat{x} = x_{\text{fp8}} \times s$.
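The quantize and dequantize round trip can be simulated in plain Python. This is a minimal sketch, not the code used by either backend: it emulates E4M3 rounding on ordinary floats, assumes saturation at the maximum value rather than overflow, and all function names are hypothetical.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 value
E4M3_MIN_EXP = -6      # exponent of the smallest normal E4M3 value (2**-6)
MANTISSA_BITS = 3

def round_to_e4m3(v):
    """Round v to the nearest E4M3-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    a = min(abs(v), E4M3_MAX)              # saturate instead of overflowing
    _, e = math.frexp(a)                   # a lies in [2**(e-1), 2**e)
    exp = max(e - 1, E4M3_MIN_EXP)         # subnormals share the minimum exponent
    step = 2.0 ** (exp - MANTISSA_BITS)    # spacing between neighbouring values
    return sign * round(a / step) * step

def quantize_per_tensor(xs):
    """Return (quantized values, scale) with s = max(|x|) / fp8_max."""
    amax = max(abs(x) for x in xs)
    s = amax / E4M3_MAX if amax > 0 else 1.0   # guard against an all-zero tensor
    return [round_to_e4m3(x / s) for x in xs], s

def dequantize(q, s):
    """Recover approximate values: x_hat = x_fp8 * s."""
    return [v * s for v in q]

xs = [0.001, -0.5, 2.0, 1000.0]
q, s = quantize_per_tensor(xs)
x_hat = dequantize(q, s)
```

With a single scale for the whole tensor, the largest element maps exactly onto the FP8 maximum, while the smallest element here (0.001) falls below the smallest representable step after scaling and is flushed to zero. That loss is what finer-grained scaling schemes try to reduce.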

Dynamic scaling uses an AMAX (absolute maximum) history to compute the scaling factor:

$$s_t = \frac{\text{amax\_history}[t]}{\text{fp8\_max}}$$

where amax_history tracks the maximum absolute values over a sliding window (default length 16 in the TE backend), and the amax_compute_algo determines how the window is aggregated (typically "max").
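The sliding-window behaviour can be sketched as follows. This is a simplified illustration rather than Transformer Engine's implementation: the class and method names are hypothetical, and the only aggregation modes modelled are "max" and "most_recent".

```python
from collections import deque

class AmaxHistoryScaler:
    """Toy AMAX-history scaler: keeps the last `history_len` absolute maxima."""

    def __init__(self, fp8_max=448.0, history_len=16, algo="max"):
        if algo not in ("max", "most_recent"):
            raise ValueError(f"unsupported algo: {algo}")
        self.fp8_max = fp8_max
        self.algo = algo
        self.history = deque(maxlen=history_len)  # old entries fall off automatically

    def update(self, values):
        """Record the absolute maximum of the tensor seen at this step."""
        self.history.append(max(abs(v) for v in values))

    def scale(self):
        """Scaling factor derived from the aggregated history."""
        if not self.history:
            return 1.0
        amax = max(self.history) if self.algo == "max" else self.history[-1]
        return amax / self.fp8_max if amax > 0 else 1.0

scaler = AmaxHistoryScaler(history_len=3, algo="max")
for step_values in ([1.0, -0.5], [896.0, 3.0], [2.0, 1.0]):
    scaler.update(step_values)
print(scaler.scale())  # 2.0, because the spike of 896 is still inside the window
```

Aggregating with "max" keeps the scale conservative for as long as a spike remains in the window, which protects against overflow at the cost of some precision on quieter steps.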

Rowwise scaling (used by the TorchAO backend) computes a separate scaling factor for each row of the weight matrix, providing finer granularity:

config = Float8LinearConfig.from_recipe_name("rowwise")
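To show why per-row scales give finer granularity, the sketch below compares one scale for the whole matrix with one scale per row. It works on nested Python lists and is only an illustration of the idea; it does not call TorchAO, and the function names are hypothetical.

```python
FP8_MAX = 448.0  # E4M3 maximum

def per_tensor_scale(matrix):
    """One scale shared by every element of the matrix."""
    amax = max(abs(v) for row in matrix for v in row)
    return amax / FP8_MAX if amax > 0 else 1.0

def rowwise_scales(matrix):
    """One scale per row, so an outlier only affects its own row."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_MAX if amax > 0 else 1.0)
    return scales

weights = [
    [0.02, -0.01, 0.03],   # small-magnitude row
    [44.8, -3.0, 1.5],     # row containing an outlier
]
shared = per_tensor_scale(weights)
per_row = rowwise_scales(weights)
```

With the shared scale, the first row is divided by a factor chosen for the outlier and uses only a tiny slice of the FP8 range. With rowwise scales, each row is stretched to the full range independently.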

The module filter function ensures safe conversion by skipping layers that are sensitive to quantization or have incompatible dimensions:

def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    if not (hasattr(module, "weight") and len(module.weight.shape) == 2):
        return False
    in_features, out_features = module.weight.shape[1], module.weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True
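The filter can be exercised without loading a model by using stand-in objects that expose only a weight shape. In the sketch below the filter body repeats the function above, while the layer names, shapes, and the use of SimpleNamespace are illustrative assumptions.

```python
from types import SimpleNamespace

def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    if not (hasattr(module, "weight") and len(module.weight.shape) == 2):
        return False
    in_features, out_features = module.weight.shape[1], module.weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True

def fake_module(*shape):
    """Stand-in for a layer: only the weight shape matters to the filter."""
    return SimpleNamespace(weight=SimpleNamespace(shape=shape))

candidates = {
    "model.embed_tokens": fake_module(32000, 4096),                    # skipped by name
    "model.layers.0.self_attn.q_proj": fake_module(4096, 4096),        # converted
    "model.layers.0.input_layernorm": fake_module(4096),               # 1D weight
    "model.layers.0.mlp.odd_proj": fake_module(100, 4096),             # 100 % 16 != 0
    "encoder.layer.0.attention.output.dense": fake_module(768, 768),   # name contains "output"
    "lm_head": fake_module(32000, 4096),                               # skipped by name
}
decisions = {name: module_filter_func(m, name) for name, m in candidates.items()}
```

Only the q_proj entry passes. Because the name check is a substring match, any layer whose qualified name contains "output" is skipped as well, even if it is an ordinary linear projection.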

The dimension alignment requirement ($\text{dim} \bmod 16 = 0$) stems from the FP8 tensor core's tile size: NVIDIA's FP8 GEMM kernels operate on 16x16 tiles, and misaligned dimensions require padding that eliminates the performance benefit.

For FSDP integration, FP8 all-gather communicates parameters in FP8 format, reducing communication volume by 2x compared to BF16:

import os

# Expose the training argument as an environment variable.
if training_args.fp8_enable_fsdp_float8_all_gather:
    os.environ["FP8_ENABLE_FSDP_FLOAT8_ALL_GATHER"] = "true"
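The 2x figure follows directly from the element size. The back-of-the-envelope sketch below counts only the parameter payload and ignores the scaling factors that accompany FP8 tensors; the parameter count is an arbitrary example and the function name is hypothetical.

```python
def param_payload_bytes(num_params, bytes_per_param):
    """Bytes needed to transmit every parameter once."""
    return num_params * bytes_per_param

num_params = 7_000_000_000                        # example size, not tied to a specific model
bf16_bytes = param_payload_bytes(num_params, 2)   # BF16: 2 bytes per element
fp8_bytes = param_payload_bytes(num_params, 1)    # FP8: 1 byte per element
reduction = bf16_bytes / fp8_bytes
```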

The Accelerator monkey-patch mechanism ensures that FP8 recipe kwargs are injected into the Accelerator even when the HuggingFace Trainer does not natively pass them:

def patched_init(self, *args, **kwargs):
    kwargs["kwargs_handlers"] = [FP8Recipe(backend="TE", fp8_format="HYBRID", ...)]
    kwargs["mixed_precision"] = "fp8"
    return original_init(self, *args, **kwargs)
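The general monkey-patching pattern can be shown on a toy class. This sketch does not touch the real Accelerator: ToyAccelerator, patch_init, and the dict used as a recipe are all hypothetical stand-ins. Unlike the snippet above, it appends to any handlers the caller already supplied rather than replacing them, which is a design choice of this example only.

```python
import functools

class ToyAccelerator:
    """Stand-in class for illustration; not the HuggingFace Accelerator."""

    def __init__(self, mixed_precision="no", kwargs_handlers=None):
        self.mixed_precision = mixed_precision
        self.kwargs_handlers = kwargs_handlers or []

def patch_init(cls, recipe):
    """Wrap cls.__init__ so every new instance receives the FP8 recipe."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        handlers = list(kwargs.get("kwargs_handlers") or [])
        handlers.append(recipe)                  # keep caller-supplied handlers
        kwargs["kwargs_handlers"] = handlers
        kwargs["mixed_precision"] = "fp8"        # override whatever was requested
        return original_init(self, *args, **kwargs)

    cls.__init__ = patched_init
    return original_init                         # returned so the patch can be undone

recipe = {"backend": "TE", "fp8_format": "HYBRID"}   # placeholder for a recipe object
original = patch_init(ToyAccelerator, recipe)
acc = ToyAccelerator(mixed_precision="bf16")         # the caller asks for bf16 ...
print(acc.mixed_precision, acc.kwargs_handlers)      # ... but gets fp8 plus the recipe
```

Returning the original initializer makes it possible to restore the class afterwards with ToyAccelerator.__init__ = original, which keeps the patch from leaking into unrelated code or tests.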

Related Pages

Implementation:Hiyouga_LLaMA_Factory_FP8_Utils
