Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Hiyouga LLaMA Factory FP8 Mixed Precision

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Hardware Optimization
Last Updated 2026-02-06 19:00 GMT

Overview

FP8 mixed precision is a training efficiency technique that uses 8-bit floating-point arithmetic for matrix multiplications in linear layers, approximately doubling throughput on supported hardware (NVIDIA Hopper, Ada Lovelace) while maintaining model quality through selective layer conversion and scaling strategies.

Description

Traditional mixed-precision training uses FP16 or BF16 for forward and backward passes with FP32 master weights. FP8 takes this further by quantizing the inputs and weights of linear layers to 8-bit floating point, enabling the use of specialized FP8 tensor cores that deliver approximately 2x the throughput of BF16 tensor cores.

However, FP8 has a significantly narrower dynamic range than FP16/BF16, requiring careful management of scaling factors to prevent overflow or underflow. LLaMA-Factory's FP8 implementation supports two backends:

  • TorchAO backend (default): Uses PyTorch's native FP8 support via torchao.float8.Float8LinearConfig with rowwise scaling for optimal kernel performance. This backend automatically handles per-tensor scaling and is integrated through HuggingFace Accelerate's AORecipeKwargs.
  • Transformer Engine (TE) backend: Uses NVIDIA's Transformer Engine library with HYBRID FP8 format (combining E4M3 for forward and E5M2 for backward), AMAX-based dynamic scaling with a configurable history length, and integration through Accelerate's FP8RecipeKwargs.

The implementation applies FP8 conversion selectively using a module filter function that:

  1. Skips embedding layers, language model heads, and classifier layers for numerical stability.
  2. Only converts nn.Linear layers with 2D weight matrices.
  3. Verifies that both input and output feature dimensions are divisible by 16, a requirement of FP8 GEMM kernels.

Usage

FP8 mixed precision should be used when:

  • Training on NVIDIA H100/H200 (Hopper) or L40/RTX 4090 (Ada Lovelace) GPUs that have dedicated FP8 tensor cores.
  • Training throughput is a bottleneck and model quality degradation from reduced precision is acceptable or manageable.
  • Combined with FSDP, where FP8 all-gather optimization can further reduce communication bandwidth by transmitting FP8 tensors instead of BF16.

Enable via the fp8 flag in training arguments, with optional fp8_backend selection ("auto", "torchao", or "te").

Theoretical Basis

FP8 uses 8 bits to represent floating-point numbers in two formats:

Format Exponent Bits Mantissa Bits Dynamic Range Precision Typical Use
E4M3 4 3 ±448 Higher Forward pass
E5M2 5 2 ±57344 Lower Backward pass (gradients)

The HYBRID format uses E4M3 for forward activations and weights (where precision matters more) and E5M2 for gradients (where dynamic range matters more).

Per-tensor scaling is essential for FP8 to work effectively. Each tensor is associated with a scaling factor s that maps the tensor's value range into the FP8 representable range:

Failed to parse (syntax error): {\displaystyle x_{\text{fp8}} = \text{quantize}_{\text{fp8}}\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{\text{fp8\_max}}}

The dequantized value is recovered as x^=xfp8×s.

Dynamic scaling uses an AMAX (absolute maximum) history to compute the scaling factor:

Failed to parse (syntax error): {\displaystyle s_t = \frac{\text{amax\_history}[t]}{\text{fp8\_max}}}

where amax_history tracks the maximum absolute values over a sliding window (default length 16 in the TE backend), and the amax_compute_algo determines how the window is aggregated (typically "max").

Rowwise scaling (used by the TorchAO backend) computes a separate scaling factor for each row of the weight matrix, providing finer granularity:

config = Float8LinearConfig.from_recipe_name("rowwise")

The module filter function ensures safe conversion by skipping layers that are sensitive to quantization or have incompatible dimensions:

def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    if not (hasattr(module, "weight") and len(module.weight.shape) == 2):
        return False
    in_features, out_features = module.weight.shape[1], module.weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True

The dimension alignment requirement (dimmod16=0) stems from the FP8 tensor core's tile size: NVIDIA's FP8 GEMM kernels operate on 16x16 tiles, and misaligned dimensions require padding that eliminates the performance benefit.

For FSDP integration, FP8 all-gather communicates parameters in FP8 format, reducing communication volume by 2x compared to BF16:

if training_args.fp8_enable_fsdp_float8_all_gather:
    os.environ["FP8_ENABLE_FSDP_FLOAT8_ALL_GATHER"] = "true"

The Accelerator monkey-patch mechanism ensures that FP8 recipe kwargs are injected into the Accelerator even when the HuggingFace Trainer does not natively pass them:

def patched_init(self, *args, **kwargs):
    kwargs["kwargs_handlers"] = [FP8Recipe(backend="TE", fp8_format="HYBRID", ...)]
    kwargs["mixed_precision"] = "fp8"
    return original_init(self, *args, **kwargs)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment