Environment: Hiyouga LLaMA Factory FP8 Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
FP8 mixed-precision training environment using either TorchAO or NVIDIA Transformer Engine backends, requiring Hopper/Ada Lovelace GPUs.
Description
FP8 training reduces memory usage and increases throughput by using 8-bit floating point for linear layer computations. LLaMA Factory supports two FP8 backends: TorchAO (default, uses Float8LinearConfig) and NVIDIA Transformer Engine (optimal for Hopper GPUs). FP8 is handled entirely through HuggingFace Accelerate. The system automatically skips embedding layers, LM heads, and layers with dimensions not divisible by 16 for numerical stability.
Usage
Use this environment when training on NVIDIA Hopper (H100/H200) or Ada Lovelace (RTX 4090) GPUs to reduce memory usage while maintaining model quality. FP8 is incompatible with quantization.
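As a sketch, FP8 is typically switched on through the training arguments. The `fp8` key below matches the `training_args.fp8` check cited in the Code Evidence section; the `fp8_backend` key name and its values are assumptions inferred from the backend-selection code ("torchao", "te", "auto") and should be verified against your LLaMA Factory version:

```yaml
### model / method (illustrative values)
model_name_or_path: meta-llama/Meta-Llama-3-8B
stage: sft
finetuning_type: lora

### fp8 (`fp8` matches the parser check; `fp8_backend` is an assumed key name)
fp8: true
fp8_backend: torchao   # or "te" on Hopper GPUs
```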
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA Hopper or Ada Lovelace GPU | H100, H200, L40S, RTX 4090 |
| VRAM | >= 24GB recommended | FP8 reduces memory vs fp16/bf16 |
| CUDA | >= 12.0 | Required for FP8 hardware support |
Dependencies
TorchAO Backend (Default)
- torchao
- torch >= 2.4.0
Transformer Engine Backend
- transformer-engine
- transformer-engine-extensions
Credentials
No additional credentials required.
Quick Install
```bash
# TorchAO backend (default); quote the version spec so the shell
# does not treat ">" as a redirection
pip install torchao "torch>=2.4.0"

# Transformer Engine backend (optimal for Hopper)
pip install transformer-engine transformer-engine-extensions
```
Code Evidence
FP8 backend selection from src/llamafactory/train/fp8_utils.py:44-62:
```python
# Use Transformer Engine backend (optimal for Hopper GPUs)
if backend == "te":
    from accelerate.utils import FP8RecipeKwargs

    logger.info_rank0("Using Transformer Engine FP8 backend")
    return [FP8RecipeKwargs(backend="TE", fp8_format="HYBRID", amax_history_len=16, amax_compute_algo="max")]

# Use TorchAO backend (default)
from accelerate.utils import AORecipeKwargs

config = None
if backend == "torchao" or backend == "auto":
    from torchao.float8 import Float8LinearConfig

    config = Float8LinearConfig.from_recipe_name("rowwise")
```
Dimension alignment check from src/llamafactory/train/fp8_utils.py:82-93:
```python
# TorchAO FP8 requires dimensions divisible by 16 for optimal kernels
def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True
```
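The filter above only consults the layer name and the weight shape, so its behavior can be exercised without PyTorch. The sketch below reimplements the same logic against a stand-in module object (`fake_linear` is a hypothetical helper, not part of LLaMA Factory):

```python
from types import SimpleNamespace

def module_filter_func(module, layer_name):
    """Return True if the layer should be converted to FP8."""
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True

# Stand-in for an nn.Linear: only .weight.shape is consulted.
def fake_linear(out_features, in_features):
    return SimpleNamespace(weight=SimpleNamespace(shape=(out_features, in_features)))

print(module_filter_func(fake_linear(4096, 4096), "model.layers.0.mlp.up_proj"))  # True
print(module_filter_func(fake_linear(32000, 4096), "lm_head"))                    # False: on the skip list
print(module_filter_func(fake_linear(100, 4096), "model.layers.0.score"))         # False: 100 % 16 != 0
```

Note the asymmetry: a skipped layer still trains, it just stays in the higher-precision dtype.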
FP8 incompatibility with quantization from src/llamafactory/hparams/parser.py:337-338:
```python
if not finetuning_args.use_mca and training_args.fp8 and model_args.quantization_bit is not None:
    raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")
```
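The guard is easy to exercise in isolation. The following stand-alone sketch mirrors its logic; the function and argument names are illustrative, not LLaMA Factory's API:

```python
def check_fp8_quantization(fp8, quantization_bit, use_mca=False):
    """Mirror of the parser guard: FP8 and k-bit quantization are mutually exclusive."""
    if not use_mca and fp8 and quantization_bit is not None:
        raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")

check_fp8_quantization(fp8=True, quantization_bit=None)   # ok: FP8 alone
check_fp8_quantization(fp8=False, quantization_bit=4)     # ok: quantization alone
try:
    check_fp8_quantization(fp8=True, quantization_bit=4)  # both -> rejected
except ValueError as e:
    print(e)
```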
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FP8 training is not compatible with quantization` | Using both FP8 and `quantization_bit` | Disable quantization when using FP8 |
| `Failed to create FP8 configuration` | Missing torchao or transformer-engine | Install the appropriate FP8 backend package |
| `fp8_enabled=False` warning | Accelerate not detecting FP8 | Verify the GPU supports FP8 and the backend packages are installed |
Compatibility Notes
- TorchAO Backend: Default backend. Uses rowwise scaling for better performance. Layers with dimensions not divisible by 16 are automatically skipped.
- Transformer Engine Backend: Optimal for NVIDIA Hopper GPUs. Uses HYBRID FP8 format with amax history length of 16.
- FSDP Integration: Supports the `fp8_enable_fsdp_float8_all_gather` optimization for distributed FP8 training.
- Incompatible with: BitsAndBytes quantization, HQQ, EETQ, and other quantization methods.
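Since FP8 is handled through HuggingFace Accelerate, distributed runs usually pair the trainer config with an Accelerate config. A minimal sketch, assuming Accelerate's standard config keys (verify `mixed_precision: fp8` support in your installed Accelerate version):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: fp8
num_machines: 1
num_processes: 8
```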