
Environment:Hiyouga LLaMA Factory FP8 Training Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Optimization
Last Updated: 2026-02-06 20:00 GMT

Overview

FP8 mixed-precision training environment using either TorchAO or NVIDIA Transformer Engine backends, requiring Hopper/Ada Lovelace GPUs.

Description

FP8 training reduces memory usage and increases throughput by using 8-bit floating point for linear-layer computations. LLaMA Factory supports two FP8 backends: TorchAO (the default, configured via Float8LinearConfig) and NVIDIA Transformer Engine (optimal for Hopper GPUs). FP8 is handled entirely through HuggingFace Accelerate. For numerical stability and kernel alignment, the system automatically skips embedding layers, LM heads, and any linear layer whose dimensions are not divisible by 16.
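As a rough illustration of the memory claim (a back-of-the-envelope sketch, not a measurement), storing the FP8-eligible linear-layer weights at 1 byte per element instead of 2 bytes (fp16/bf16) halves their footprint. The shapes below are loosely modeled on a LLaMA-style transformer and are illustrative, not an exact architecture dump; note that in practice FP8 training still keeps higher-precision master weights and optimizer state, so end-to-end savings are smaller than this ratio suggests.

```python
def linear_weight_bytes(shapes, bytes_per_elem):
    """Total bytes needed to store the given (out_features, in_features) weight shapes."""
    return sum(rows * cols * bytes_per_elem for rows, cols in shapes)

# Hypothetical per-block linear shapes for a LLaMA-style model
# (4 attention projections, 3 MLP projections per block).
hidden, intermediate, n_layers = 4096, 11008, 32
block = [(hidden, hidden)] * 4 + [(intermediate, hidden)] * 2 + [(hidden, intermediate)]
shapes = block * n_layers

bf16 = linear_weight_bytes(shapes, 2)  # 2 bytes per element in fp16/bf16
fp8 = linear_weight_bytes(shapes, 1)   # 1 byte per element in FP8
print(f"bf16: {bf16 / 2**30:.1f} GiB, fp8: {fp8 / 2**30:.1f} GiB")
```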

Usage

Use this environment when training on NVIDIA Hopper (H100/H200) or Ada Lovelace (RTX 4090) GPUs and you want to reduce memory usage while maintaining model quality. FP8 is incompatible with quantization.

System Requirements

Category   Requirement                         Notes
Hardware   NVIDIA Hopper or Ada Lovelace GPU   H100, H200, L40S, RTX 4090
VRAM       >= 24 GB recommended                FP8 reduces memory vs fp16/bf16
CUDA       >= 12.0                             Required for FP8 hardware support
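The hardware requirement can be checked programmatically. A minimal sketch, assuming only that Hopper reports CUDA compute capability 9.0 and Ada Lovelace reports 8.9 (the values `torch.cuda.get_device_capability` would return); this gating logic is illustrative and is not LLaMA Factory's actual check:

```python
def supports_fp8(major, minor):
    """True if compute capability (major, minor) has FP8 tensor-core support.

    Hopper (H100/H200) is SM 9.0; Ada Lovelace (L40S, RTX 4090) is SM 8.9.
    Earlier architectures (e.g. Ampere 8.0/8.6) lack FP8 hardware support.
    """
    return major >= 9 or (major == 8 and minor == 9)

# With PyTorch installed, the capability of GPU 0 could be obtained via:
#   major, minor = torch.cuda.get_device_capability(0)
print(supports_fp8(9, 0), supports_fp8(8, 9), supports_fp8(8, 0))
```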

Dependencies

TorchAO Backend (Default)

  • torchao
  • torch >= 2.4.0

Transformer Engine Backend

  • transformer-engine
  • transformer-engine-extensions

Credentials

No additional credentials required.

Quick Install

# TorchAO backend (default)
pip install torchao "torch>=2.4.0"

# Transformer Engine backend (optimal for Hopper)
pip install transformer-engine transformer-engine-extensions

Code Evidence

FP8 backend selection from src/llamafactory/train/fp8_utils.py:44-62:

# Use Transformer Engine backend (optimal for Hopper GPUs)
if backend == "te":
    from accelerate.utils import FP8RecipeKwargs
    logger.info_rank0("Using Transformer Engine FP8 backend")
    return [FP8RecipeKwargs(backend="TE", fp8_format="HYBRID", amax_history_len=16, amax_compute_algo="max")]

# Use TorchAO backend (default)
from accelerate.utils import AORecipeKwargs
config = None
if backend == "torchao" or backend == "auto":
    from torchao.float8 import Float8LinearConfig
    config = Float8LinearConfig.from_recipe_name("rowwise")

Dimension alignment check from src/llamafactory/train/fp8_utils.py:82-93:

# TorchAO FP8 requires dimensions divisible by 16 for optimal kernels
def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True
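The filter above can be exercised on representative shapes. A stand-in module object is used here because the real call receives `torch.nn.Linear` instances; the layer names and shapes are illustrative:

```python
from types import SimpleNamespace

def module_filter_func(module, layer_name):
    """Same logic as the LLaMA Factory filter quoted above."""
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True

def fake_linear(out_features, in_features):
    # Minimal stand-in exposing only .weight.shape, all the filter inspects.
    return SimpleNamespace(weight=SimpleNamespace(shape=(out_features, in_features)))

print(module_filter_func(fake_linear(4096, 4096), "model.layers.0.self_attn.q_proj"))  # aligned -> converted
print(module_filter_func(fake_linear(32000, 4096), "lm_head"))                         # skipped by name
print(module_filter_func(fake_linear(1000, 4096), "model.score_head"))                 # 1000 % 16 != 0 -> skipped
```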

FP8 incompatibility with quantization from src/llamafactory/hparams/parser.py:337-338:

if not finetuning_args.use_mca and training_args.fp8 and model_args.quantization_bit is not None:
    raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")
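The same guard can be reproduced as a standalone check (a sketch that omits the `use_mca` condition from the snippet above; the argument names mirror the parsed hyperparameters):

```python
def validate_fp8_args(fp8, quantization_bit):
    """Raise early if FP8 and weight quantization are requested together."""
    if fp8 and quantization_bit is not None:
        raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")

validate_fp8_args(fp8=True, quantization_bit=None)   # fine: FP8 alone
try:
    validate_fp8_args(fp8=True, quantization_bit=4)  # FP8 + 4-bit quantization -> error
except ValueError as e:
    print(e)
```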

Common Errors

Error Message                                      Cause                                   Solution
FP8 training is not compatible with quantization   Using both FP8 and quantization_bit     Disable quantization when using FP8
Failed to create FP8 configuration                 Missing torchao or transformer-engine   Install the appropriate FP8 backend package
fp8_enabled=False warning                          Accelerate not detecting FP8            Verify GPU supports FP8 and packages are installed

Compatibility Notes

  • TorchAO Backend: Default backend. Uses rowwise scaling for better performance. Layers with dimensions not divisible by 16 are automatically skipped.
  • Transformer Engine Backend: Optimal for NVIDIA Hopper GPUs. Uses HYBRID FP8 format with amax history length of 16.
  • FSDP Integration: Supports fp8_enable_fsdp_float8_all_gather optimization for distributed FP8 training.
  • Incompatible with: BitsAndBytes quantization, HQQ, EETQ, and other quantization methods.
