Environment: Hiyouga LLaMA Factory FP8 Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
FP8 mixed-precision training environment using either TorchAO or NVIDIA Transformer Engine backends, requiring Hopper/Ada Lovelace GPUs.
Description
FP8 training reduces memory usage and increases throughput by using 8-bit floating point for linear layer computations. LLaMA Factory supports two FP8 backends: TorchAO (default, uses Float8LinearConfig) and NVIDIA Transformer Engine (optimal for Hopper GPUs). FP8 is handled entirely through HuggingFace Accelerate. The system automatically skips embedding layers, LM heads, and layers with dimensions not divisible by 16 for numerical stability.
Usage
Use this environment when training on NVIDIA Hopper (H100/H200) or Ada Lovelace (RTX 4090) GPUs to reduce memory usage while maintaining model quality. FP8 is incompatible with quantization.
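As a sketch, FP8 is typically switched on through the training arguments. The `fp8` key below matches the `training_args.fp8` check cited in the Code Evidence section; the `fp8_backend` key name and its values are assumptions inferred from the backend-selection code ("torchao", "te", "auto") and should be verified against your LLaMA Factory version:

```yaml
### model / method (illustrative values)
model_name_or_path: meta-llama/Meta-Llama-3-8B
stage: sft
finetuning_type: lora

### fp8 (`fp8` matches the parser check; `fp8_backend` is an assumed key name)
fp8: true
fp8_backend: torchao   # or "te" on Hopper GPUs
```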
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA Hopper or Ada Lovelace GPU | H100, H200, L40S, RTX 4090 |
| VRAM | >= 24GB recommended | FP8 reduces memory vs fp16/bf16 |
| CUDA | >= 12.0 | Required for FP8 hardware support |
Dependencies
TorchAO Backend (Default)
- torchao
- torch >= 2.4.0
Transformer Engine Backend
- transformer-engine
- transformer-engine-extensions
Credentials
No additional credentials required.
Quick Install
```bash
# TorchAO backend (default); quote the version spec so the shell
# does not treat ">" as a redirection
pip install torchao "torch>=2.4.0"

# Transformer Engine backend (optimal for Hopper)
pip install transformer-engine transformer-engine-extensions
```
Code Evidence
FP8 backend selection from src/llamafactory/train/fp8_utils.py:44-62:
```python
# Use Transformer Engine backend (optimal for Hopper GPUs)
if backend == "te":
    from accelerate.utils import FP8RecipeKwargs

    logger.info_rank0("Using Transformer Engine FP8 backend")
    return [FP8RecipeKwargs(backend="TE", fp8_format="HYBRID", amax_history_len=16, amax_compute_algo="max")]

# Use TorchAO backend (default)
from accelerate.utils import AORecipeKwargs

config = None
if backend == "torchao" or backend == "auto":
    from torchao.float8 import Float8LinearConfig

    config = Float8LinearConfig.from_recipe_name("rowwise")
```
Dimension alignment check from src/llamafactory/train/fp8_utils.py:82-93:
```python
# TorchAO FP8 requires dimensions divisible by 16 for optimal kernels
def module_filter_func(module, layer_name):
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True
```
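The filter above only consults the layer name and the weight shape, so its behavior can be exercised without PyTorch. The sketch below reimplements the same logic against a stand-in module object (`fake_linear` is a hypothetical helper, not part of LLaMA Factory):

```python
from types import SimpleNamespace

def module_filter_func(module, layer_name):
    """Return True if the layer should be converted to FP8."""
    skip_layers = ["embed", "lm_head", "output", "classifier"]
    if any(skip_name in layer_name.lower() for skip_name in skip_layers):
        return False
    weight = module.weight
    in_features, out_features = weight.shape[1], weight.shape[0]
    if in_features % 16 != 0 or out_features % 16 != 0:
        return False
    return True

# Stand-in for an nn.Linear: only .weight.shape is consulted.
def fake_linear(out_features, in_features):
    return SimpleNamespace(weight=SimpleNamespace(shape=(out_features, in_features)))

print(module_filter_func(fake_linear(4096, 4096), "model.layers.0.mlp.up_proj"))  # True
print(module_filter_func(fake_linear(32000, 4096), "lm_head"))                    # False: on the skip list
print(module_filter_func(fake_linear(100, 4096), "model.layers.0.score"))         # False: 100 % 16 != 0
```

Note the asymmetry: a skipped layer still trains, it just stays in the higher-precision dtype.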
FP8 incompatibility with quantization from src/llamafactory/hparams/parser.py:337-338:
```python
if not finetuning_args.use_mca and training_args.fp8 and model_args.quantization_bit is not None:
    raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")
```
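The guard is easy to exercise in isolation. The following stand-alone sketch mirrors its logic; the function and argument names are illustrative, not LLaMA Factory's API:

```python
def check_fp8_quantization(fp8, quantization_bit, use_mca=False):
    """Mirror of the parser guard: FP8 and k-bit quantization are mutually exclusive."""
    if not use_mca and fp8 and quantization_bit is not None:
        raise ValueError("FP8 training is not compatible with quantization. Please disable one of them.")

check_fp8_quantization(fp8=True, quantization_bit=None)   # ok: FP8 alone
check_fp8_quantization(fp8=False, quantization_bit=4)     # ok: quantization alone
try:
    check_fp8_quantization(fp8=True, quantization_bit=4)  # both -> rejected
except ValueError as e:
    print(e)
```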
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `FP8 training is not compatible with quantization` | Using both FP8 and `quantization_bit` | Disable quantization when using FP8 |
| `Failed to create FP8 configuration` | Missing torchao or transformer-engine | Install the appropriate FP8 backend package |
| `fp8_enabled=False` warning | Accelerate not detecting FP8 | Verify the GPU supports FP8 and the backend packages are installed |
Compatibility Notes
- TorchAO Backend: Default backend. Uses rowwise scaling for better performance. Layers with dimensions not divisible by 16 are automatically skipped.
- Transformer Engine Backend: Optimal for NVIDIA Hopper GPUs. Uses HYBRID FP8 format with amax history length of 16.
- FSDP Integration: Supports the `fp8_enable_fsdp_float8_all_gather` optimization for distributed FP8 training.
- Incompatible with: BitsAndBytes quantization, HQQ, EETQ, and other quantization methods.
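Since FP8 is handled through HuggingFace Accelerate, distributed runs usually pair the trainer config with an Accelerate config. A minimal sketch, assuming Accelerate's standard config keys (verify `mixed_precision: fp8` support in your installed Accelerate version):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: fp8
num_machines: 1
num_processes: 8
```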