Heuristic: Volcengine Verl FSDP Mixed Precision Init
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Distributed_Training
- Last Updated: 2026-02-07 17:00 GMT
Overview
Always initialize models in fp32 before applying FSDP mixed precision to prevent optimizer state corruption.
Description
When using FSDP with mixed precision, models must be created in fp32. The FSDP mixed precision policy then casts parameters to bf16 for compute while keeping optimizer states in fp32. If models are initialized directly in bf16, the optimizer states will also be in bf16, leading to training instability and convergence issues.
Usage
Apply whenever setting up FSDP training in verl, especially when using the FSDPSFTTrainer or PPO actor/critic workers.
The Insight
- Action: Initialize model in fp32, then configure FSDP mixed precision with param_dtype=bf16, reduce_dtype=fp32, buffer_dtype=fp32
- Value: param_dtype=bf16 halves compute and communication cost; reduce_dtype=fp32 keeps the gradient all-reduce numerically accurate; buffer_dtype=fp32 preserves precision for buffers such as normalization statistics
- Trade-off: fp32 initialization temporarily holds parameters at twice the bf16 footprint, but guarantees the optimizer states are created in full precision
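A minimal sketch of the pattern (toy linear layer as a stand-in for the real model; verl applies the same policy when wrapping its actor/critic workers):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Create the model in fp32 (PyTorch's default dtype). Do NOT pass
# torch_dtype=torch.bfloat16 or call .to(torch.bfloat16) here.
model = torch.nn.Linear(1024, 1024)

# FSDP casts parameters to bf16 on the fly during forward/backward;
# reductions and buffers stay in fp32.
mixed_precision = MixedPrecision(
    param_dtype=torch.bfloat16,   # compute dtype for forward/backward
    reduce_dtype=torch.float32,   # gradient all-reduce in full precision
    buffer_dtype=torch.float32,   # buffers kept in full precision
)

# Inside a distributed job one would then wrap:
#   fsdp_model = FSDP(model, mixed_precision=mixed_precision)
# An optimizer built on fsdp_model.parameters() sees fp32 master weights,
# so AdamW's exp_avg/exp_avg_sq states are allocated in fp32.
```

The wrap itself is shown as a comment because it requires an initialized process group; the essential point is the fp32 model plus the bf16/fp32/fp32 policy.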
Reasoning
FSDP mixed precision casts parameters on-the-fly during forward/backward passes. If the model is already in bf16, the optimizer states will also be created in bf16, which loses precision in the Adam/AdamW running averages (momentum and variance). This can cause training instability or NaN losses.
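The failure mode can be reproduced in a few lines (a toy parameter, not verl code): AdamW allocates its state tensors with the parameter's dtype, so a bf16-initialized model gets bf16 momentum and variance buffers.

```python
import torch

# Anti-pattern: parameter created directly in bf16.
p_bf16 = torch.nn.Parameter(torch.randn(4, 4, dtype=torch.bfloat16))
opt_bf16 = torch.optim.AdamW([p_bf16], lr=1e-3)
p_bf16.sum().backward()
opt_bf16.step()
# exp_avg / exp_avg_sq inherit bf16 (~8 bits of mantissa for the
# running averages), which is where the instability comes from.
print(opt_bf16.state[p_bf16]["exp_avg"].dtype)  # torch.bfloat16

# Correct pattern: fp32 parameter; FSDP later handles the bf16 compute cast.
p_fp32 = torch.nn.Parameter(torch.randn(4, 4))
opt_fp32 = torch.optim.AdamW([p_fp32], lr=1e-3)
p_fp32.sum().backward()
opt_fp32.step()
print(opt_fp32.state[p_fp32]["exp_avg"].dtype)  # torch.float32
```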
Code Evidence
From verl/workers/fsdp_workers.py:311:
# note that we have to create model in fp32. Otherwise, the optimizer is in bf16, which is incorrect
And the mixed precision defaults from verl/workers/fsdp_workers.py:493-501:
# param_dtype: bf16 (default)
# reduce_dtype: fp32 (default)
# buffer_dtype: fp32 (default)