Heuristic: Volcengine Verl FSDP Mixed Precision Init
Metadata:
- Sources: Repo|verl|https://github.com/volcengine/verl
- Domains: Optimization, Distributed_Training
- Last Updated: 2026-02-07 17:00 GMT
Overview
Always initialize models in fp32 before applying FSDP mixed precision to prevent optimizer state corruption.
Description
When using FSDP with mixed precision, models must be created in fp32. The FSDP mixed precision policy then casts parameters to bf16 for compute while keeping optimizer states in fp32. If models are initialized directly in bf16, the optimizer states will also be in bf16, leading to training instability and convergence issues.
Usage
Apply whenever setting up FSDP training in verl, especially when using the FSDPSFTTrainer or PPO actor/critic workers.
The Insight
- Action: Initialize model in fp32, then configure FSDP mixed precision with param_dtype=bf16, reduce_dtype=fp32, buffer_dtype=fp32
- Value: param_dtype=bf16 halves compute and communication cost; reduce_dtype=fp32 keeps the gradient all-reduce numerically accurate; buffer_dtype=fp32 preserves precision for buffers such as normalization statistics
- Trade-off: fp32 initialization temporarily holds parameters at twice the bf16 footprint, but guarantees the optimizer states are created in full precision
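A minimal sketch of the pattern (toy linear layer as a stand-in for the real model; verl applies the same policy when wrapping its actor/critic workers):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Create the model in fp32 (PyTorch's default dtype). Do NOT pass
# torch_dtype=torch.bfloat16 or call .to(torch.bfloat16) here.
model = torch.nn.Linear(1024, 1024)

# FSDP casts parameters to bf16 on the fly during forward/backward;
# reductions and buffers stay in fp32.
mixed_precision = MixedPrecision(
    param_dtype=torch.bfloat16,   # compute dtype for forward/backward
    reduce_dtype=torch.float32,   # gradient all-reduce in full precision
    buffer_dtype=torch.float32,   # buffers kept in full precision
)

# Inside a distributed job one would then wrap:
#   fsdp_model = FSDP(model, mixed_precision=mixed_precision)
# An optimizer built on fsdp_model.parameters() sees fp32 master weights,
# so AdamW's exp_avg/exp_avg_sq states are allocated in fp32.
```

The wrap itself is shown as a comment because it requires an initialized process group; the essential point is the fp32 model plus the bf16/fp32/fp32 policy.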
Reasoning
FSDP mixed precision casts parameters on-the-fly during forward/backward passes. If the model is already in bf16, the optimizer states will also be created in bf16, which loses precision in the Adam/AdamW running averages (momentum and variance). This can cause training instability or NaN losses.
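The failure mode can be reproduced in a few lines (a toy parameter, not verl code): AdamW allocates its state tensors with the parameter's dtype, so a bf16-initialized model gets bf16 momentum and variance buffers.

```python
import torch

# Anti-pattern: parameter created directly in bf16.
p_bf16 = torch.nn.Parameter(torch.randn(4, 4, dtype=torch.bfloat16))
opt_bf16 = torch.optim.AdamW([p_bf16], lr=1e-3)
p_bf16.sum().backward()
opt_bf16.step()
# exp_avg / exp_avg_sq inherit bf16 (~8 bits of mantissa for the
# running averages), which is where the instability comes from.
print(opt_bf16.state[p_bf16]["exp_avg"].dtype)  # torch.bfloat16

# Correct pattern: fp32 parameter; FSDP later handles the bf16 compute cast.
p_fp32 = torch.nn.Parameter(torch.randn(4, 4))
opt_fp32 = torch.optim.AdamW([p_fp32], lr=1e-3)
p_fp32.sum().backward()
opt_fp32.step()
print(opt_fp32.state[p_fp32]["exp_avg"].dtype)  # torch.float32
```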
Code Evidence
From verl/workers/fsdp_workers.py:311:
# note that we have to create model in fp32. Otherwise, the optimizer is in bf16, which is incorrect
And the mixed precision defaults from verl/workers/fsdp_workers.py:493-501:
# param_dtype: bf16 (default)
# reduce_dtype: fp32 (default)
# buffer_dtype: fp32 (default)