
Implementation:Haotian Liu LLaVA DeepSpeed ZeRO Configuration

From Leeroopedia
Last Updated 2026-02-13 00:00 GMT

Overview

Configuration files for DeepSpeed ZeRO distributed training used by LLaVA's training pipeline. Two JSON configuration files control memory partitioning strategy, mixed precision settings, and communication optimization for the two-stage training process.

Description

LLaVA uses DeepSpeed ZeRO configurations to manage multi-GPU memory during training. Two config files are provided:

  • zero2.json -- Used for Stage 1 pretraining (feature alignment). Enables optimizer state and gradient partitioning (ZeRO Stage 2), which is sufficient when only the ~30M-parameter projector is being trained.
  • zero3.json -- Used for Stage 2 finetuning (visual instruction tuning). Enables full parameter partitioning (ZeRO Stage 3), necessary when the entire 13B-parameter LLM is unfrozen for end-to-end training.
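The split between the two stages can be motivated with a back-of-the-envelope memory estimate, following the accounting in the ZeRO paper (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 Adam optimizer state per parameter). The 13B parameter count and 8-GPU node size below are illustrative assumptions, not values from the LLaVA configs:

```python
# Illustrative per-GPU memory estimate for mixed-precision Adam training,
# following the ZeRO paper's byte accounting (2 + 2 + 12 bytes/param).
def per_gpu_gb(params, gpus, stage):
    p, g, o = 2 * params, 2 * params, 12 * params  # bytes
    if stage == 2:            # ZeRO-2: partition gradients + optimizer states
        total = p + g / gpus + o / gpus
    elif stage == 3:          # ZeRO-3: partition params, grads, and optimizer states
        total = (p + g + o) / gpus
    else:                     # no partitioning (baseline data parallelism)
        total = p + g + o
    return total / 1024**3

# Hypothetical 13B-parameter model on an 8-GPU node:
P, N = 13e9, 8
print(f"ZeRO-2: {per_gpu_gb(P, N, 2):.1f} GB/GPU")  # fp16 params still replicated
print(f"ZeRO-3: {per_gpu_gb(P, N, 3):.1f} GB/GPU")  # everything sharded
```

Under these assumptions ZeRO-2 still replicates the fp16 parameters on every GPU, which is affordable when only the small projector receives gradients in Stage 1, while ZeRO-3's full sharding is what makes unfreezing the 13B LLM feasible in Stage 2.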

Both configurations enable BF16 mixed precision via the "bf16" block with "enabled": "auto", which defers to the CLI argument --bf16 True. Batch sizes and gradient accumulation steps are set to "auto", inheriting values from HuggingFace Trainer CLI arguments.
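The actual "auto" substitution is performed inside the HuggingFace Trainer's DeepSpeed integration; the snippet below is only a simplified sketch of that behavior, with illustrative dictionaries standing in for the Trainer's internals:

```python
# Simplified sketch of how "auto" placeholders in the DeepSpeed JSON are
# filled in from HuggingFace Trainer CLI arguments (illustrative only;
# not the Trainer's actual implementation).
ds_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

cli_args = {  # values as passed on the training command line
    "bf16": True,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
}

# Each "auto" resolves to the corresponding Trainer argument:
ds_config["bf16"]["enabled"] = cli_args["bf16"]
ds_config["train_micro_batch_size_per_gpu"] = cli_args["per_device_train_batch_size"]
ds_config["gradient_accumulation_steps"] = cli_args["gradient_accumulation_steps"]

print(ds_config)
```

Keeping these values as "auto" in the JSON avoids the DeepSpeed/Trainer mismatch errors that occur when the same setting is specified in both places with different values.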

Usage

Pass the appropriate configuration file via the --deepspeed CLI argument when launching training:

# Stage 1: Feature alignment pretraining with ZeRO-2
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --tune_mm_mlp_adapter True \
    ...

# Stage 2: Visual instruction tuning with ZeRO-3
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    ...

Code Reference

Source Location

scripts/zero2.json and scripts/zero3.json in the LLaVA repository.

zero2.json Configuration

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}

zero3.json Configuration

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}
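Because the two files share every key except the Stage-3 additions, a quick diff of their zero_optimization blocks shows exactly what ZeRO-3 adds. The JSON is inlined here (rather than read from ./scripts/) so the snippet is self-contained:

```python
import json

# zero_optimization blocks from the two config files above, inlined:
zero2 = json.loads('''{"stage": 2, "overlap_comm": true,
    "contiguous_gradients": true, "sub_group_size": 1e9,
    "reduce_bucket_size": "auto"}''')
zero3 = json.loads('''{"stage": 3, "overlap_comm": true,
    "contiguous_gradients": true, "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true}''')

# Keys present only in the ZeRO-3 config:
only_in_zero3 = sorted(set(zero3) - set(zero2))
print(only_in_zero3)
```

All five extra keys carry the stage3_ prefix: they tune parameter gathering (prefetch, persistence, live-parameter limits) and checkpoint saving, which only exist once parameters themselves are partitioned.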

Import

N/A -- These are JSON configuration files, not Python modules. They are consumed by the DeepSpeed runtime via the --deepspeed CLI argument.

I/O Contract

Inputs

Input Contract
Name | Type | Description
--deepspeed | str (CLI arg) | Path to the DeepSpeed JSON config file, passed to the training script.
--bf16 | bool (CLI arg) | Activates BF16 mixed precision; resolves "bf16": {"enabled": "auto"} in the JSON config.
--per_device_train_batch_size | int (CLI arg) | Per-GPU batch size; resolves "train_micro_batch_size_per_gpu": "auto".
--gradient_accumulation_steps | int (CLI arg) | Number of gradient accumulation steps; resolves "gradient_accumulation_steps": "auto".

Outputs

Output Contract
Name | Type | Description
Configured DeepSpeed engine | DeepSpeed runtime | A fully initialized distributed training environment with the specified ZeRO stage, mixed precision, and communication optimizations.

Key Configuration Parameters

Parameter Comparison: zero2.json vs zero3.json
Parameter | zero2.json | zero3.json | Purpose
zero_optimization.stage | 2 | 3 | ZeRO partitioning level
zero_optimization.overlap_comm | true | true | Overlap communication with computation
zero_optimization.contiguous_gradients | true | true | Reduce memory fragmentation for gradients
stage3_prefetch_bucket_size | N/A | "auto" | Prefetch buffer size for ZeRO-3 parameter gathering
stage3_param_persistence_threshold | N/A | "auto" | Size threshold below which parameters stay replicated on all GPUs
stage3_gather_16bit_weights_on_model_save | N/A | true | Gather full 16-bit weights when saving checkpoints
bf16.enabled | "auto" | "auto" | BF16 mixed precision (resolved by the --bf16 CLI argument)
