Implementation: Haotian Liu LLaVA DeepSpeed ZeRO Configuration
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Configuration files for DeepSpeed ZeRO distributed training used by LLaVA's training pipeline. Two JSON configuration files control memory partitioning strategy, mixed precision settings, and communication optimization for the two-stage training process.
Description
LLaVA uses DeepSpeed ZeRO configurations to manage multi-GPU memory during training. Two config files are provided:
- zero2.json -- Used for Stage 1 pretraining (feature alignment). Enables optimizer state and gradient partitioning (ZeRO Stage 2), which is sufficient when only the ~30M-parameter projector is being trained.
- zero3.json -- Used for Stage 2 finetuning (visual instruction tuning). Enables full parameter partitioning (ZeRO Stage 3), necessary when the entire 13B-parameter LLM is unfrozen for end-to-end training.
Both configurations enable BF16 mixed precision via the "bf16" block with "enabled": "auto", which defers to the CLI argument --bf16 True. Batch sizes and gradient accumulation steps are set to "auto", inheriting values from HuggingFace Trainer CLI arguments.
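The "auto" resolution can be sketched in a few lines. The real logic lives inside the HuggingFace Transformers DeepSpeed integration; the function below is a simplified illustration of the idea, not the library implementation, and the key-to-argument mapping shown is an assumption for the two batch-related keys plus bf16:

```python
import json

def resolve_auto(ds_config, trainer_args):
    """Illustrative sketch: fill 'auto' placeholders in a DeepSpeed
    config from HuggingFace Trainer CLI arguments."""
    resolved = json.loads(json.dumps(ds_config))  # cheap deep copy
    # Hypothetical mapping: DeepSpeed key -> Trainer argument name
    mapping = {
        "train_micro_batch_size_per_gpu": "per_device_train_batch_size",
        "gradient_accumulation_steps": "gradient_accumulation_steps",
    }
    for ds_key, cli_key in mapping.items():
        if resolved.get(ds_key) == "auto":
            resolved[ds_key] = trainer_args[cli_key]
    if resolved.get("bf16", {}).get("enabled") == "auto":
        resolved["bf16"]["enabled"] = trainer_args["bf16"]
    return resolved

cfg = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
args = {"bf16": True, "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 1}
resolved = resolve_auto(cfg, args)
print(resolved)
```

With the example arguments above, every "auto" is replaced by the corresponding Trainer value before the config reaches the DeepSpeed engine.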
Usage
Pass the appropriate configuration file via the --deepspeed CLI argument when launching training:
# Stage 1: Feature alignment pretraining with ZeRO-2
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--tune_mm_mlp_adapter True \
...
# Stage 2: Visual instruction tuning with ZeRO-3
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
...
Code Reference
Source Location
- Repository: https://github.com/haotian-liu/LLaVA
- File: scripts/zero2.json (Stage 1 pretraining config)
- File: scripts/zero3.json (Stage 2 finetuning config)
zero2.json Configuration
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 2,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
}
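Because all three batch-size keys are "auto", DeepSpeed derives them from the Trainer, but the invariant it enforces is fixed: the global train_batch_size must equal the per-GPU micro batch times gradient accumulation steps times the number of GPUs. A minimal sketch (the 32-per-GPU, 8-GPU numbers are illustrative, not taken from the configs above):

```python
def effective_batch_size(micro_batch: int, grad_accum: int, world_size: int) -> int:
    """DeepSpeed invariant:
    train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size."""
    return micro_batch * grad_accum * world_size

# e.g. 32 samples per GPU, no accumulation, 8 GPUs:
print(effective_batch_size(32, 1, 8))  # 256
```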
zero3.json Configuration
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
Import
N/A -- These are JSON configuration files, not Python modules. They are consumed by the DeepSpeed runtime via the --deepspeed CLI argument.
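Although the files are not importable, they are ordinary JSON and can be sanity-checked before launching a run. A hedged sketch (the helper name and the stand-in config written to a temp file are illustrative):

```python
import json
import tempfile

def check_zero_stage(path: str) -> int:
    """Load a DeepSpeed JSON config and return its ZeRO stage."""
    with open(path) as f:
        cfg = json.load(f)
    stage = cfg["zero_optimization"]["stage"]
    assert stage in (0, 1, 2, 3), f"invalid ZeRO stage: {stage}"
    return stage

# Demo on a minimal stand-in for scripts/zero2.json:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"zero_optimization": {"stage": 2}}, f)
stage = check_zero_stage(f.name)
print(stage)  # 2
```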
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| --deepspeed | str (CLI arg) | Path to the DeepSpeed JSON config file. Passed to the training script. |
| --bf16 True | bool (CLI arg) | Activates BF16 mixed precision. Resolves the "auto" setting in the JSON config. |
| --per_device_train_batch_size | int (CLI arg) | Per-GPU batch size. Resolves "train_micro_batch_size_per_gpu": "auto". |
| --gradient_accumulation_steps | int (CLI arg) | Number of gradient accumulation steps. Resolves "gradient_accumulation_steps": "auto". |
Outputs
| Name | Type | Description |
|---|---|---|
| Configured DeepSpeed engine | DeepSpeed Runtime | A fully initialized distributed training environment with the specified ZeRO stage, mixed precision, and communication optimizations. |
Key Configuration Parameters
| Parameter | zero2.json | zero3.json | Purpose |
|---|---|---|---|
| zero_optimization.stage | 2 | 3 | ZeRO partitioning level |
| zero_optimization.overlap_comm | true | true | Overlap communication with computation |
| zero_optimization.contiguous_gradients | true | true | Reduce memory fragmentation for gradients |
| stage3_prefetch_bucket_size | N/A | "auto" | Prefetch buffer for ZeRO-3 parameter gathering |
| stage3_param_persistence_threshold | N/A | "auto" | Small parameters kept on all GPUs |
| stage3_gather_16bit_weights_on_model_save | N/A | true | Gather full weights for checkpoint saving |
| bf16.enabled | "auto" | "auto" | BF16 mixed precision (resolved by CLI) |
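The ZeRO-3-only keys can also be listed programmatically by diffing the two zero_optimization blocks. The dicts below mirror the JSON configurations shown earlier in this page:

```python
# zero_optimization block from zero2.json
zero2_opt = {
    "stage": 2,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
}

# zero_optimization block from zero3.json: everything above, plus the
# stage-3 parameter-partitioning controls
zero3_opt = {
    **zero2_opt,
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": True,
}

# Keys present only in the ZeRO-3 config
stage3_only = sorted(set(zero3_opt) - set(zero2_opt))
print(stage3_only)
```

Running this prints the five stage3_* keys, which is exactly the delta summarized in the table above plus the two max-live/reuse limits.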