Implementation: Haotian Liu LLaVA DeepSpeed ZeRO Configuration
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Configuration files for DeepSpeed ZeRO distributed training used by LLaVA's training pipeline. Two JSON configuration files control memory partitioning strategy, mixed precision settings, and communication optimization for the two-stage training process.
Description
LLaVA uses DeepSpeed ZeRO configurations to manage multi-GPU memory during training. Two config files are provided:
- zero2.json -- Used for Stage 1 pretraining (feature alignment). Enables optimizer state and gradient partitioning (ZeRO Stage 2), which is sufficient when only the ~30M-parameter projector is being trained.
- zero3.json -- Used for Stage 2 finetuning (visual instruction tuning). Enables full parameter partitioning (ZeRO Stage 3), necessary when the entire 13B-parameter LLM is unfrozen for end-to-end training.
Both configurations enable BF16 mixed precision via the "bf16" block with "enabled": "auto", which defers to the CLI argument --bf16 True. Batch sizes and gradient accumulation steps are set to "auto", inheriting values from HuggingFace Trainer CLI arguments.
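The "auto" resolution can be sketched in a few lines. The real logic lives inside the HuggingFace Transformers DeepSpeed integration; the function below is a simplified illustration of the idea, not the library implementation, and the key-to-argument mapping shown is an assumption for the two batch-related keys plus bf16:

```python
import json

def resolve_auto(ds_config, trainer_args):
    """Illustrative sketch: fill 'auto' placeholders in a DeepSpeed
    config from HuggingFace Trainer CLI arguments."""
    resolved = json.loads(json.dumps(ds_config))  # cheap deep copy
    # Hypothetical mapping: DeepSpeed key -> Trainer argument name
    mapping = {
        "train_micro_batch_size_per_gpu": "per_device_train_batch_size",
        "gradient_accumulation_steps": "gradient_accumulation_steps",
    }
    for ds_key, cli_key in mapping.items():
        if resolved.get(ds_key) == "auto":
            resolved[ds_key] = trainer_args[cli_key]
    if resolved.get("bf16", {}).get("enabled") == "auto":
        resolved["bf16"]["enabled"] = trainer_args["bf16"]
    return resolved

cfg = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
args = {"bf16": True, "per_device_train_batch_size": 16,
        "gradient_accumulation_steps": 1}
resolved = resolve_auto(cfg, args)
print(resolved)
```

With the example arguments above, every "auto" is replaced by the corresponding Trainer value before the config reaches the DeepSpeed engine.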
Usage
Pass the appropriate configuration file via the --deepspeed CLI argument when launching training:
# Stage 1: Feature alignment pretraining with ZeRO-2
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero2.json \
--tune_mm_mlp_adapter True \
...
# Stage 2: Visual instruction tuning with ZeRO-3
deepspeed llava/train/train_mem.py \
--deepspeed ./scripts/zero3.json \
...
Code Reference
Source Location
- Repository: https://github.com/haotian-liu/LLaVA
- File: scripts/zero2.json (Stage 1 pretraining config)
- File: scripts/zero3.json (Stage 2 finetuning config)
zero2.json Configuration
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 2,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
}
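Because all three batch-size keys are "auto", DeepSpeed derives them from the Trainer, but the invariant it enforces is fixed: the global train_batch_size must equal the per-GPU micro batch times gradient accumulation steps times the number of GPUs. A minimal sketch (the 32-per-GPU, 8-GPU numbers are illustrative, not taken from the configs above):

```python
def effective_batch_size(micro_batch: int, grad_accum: int, world_size: int) -> int:
    """DeepSpeed invariant:
    train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size."""
    return micro_batch * grad_accum * world_size

# e.g. 32 samples per GPU, no accumulation, 8 GPUs:
print(effective_batch_size(32, 1, 8))  # 256
```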
zero3.json Configuration
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
Import
N/A -- These are JSON configuration files, not Python modules. They are consumed by the DeepSpeed runtime via the --deepspeed CLI argument.
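Although the files are not importable, they are ordinary JSON and can be sanity-checked before launching a run. A hedged sketch (the helper name and the stand-in config written to a temp file are illustrative):

```python
import json
import tempfile

def check_zero_stage(path: str) -> int:
    """Load a DeepSpeed JSON config and return its ZeRO stage."""
    with open(path) as f:
        cfg = json.load(f)
    stage = cfg["zero_optimization"]["stage"]
    assert stage in (0, 1, 2, 3), f"invalid ZeRO stage: {stage}"
    return stage

# Demo on a minimal stand-in for scripts/zero2.json:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"zero_optimization": {"stage": 2}}, f)
stage = check_zero_stage(f.name)
print(stage)  # 2
```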
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| --deepspeed | str (CLI arg) | Path to the DeepSpeed JSON config file. Passed to the training script. |
| --bf16 True | bool (CLI arg) | Activates BF16 mixed precision. Resolves the "auto" setting in the JSON config. |
| --per_device_train_batch_size | int (CLI arg) | Per-GPU batch size. Resolves "train_micro_batch_size_per_gpu": "auto". |
| --gradient_accumulation_steps | int (CLI arg) | Number of gradient accumulation steps. Resolves "gradient_accumulation_steps": "auto". |
Outputs
| Name | Type | Description |
|---|---|---|
| Configured DeepSpeed engine | DeepSpeed Runtime | A fully initialized distributed training environment with the specified ZeRO stage, mixed precision, and communication optimizations. |
Key Configuration Parameters
| Parameter | zero2.json | zero3.json | Purpose |
|---|---|---|---|
| zero_optimization.stage | 2 | 3 | ZeRO partitioning level |
| zero_optimization.overlap_comm | true | true | Overlap communication with computation |
| zero_optimization.contiguous_gradients | true | true | Reduce memory fragmentation for gradients |
| stage3_prefetch_bucket_size | N/A | "auto" | Prefetch buffer for ZeRO-3 parameter gathering |
| stage3_param_persistence_threshold | N/A | "auto" | Small parameters kept on all GPUs |
| stage3_gather_16bit_weights_on_model_save | N/A | true | Gather full weights for checkpoint saving |
| bf16.enabled | "auto" | "auto" | BF16 mixed precision (resolved by CLI) |
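The ZeRO-3-only keys can also be listed programmatically by diffing the two zero_optimization blocks. The dicts below mirror the JSON configurations shown earlier in this page:

```python
# zero_optimization block from zero2.json
zero2_opt = {
    "stage": 2,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
}

# zero_optimization block from zero3.json: everything above, plus the
# stage-3 parameter-partitioning controls
zero3_opt = {
    **zero2_opt,
    "stage": 3,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": True,
}

# Keys present only in the ZeRO-3 config
stage3_only = sorted(set(zero3_opt) - set(zero2_opt))
print(stage3_only)
```

Running this prints the five stage3_* keys, which is exactly the delta summarized in the table above plus the two max-live/reuse limits.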