
Principle:Haotian Liu LLaVA Environment Setup

From Leeroopedia
Last Updated 2026-02-13 00:00 GMT

Overview

Technique for configuring distributed deep learning training environments with mixed precision and memory-optimized parallelism. In the context of LLaVA, environment setup encompasses DeepSpeed ZeRO stage selection, Flash Attention installation, and Python dependency management for multi-GPU vision-language model training.

Description

Environment setup in the context of LLaVA training involves configuring DeepSpeed ZeRO (Zero Redundancy Optimizer) stages for memory-efficient distributed training, installing Flash Attention 2 for optimized attention computation, and setting up the required Python dependencies. The two ZeRO stages used by LLaVA serve distinct purposes:

  • ZeRO Stage 2 partitions optimizer states and gradients across GPUs. This is used during Stage 1 pretraining, where only the lightweight multimodal projector (~30M parameters) is trained and memory pressure is relatively low.
  • ZeRO Stage 3 additionally partitions model parameters across GPUs. This is used during Stage 2 finetuning, where the full language model is unfrozen and GPU memory requirements increase substantially.

Both configurations enable BF16 mixed precision training, which halves memory usage for activations and parameters compared to FP32 while maintaining a wider dynamic range than FP16. The training scripts invoke DeepSpeed via the deepspeed launcher, which handles process spawning, gradient synchronization, and checkpoint management across GPUs.
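
As a concrete sketch, a ZeRO-2 + BF16 configuration of the kind described above can be written out as a JSON file for the deepspeed launcher. The key names below follow DeepSpeed's documented config schema, but the specific values and the file name `zero2.json` are illustrative assumptions, not LLaVA's shipped config:

```python
import json

# Minimal ZeRO-2 + BF16 DeepSpeed config (illustrative values).
zero2_config = {
    "bf16": {"enabled": True},          # BF16 mixed precision
    "zero_optimization": {
        "stage": 2,                     # partition optimizer states + gradients
        "overlap_comm": True,           # overlap reduce-scatter with backward pass
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
    "train_micro_batch_size_per_gpu": "auto",  # filled in by the HF Trainer
    "gradient_accumulation_steps": "auto",
}

with open("zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)

# Passed to the launcher along the lines of:
#   deepspeed --num_gpus 8 train.py --deepspeed zero2.json ...
```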

Key environment components include:

  • DeepSpeed -- distributed training framework providing ZeRO memory optimization
  • Flash Attention 2 -- fused CUDA kernel for memory-efficient attention (optional but recommended)
  • PyTorch -- with CUDA support and BF16 capability (Ampere GPUs or newer)
  • HuggingFace Transformers -- model loading and Trainer infrastructure
  • CLIP -- vision encoder; LLaVA-v1.5 loads the openai/clip-vit-large-patch14-336 checkpoint
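
A small pre-flight script can confirm these components before launching training. This is a hedged sketch: the import names (torch, deepspeed, transformers, flash_attn) are the packages' real module names and torch.cuda.is_bf16_supported() is a real PyTorch API, but what the script reports depends on your installation:

```python
import importlib.util

def check_environment() -> dict:
    """Report which training dependencies are importable."""
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "deepspeed", "transformers", "flash_attn")
    }
    if report["torch"]:
        import torch
        report["cuda"] = torch.cuda.is_available()
        # BF16 requires Ampere (compute capability 8.0) or newer
        report["bf16"] = report["cuda"] and torch.cuda.is_bf16_supported()
    return report

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12s} {'OK' if ok else 'missing'}")
```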

Usage

Use this principle when setting up multi-GPU training for LLaVA or similar vision-language models. The choice between ZeRO stages maps directly to the training phase:

  • ZeRO-2 is appropriate for Stage 1 pretraining where only the projector is trained. Since the LLM and vision encoder are frozen, optimizer states are small and ZeRO-2's partitioning of optimizer states and gradients is sufficient.
  • ZeRO-3 is needed for Stage 2 full finetuning where the entire LLM is unfrozen. The full parameter set of a 13B model requires ZeRO-3's parameter partitioning to fit across 8 GPUs.

A typical setup requires 8x A100 80GB GPUs for LLaVA-v1.5-13B training.
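
The phase-to-stage pairing above can be captured in a tiny helper. The function name and config file names here are hypothetical, chosen to mirror the convention of shipping one JSON config per ZeRO stage:

```python
def zero_config_for(phase: str) -> str:
    """Map a LLaVA training phase to a ZeRO config file (names assumed)."""
    configs = {
        "pretrain": "zero2.json",  # Stage 1: projector only, ZeRO-2 suffices
        "finetune": "zero3.json",  # Stage 2: full LLM unfrozen, needs ZeRO-3
    }
    try:
        return configs[phase]
    except KeyError:
        raise ValueError(f"unknown training phase: {phase!r}") from None
```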

Theoretical Basis

ZeRO eliminates memory redundancy in data parallelism by partitioning different categories of model state across data-parallel processes:

  • ZeRO Stage 1 -- partitions optimizer states (e.g., Adam's first and second moments)
  • ZeRO Stage 2 -- additionally partitions gradients during the backward pass
  • ZeRO Stage 3 -- additionally partitions model parameters themselves

This reduces per-GPU memory footprint from O(model_size) to O(model_size / num_gpus) for the partitioned components. The trade-off is increased communication volume: ZeRO-3 requires all-gather operations to reconstruct parameters for both forward and backward passes, while ZeRO-2 only requires reduce-scatter for gradients.

The memory consumption for a model with P parameters in mixed precision training can be approximated as:

Standard Data Parallelism:
    Memory per GPU = 2P (parameters) + 2P (gradients) + 12P (optimizer, Adam FP32)
                   = 16P bytes for a BF16 model with Adam optimizer

ZeRO Stage 2 (N GPUs):
    Memory per GPU = 2P (parameters) + (2P + 12P) / N
                   = 2P + 14P/N

ZeRO Stage 3 (N GPUs):
    Memory per GPU = (2P + 2P + 12P) / N
                   = 16P/N

For LLaVA-v1.5-13B with P ~= 13 billion and N = 8 GPUs, ZeRO-3 reduces per-GPU memory from ~208 GB (standard) to ~26 GB (partitioned), making training feasible on 80 GB A100 GPUs.
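
The arithmetic above can be checked in a few lines of Python, using 2-byte BF16 parameters and gradients, 12 bytes per parameter of FP32 Adam state, and decimal gigabytes:

```python
GB = 1e9

def mem_standard(p):   # standard data parallelism: full replica per GPU
    return 16 * p

def mem_zero2(p, n):   # optimizer states + gradients partitioned over n GPUs
    return 2 * p + 14 * p / n

def mem_zero3(p, n):   # parameters, gradients, optimizer all partitioned
    return 16 * p / n

p, n = 13e9, 8         # LLaVA-v1.5-13B on 8 GPUs
print(f"standard: {mem_standard(p) / GB:.0f} GB")   # ~208 GB
print(f"ZeRO-2:   {mem_zero2(p, n) / GB:.0f} GB")   # ~49 GB
print(f"ZeRO-3:   {mem_zero3(p, n) / GB:.0f} GB")   # ~26 GB
```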
