Environment:Haotian Liu LLaVA Python CUDA Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Linux (Ubuntu) environment with CUDA-capable GPU, Python 3.8+, PyTorch 2.1.2, Transformers 4.37.2, and DeepSpeed 0.12.6 for LLaVA multimodal model training and inference.
Description
This environment provides the full software stack required to train and run LLaVA (Large Language and Vision Assistant) models. It centers on PyTorch with CUDA support for GPU acceleration, the HuggingFace Transformers ecosystem for model management, and DeepSpeed for distributed training with ZeRO memory optimization. The vision component relies on the CLIP encoder `openai/clip-vit-large-patch14-336`, loaded through the `transformers` library (`timm` is a pinned supporting dependency). Quantized training (4-bit/8-bit) requires `bitsandbytes`, with `peft` providing LoRA/QLoRA support.
Usage
Use this environment for all training workflows (pretraining, finetuning, LoRA, QLoRA), all inference workflows (CLI, web demo, model worker), and all evaluation workflows (VQA, benchmark scoring). This is the primary environment for the entire LLaVA codebase.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | macOS and Windows have limited support (16-bit inference only, no quantization) |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM recommended for 7B models; 40GB+ for 13B full finetuning |
| Hardware | Multi-GPU setup for distributed training | V1.5 scripts use 8x GPUs with DeepSpeed |
| Disk | 100GB+ SSD | Model weights (~26GB for 13B), datasets, and checkpoints |
| Python | >= 3.8 | 3.10 recommended per conda setup instructions |
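The Python and GPU requirements above can be checked programmatically before launching a job. The helper below is an illustrative sketch, not part of the LLaVA codebase; the GPU check degrades gracefully when `torch` or CUDA is unavailable.

```python
import sys

MIN_PYTHON = (3, 8)  # matches the >= 3.8 requirement in the table above

def meets_python_requirement(version_info=None, minimum=MIN_PYTHON):
    """Return True if the interpreter satisfies the minimum Python version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum

def gpu_summary():
    """Best-effort GPU report; returns None when torch or CUDA is unavailable."""
    try:
        import torch
        if not torch.cuda.is_available():
            return None
        major, minor = torch.cuda.get_device_capability()
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        return {"compute_capability": f"{major}.{minor}", "vram_gb": round(vram_gb, 1)}
    except ImportError:
        return None
```

Running `meets_python_requirement()` at script start gives a clearer failure than a late `SyntaxError` or import failure from an unsupported interpreter.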
Dependencies
System Packages
- CUDA Toolkit (compatible with PyTorch 2.1.2)
- `git-lfs` (for downloading model weights from HuggingFace Hub)
Python Packages (Core)
- `torch` == 2.1.2
- `torchvision` == 0.16.2
- `transformers` == 4.37.2
- `tokenizers` == 0.15.1
- `sentencepiece` == 0.1.99
- `accelerate` == 0.21.0
- `peft` (any compatible version, for LoRA/QLoRA)
- `bitsandbytes` (for 4-bit/8-bit quantization)
- `einops` == 0.6.1
- `einops-exts` == 0.0.4
- `timm` == 0.6.13
- `scikit-learn` == 1.2.2
Python Packages (Serving)
- `gradio` == 4.16.0
- `gradio_client` == 0.8.1
- `fastapi`
- `uvicorn`
- `requests`
- `httpx` == 0.24.0
- `pydantic`
- `markdown2[all]`
- `shortuuid`
- `numpy`
Python Packages (Training)
- `deepspeed` == 0.12.6
- `ninja`
- `wandb` (for experiment tracking)
Optional Packages
- `flash_attn` (for Flash Attention memory-efficient training via `train_mem.py`)
- `xformers` (for xformers attention optimization via `train_xformers.py`)
- `ray` (for parallel GPT-4 evaluation)
- `openai` (for GPT-4-based evaluation scoring)
Credentials
The following environment variables may be required depending on workflow:
- `OPENAI_API_KEY`: Required for GPT-4-based evaluation scoring (`eval_gpt_review.py`, `eval_gpt_review_bench.py`, `eval_gpt_review_visual.py`)
- `WANDB_API_KEY`: Required when `--report_to wandb` is set in training scripts
- HuggingFace model access: Model weights are loaded from HuggingFace Hub; gated models may require `HF_TOKEN`
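A small preflight check can surface missing credentials before a long-running job fails midway. This sketch is illustrative rather than a repository script; the variable names match the list above, and the workflow keys are hypothetical labels.

```python
import os

# Environment variables per workflow (names taken from the list above;
# the workflow keys themselves are illustrative)
WORKFLOW_CREDENTIALS = {
    "gpt4_eval": ("OPENAI_API_KEY",),
    "wandb_logging": ("WANDB_API_KEY",),
    "gated_models": ("HF_TOKEN",),
}

def missing_credentials(workflow, env=None):
    """Return names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    required = WORKFLOW_CREDENTIALS.get(workflow, ())
    return [name for name in required if not env.get(name)]
```

Calling `missing_credentials("gpt4_eval")` before kicking off GPT-4 review scoring avoids wasting a run that would fail at the first API call.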
Quick Install
# Create conda environment
conda create -n llava python=3.10 -y
conda activate llava
# Install core package
pip install --upgrade pip
pip install -e .
# Install training dependencies
pip install -e ".[train]"
# Optional: Flash Attention 2 (requires CUDA)
pip install flash-attn --no-build-isolation
# Optional: xformers
pip install xformers
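After installation, the pinned versions can be verified against what pip actually resolved. The helper below is a hypothetical convenience, not a repository script; the pins are copied from the Code Evidence section, and `importlib.metadata` is standard library on Python 3.8+.

```python
from importlib import metadata

# Pins copied from pyproject.toml (core) and the [train] extra
PINNED = {
    "torch": "2.1.2",
    "transformers": "4.37.2",
    "tokenizers": "0.15.1",
    "deepspeed": "0.12.6",
}

def version_mismatches(pinned, installed=None):
    """Return {package: (expected, found)} for every pin that does not match."""
    if installed is None:
        installed = {}
        for name in pinned:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = None
    return {n: (v, installed.get(n)) for n, v in pinned.items()
            if installed.get(n) != v}
```

An empty dict from `version_mismatches(PINNED)` means the environment matches the pins; any entry points at a package to reinstall.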
Code Evidence
Pinned dependency versions from `pyproject.toml:15-23`:
dependencies = [
"torch==2.1.2", "torchvision==0.16.2",
"transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",
"accelerate==0.21.0", "peft", "bitsandbytes",
"pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
"gradio==4.16.0", "gradio_client==0.8.1",
"requests", "httpx==0.24.0", "uvicorn", "fastapi",
"einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
]
Training extras from `pyproject.toml:25-27`:
[project.optional-dependencies]
train = ["deepspeed==0.12.6", "ninja", "wandb"]
Tokenizer version check from `train.py:49-50`:
from packaging import version
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse('0.14')
Flash Attention 2 integration from `train_mem.py:1-4`:
from llava.train.train import train
if __name__ == "__main__":
train(attn_implementation="flash_attention_2")
GPU requirement for Flash Attention from `llama_flash_attn_monkey_patch.py:106-111`:
def replace_llama_attn_with_flash_attn():
cuda_major, cuda_minor = torch.cuda.get_device_capability()
if cuda_major < 8:
warnings.warn(
"Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward."
)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: flash_attn not found` | flash-attn package not installed | `pip install flash-attn --no-build-isolation` |
| `ImportError: xformers not found` | xformers package not installed | `pip install xformers` |
| `CUDA out of memory` | Insufficient GPU VRAM for model size | Use `--load-4bit` or `--load-8bit` for inference; use LoRA/QLoRA for training |
| Flash attention warning on non-A100 GPU | GPU compute capability < 8.0 | Use A100 or H100 GPU, or use standard attention (`train.py` instead of `train_mem.py`) |
| Tokenization mismatch warning | Tokenizer version incompatibility | Ensure `tokenizers==0.15.1` as pinned in pyproject.toml |
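For the `CUDA out of memory` row above, the choice between full-precision and quantized loading can be expressed as a simple rule of thumb. The thresholds below are heuristics derived from the System Requirements table, not values from the codebase.

```python
def suggest_inference_flags(vram_gb):
    """Heuristic mapping from available VRAM to CLI loading flags.

    Illustrative thresholds: ~16 GB handles a 7B model in 16-bit;
    below that, 8-bit or 4-bit quantization trades speed for memory.
    """
    if vram_gb >= 16:
        return []              # 16-bit inference, no quantization flag
    if vram_gb >= 10:
        return ["--load-8bit"]
    return ["--load-4bit"]
```

The returned flags correspond to the `--load-8bit` / `--load-4bit` options referenced in the table; an empty list means no quantization is needed.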
Compatibility Notes
- macOS: Only 16-bit inference supported. Specify `--device mps` for Metal acceleration. Must uninstall `bitsandbytes` (`pip uninstall bitsandbytes`). Quantization (4-bit, 8-bit) is NOT supported.
- Windows: Only 16-bit inference supported. Must use CUDA 11.7 PyTorch builds. Must uninstall `bitsandbytes`. Quantization NOT supported. WSL2 recommended for full support.
- Intel: Experimental support for Intel GPU Max Series and Sapphire Rapids CPUs via Intel Extension for PyTorch. See the `intel` branch.
- Training scripts: V1 scripts (`scripts/pretrain.sh`, `scripts/finetune.sh`) are for original LLaVA. V1.5 scripts are in `scripts/v1_5/`. Do not mix them.
- DeepSpeed: ZeRO-2 (`zero2.json`) used for pretraining and LoRA; ZeRO-3 (`zero3.json`) used for full finetuning. ZeRO-3 with CPU offload available for memory-constrained setups.
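A minimal ZeRO-2 configuration in the spirit of `scripts/zero2.json` looks like the following. Treat this as an illustrative sketch rather than the repository's exact file; the `"auto"` placeholders are resolved by the HuggingFace trainer integration at launch time.

```json
{
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Switching `"stage"` to 3 (as in `zero3.json`) additionally partitions the model parameters themselves, which is what makes full finetuning of the 13B model feasible across 8 GPUs.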
Related Pages
- Implementation:Haotian_liu_LLaVA_Train_Stage1_Pretrain
- Implementation:Haotian_liu_LLaVA_LLaVATrainer_Train
- Implementation:Haotian_liu_LLaVA_Train_With_LoRA
- Implementation:Haotian_liu_LLaVA_Load_Pretrained_Model
- Implementation:Haotian_liu_LLaVA_ModelWorker_Class
- Implementation:Haotian_liu_LLaVA_CLI_Main
- Implementation:Haotian_liu_LLaVA_Build_Demo_Gradio
- Implementation:Haotian_liu_LLaVA_Model_VQA_Loader_Eval
- Implementation:Haotian_liu_LLaVA_DeepSpeed_ZeRO_Configuration
- Implementation:Haotian_liu_LLaVA_Process_Images
- Implementation:Haotian_liu_LLaVA_Model_Generate_Multimodal