Environment:Haotian Liu LLaVA Python CUDA Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Computer_Vision |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Linux (Ubuntu) environment with CUDA-capable GPU, Python 3.8+, PyTorch 2.1.2, Transformers 4.37.2, and DeepSpeed 0.12.6 for LLaVA multimodal model training and inference.
Description
This environment provides the full software stack required to train and run LLaVA (Large Language and Vision Assistant) models. It centers on PyTorch with CUDA support for GPU acceleration, the HuggingFace Transformers ecosystem for model management, and DeepSpeed for distributed training with ZeRO memory optimization. The vision component relies on the CLIP encoder `openai/clip-vit-large-patch14-336`, loaded through the `transformers` library (`timm` is a pinned supporting dependency). Quantized training (4-bit/8-bit) requires `bitsandbytes`, with `peft` providing LoRA/QLoRA support.
Usage
Use this environment for all training workflows (pretraining, finetuning, LoRA, QLoRA), all inference workflows (CLI, web demo, model worker), and all evaluation workflows (VQA, benchmark scoring). This is the primary environment for the entire LLaVA codebase.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | macOS and Windows have limited support (16-bit inference only, no quantization) |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM recommended for 7B models; 40GB+ for 13B full finetuning |
| Hardware | Multi-GPU setup for distributed training | V1.5 scripts use 8x GPUs with DeepSpeed |
| Disk | 100GB+ SSD | Model weights (~26GB for 13B), datasets, and checkpoints |
| Python | >= 3.8 | 3.10 recommended per conda setup instructions |
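The Python and GPU requirements above can be checked programmatically before launching a job. The helper below is an illustrative sketch, not part of the LLaVA codebase; the GPU check degrades gracefully when `torch` or CUDA is unavailable.

```python
import sys

MIN_PYTHON = (3, 8)  # matches the >= 3.8 requirement in the table above

def meets_python_requirement(version_info=None, minimum=MIN_PYTHON):
    """Return True if the interpreter satisfies the minimum Python version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum

def gpu_summary():
    """Best-effort GPU report; returns None when torch or CUDA is unavailable."""
    try:
        import torch
        if not torch.cuda.is_available():
            return None
        major, minor = torch.cuda.get_device_capability()
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        return {"compute_capability": f"{major}.{minor}", "vram_gb": round(vram_gb, 1)}
    except ImportError:
        return None
```

Running `meets_python_requirement()` at script start gives a clearer failure than a late `SyntaxError` or import failure from an unsupported interpreter.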
Dependencies
System Packages
- CUDA Toolkit (compatible with PyTorch 2.1.2)
- `git-lfs` (for downloading model weights from HuggingFace Hub)
Python Packages (Core)
- `torch` == 2.1.2
- `torchvision` == 0.16.2
- `transformers` == 4.37.2
- `tokenizers` == 0.15.1
- `sentencepiece` == 0.1.99
- `accelerate` == 0.21.0
- `peft` (any compatible version, for LoRA/QLoRA)
- `bitsandbytes` (for 4-bit/8-bit quantization)
- `einops` == 0.6.1
- `einops-exts` == 0.0.4
- `timm` == 0.6.13
- `scikit-learn` == 1.2.2
Python Packages (Serving)
- `gradio` == 4.16.0
- `gradio_client` == 0.8.1
- `fastapi`
- `uvicorn`
- `requests`
- `httpx` == 0.24.0
- `pydantic`
- `markdown2[all]`
- `shortuuid`
- `numpy`
Python Packages (Training)
- `deepspeed` == 0.12.6
- `ninja`
- `wandb` (for experiment tracking)
Optional Packages
- `flash_attn` (for Flash Attention memory-efficient training via `train_mem.py`)
- `xformers` (for xformers attention optimization via `train_xformers.py`)
- `ray` (for parallel GPT-4 evaluation)
- `openai` (for GPT-4-based evaluation scoring)
Credentials
The following environment variables may be required depending on workflow:
- `OPENAI_API_KEY`: Required for GPT-4-based evaluation scoring (`eval_gpt_review.py`, `eval_gpt_review_bench.py`, `eval_gpt_review_visual.py`)
- `WANDB_API_KEY`: Required when `--report_to wandb` is set in training scripts
- HuggingFace model access: Model weights are loaded from HuggingFace Hub; gated models may require `HF_TOKEN`
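A small preflight check can surface missing credentials before a long-running job fails midway. This sketch is illustrative rather than a repository script; the variable names match the list above, and the workflow keys are hypothetical labels.

```python
import os

# Environment variables per workflow (names taken from the list above;
# the workflow keys themselves are illustrative)
WORKFLOW_CREDENTIALS = {
    "gpt4_eval": ("OPENAI_API_KEY",),
    "wandb_logging": ("WANDB_API_KEY",),
    "gated_models": ("HF_TOKEN",),
}

def missing_credentials(workflow, env=None):
    """Return names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    required = WORKFLOW_CREDENTIALS.get(workflow, ())
    return [name for name in required if not env.get(name)]
```

Calling `missing_credentials("gpt4_eval")` before kicking off GPT-4 review scoring avoids wasting a run that would fail at the first API call.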
Quick Install
# Create conda environment
conda create -n llava python=3.10 -y
conda activate llava
# Install core package
pip install --upgrade pip
pip install -e .
# Install training dependencies
pip install -e ".[train]"
# Optional: Flash Attention 2 (requires CUDA)
pip install flash-attn --no-build-isolation
# Optional: xformers
pip install xformers
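After installation, the pinned versions can be verified against what pip actually resolved. The helper below is a hypothetical convenience, not a repository script; the pins are copied from the Code Evidence section, and `importlib.metadata` is standard library on Python 3.8+.

```python
from importlib import metadata

# Pins copied from pyproject.toml (core) and the [train] extra
PINNED = {
    "torch": "2.1.2",
    "transformers": "4.37.2",
    "tokenizers": "0.15.1",
    "deepspeed": "0.12.6",
}

def version_mismatches(pinned, installed=None):
    """Return {package: (expected, found)} for every pin that does not match."""
    if installed is None:
        installed = {}
        for name in pinned:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = None
    return {n: (v, installed.get(n)) for n, v in pinned.items()
            if installed.get(n) != v}
```

An empty dict from `version_mismatches(PINNED)` means the environment matches the pins; any entry points at a package to reinstall.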
Code Evidence
Pinned dependency versions from `pyproject.toml:15-23`:
dependencies = [
"torch==2.1.2", "torchvision==0.16.2",
"transformers==4.37.2", "tokenizers==0.15.1", "sentencepiece==0.1.99", "shortuuid",
"accelerate==0.21.0", "peft", "bitsandbytes",
"pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
"gradio==4.16.0", "gradio_client==0.8.1",
"requests", "httpx==0.24.0", "uvicorn", "fastapi",
"einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
]
Training extras from `pyproject.toml:25-27`:
[project.optional-dependencies]
train = ["deepspeed==0.12.6", "ninja", "wandb"]
Tokenizer version check from `train.py:49-50`:
from packaging import version
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse('0.14')
Flash Attention 2 integration from `train_mem.py:1-4`:
from llava.train.train import train
if __name__ == "__main__":
train(attn_implementation="flash_attention_2")
GPU requirement for Flash Attention from `llama_flash_attn_monkey_patch.py:106-111`:
def replace_llama_attn_with_flash_attn():
cuda_major, cuda_minor = torch.cuda.get_device_capability()
if cuda_major < 8:
warnings.warn(
"Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward."
)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: flash_attn not found` | flash-attn package not installed | `pip install flash-attn --no-build-isolation` |
| `ImportError: xformers not found` | xformers package not installed | `pip install xformers` |
| `CUDA out of memory` | Insufficient GPU VRAM for model size | Use `--load-4bit` or `--load-8bit` for inference; use LoRA/QLoRA for training |
| Flash attention warning on non-A100 GPU | GPU compute capability < 8.0 | Use A100 or H100 GPU, or use standard attention (`train.py` instead of `train_mem.py`) |
| Tokenization mismatch warning | Tokenizer version incompatibility | Ensure `tokenizers==0.15.1` as pinned in pyproject.toml |
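For the `CUDA out of memory` row above, the choice between full-precision and quantized loading can be expressed as a simple rule of thumb. The thresholds below are heuristics derived from the System Requirements table, not values from the codebase.

```python
def suggest_inference_flags(vram_gb):
    """Heuristic mapping from available VRAM to CLI loading flags.

    Illustrative thresholds: ~16 GB handles a 7B model in 16-bit;
    below that, 8-bit or 4-bit quantization trades speed for memory.
    """
    if vram_gb >= 16:
        return []              # 16-bit inference, no quantization flag
    if vram_gb >= 10:
        return ["--load-8bit"]
    return ["--load-4bit"]
```

The returned flags correspond to the `--load-8bit` / `--load-4bit` options referenced in the table; an empty list means no quantization is needed.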
Compatibility Notes
- macOS: Only 16-bit inference supported. Specify `--device mps` for Metal acceleration. Must uninstall `bitsandbytes` (`pip uninstall bitsandbytes`). Quantization (4-bit, 8-bit) is NOT supported.
- Windows: Only 16-bit inference supported. Must use CUDA 11.7 PyTorch builds. Must uninstall `bitsandbytes`. Quantization NOT supported. WSL2 recommended for full support.
- Intel: Experimental support for Intel GPU Max Series and Sapphire Rapids CPUs via Intel Extension for PyTorch. See the `intel` branch.
- Training scripts: V1 scripts (`scripts/pretrain.sh`, `scripts/finetune.sh`) are for original LLaVA. V1.5 scripts are in `scripts/v1_5/`. Do not mix them.
- DeepSpeed: ZeRO-2 (`zero2.json`) used for pretraining and LoRA; ZeRO-3 (`zero3.json`) used for full finetuning. ZeRO-3 with CPU offload available for memory-constrained setups.
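A minimal ZeRO-2 configuration in the spirit of `scripts/zero2.json` looks like the following. Treat this as an illustrative sketch rather than the repository's exact file; the `"auto"` placeholders are resolved by the HuggingFace trainer integration at launch time.

```json
{
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Switching `"stage"` to 3 (as in `zero3.json`) additionally partitions the model parameters themselves, which is what makes full finetuning of the 13B model feasible across 8 GPUs.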
Related Pages
- Implementation:Haotian_liu_LLaVA_Train_Stage1_Pretrain
- Implementation:Haotian_liu_LLaVA_LLaVATrainer_Train
- Implementation:Haotian_liu_LLaVA_Train_With_LoRA
- Implementation:Haotian_liu_LLaVA_Load_Pretrained_Model
- Implementation:Haotian_liu_LLaVA_ModelWorker_Class
- Implementation:Haotian_liu_LLaVA_CLI_Main
- Implementation:Haotian_liu_LLaVA_Build_Demo_Gradio
- Implementation:Haotian_liu_LLaVA_Model_VQA_Loader_Eval
- Implementation:Haotian_liu_LLaVA_DeepSpeed_ZeRO_Configuration
- Implementation:Haotian_liu_LLaVA_Process_Images
- Implementation:Haotian_liu_LLaVA_Model_Generate_Multimodal