Environment:Microsoft DeepSpeedExamples VisualChat Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Multimodal, Computer_Vision, Infrastructure |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Linux environment with DeepSpeed >= 0.10.3, transformers == 4.33.3 (pinned), OpenCLIP, and multi-GPU NVIDIA hardware for training vision-language models combining QWen-VL encoders with LLaMA-2 decoders.
Description
This environment supports the DeepSpeed-VisualChat multimodal training pipeline, which composes a vision encoder (QWen-VL CLIP), a vision projection layer (ViT linear or perceiver), and a language model decoder (LLaMA-2 7B/13B/70B). Training uses ZeRO Stage 2/3 with LoRA for parameter-efficient fine-tuning. The environment requires a specific pinned version of transformers (4.33.3) and multiple dataset-specific dependencies for VQA, captioning, and dialogue datasets. Multi-GPU training is required for the 70B decoder.
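The three-stage composition described above can be sketched in plain Python. All three components here are toy stand-ins with hypothetical names; the real pipeline uses the QWen-VL CLIP encoder, a learned vision projection module, and a LLaMA-2 decoder:

```python
# Minimal sketch of the encoder -> projection -> decoder wiring in
# DeepSpeed-VisualChat. Every function here is a toy stand-in, not the
# repo's actual API.

def vision_encoder(image):
    # Stand-in: map an "image" to a feature vector.
    return [float(x) for x in image]

def vision_projection(features, decoder_dim):
    # Stand-in: project vision features into the decoder's embedding
    # space. The real projection is a learned module.
    scale = decoder_dim / max(len(features), 1)
    return [f * scale for f in features][:decoder_dim]

def language_decoder(vision_tokens, text_tokens):
    # Stand-in: the decoder consumes projected vision tokens
    # interleaved with text tokens.
    return len(vision_tokens) + len(text_tokens)

image = [1.0, 2.0, 3.0, 4.0]
vision_tokens = vision_projection(vision_encoder(image), decoder_dim=4)
output_len = language_decoder(vision_tokens, text_tokens=["<s>", "Hi"])
```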
Usage
Use this environment for any multimodal vision-language training workflow using DeepSpeed-VisualChat. It is the mandatory prerequisite for the Extract_Qwen_VL, VisProjection, Create_DSVL_Model, Build_Dataset, DeepSpeed_Initialize_VisualChat, and Fuse_LoRA implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ubuntu 20.04+ recommended |
| Hardware (7B decoder) | 1x NVIDIA A100-40GB | Single GPU possible with small batch size |
| Hardware (70B decoder) | 8x NVIDIA A100-80GB | Multi-GPU required for large decoder models |
| CPU RAM | 64GB+ | For dataset loading and preprocessing |
| Disk | 100GB+ SSD | For COCO images (18GB+), model weights, and other datasets |
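A rough back-of-envelope check of why the 70B decoder needs 8 GPUs. The assumptions here are not from the source: fp16 weights at 2 bytes per parameter, ZeRO Stage 3 partitioning weights evenly across ranks, and the estimate ignores activations, gradients, optimizer state, and the vision encoder:

```python
# Back-of-envelope memory estimate for a 70B-parameter decoder in fp16.
# Assumed: 2 bytes/parameter, ZeRO-3 partitions weights evenly across
# GPUs; activations, gradients, and optimizer state are excluded.
params = 70e9
bytes_per_param = 2          # fp16
gpus = 8
total_gb = params * bytes_per_param / 1024**3   # ~130 GB of weights
per_gpu_gb = total_gb / gpus                    # ~16 GB per GPU
```

Weights alone fit comfortably in 8x A100-80GB; the remaining budget goes to activations, gradients, optimizer state, and communication buffers, which is why ZeRO Stage 3 and LoRA are still needed on top of the partitioning.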
Dependencies
System Packages
- CUDA Toolkit (11.x or 12.x)
- NCCL (for multi-GPU distributed training)
Python Packages
- `deepspeed` >= 0.10.3
- `transformers` == 4.33.3 (pinned version)
- `datasets` >= 2.8.0
- `sentencepiece` >= 0.1.97
- `protobuf` == 3.20.3
- `accelerate` >= 0.15.0
- `open_clip_torch`
- `einops`
- `einops_exts`
- `transformers_stream_generator`
- `termcolor`
Environment Variables and Credentials
- `CUDA_VISIBLE_DEVICES`: Set to control which GPUs are used (e.g., `0,1,2,3,4,5,6,7`)
Downloading some datasets (COCO, A-OKVQA, etc.) may require registered accounts.
Quick Install
# Install all required packages
pip install "deepspeed>=0.10.3" "transformers==4.33.3" "datasets>=2.8.0" \
"sentencepiece>=0.1.97" "protobuf==3.20.3" "accelerate>=0.15.0" \
open_clip_torch einops einops_exts transformers_stream_generator termcolor
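Because the transformers pin is exact while other pins are minimums, a quick sanity check of installed versions can catch mismatches before training. The helper below is hypothetical (not part of the repo) and uses only string parsing, so it runs without any of the packages installed:

```python
# Hypothetical helper (not part of DeepSpeed-VisualChat): verify installed
# package versions against pins like "transformers==4.33.3" (exact) or
# "deepspeed>=0.10.3" (minimum).

def parse_version(v):
    return tuple(int(p) for p in v.split("."))

def satisfies(installed, spec):
    for op in ("==", ">="):
        if op in spec:
            name, required = spec.split(op)
            break
    else:
        return True  # unpinned package: any version accepted
    if op == "==":
        return installed == required
    # Tuple comparison handles e.g. 0.12.0 >= 0.10.3 correctly,
    # where a plain string comparison could mis-order components.
    return parse_version(installed) >= parse_version(required)

ok_exact = satisfies("4.33.3", "transformers==4.33.3")
bad_exact = satisfies("4.34.0", "transformers==4.33.3")
ok_min = satisfies("0.12.0", "deepspeed>=0.10.3")
```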
Code Evidence
Requirements from `applications/DeepSpeed-VisualChat/requirements.txt`:
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
open_clip_torch
deepspeed>=0.10.3
einops
einops_exts
transformers==4.33.3
transformers_stream_generator
termcolor
Multi-GPU control from `eval/eval_scripts/run_batch.sh:16`:
#NOTE: to run multi-GPU, you simple do "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;"
Batch size configuration from `training/main.py:85-87`:
'train_micro_batch_size_per_gpu': args.per_device_train_batch_size,
'train_batch_size': args.per_device_train_batch_size * world_size * args.gradient_accumulation_steps,
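The relationship shown above between per-device micro-batch size, world size, and gradient accumulation determines the effective global batch size, and can be checked directly:

```python
# Effective global batch size, mirroring the formula in training/main.py:
# train_batch_size = per_device_batch * world_size * grad_accum_steps.
def effective_batch_size(per_device_train_batch_size, world_size,
                         gradient_accumulation_steps):
    return (per_device_train_batch_size
            * world_size
            * gradient_accumulation_steps)

# Example: 8 GPUs, micro-batch of 2 per GPU, 4 accumulation steps.
global_batch = effective_batch_size(2, 8, 4)
```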
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: transformers version mismatch` | Wrong transformers version | Must use exactly `transformers==4.33.3` |
| `CUDA out of memory` | 70B decoder with insufficient GPUs | Use 8 GPUs with ZeRO Stage 3 and LoRA |
| `FileNotFoundError: dataset path` | Dataset not downloaded to expected location | Follow README dataset preparation instructions exactly |
| `RuntimeError: checkpoint global step not restored` | Known issue with checkpoint loading | Manual handling required; global step must be tracked separately |
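For the checkpoint global-step issue in the last row, one workaround is to persist the step yourself next to the checkpoint. The sketch below uses a hypothetical file name and layout, not the repo's checkpoint format:

```python
# Sketch of tracking the global step alongside a checkpoint directory,
# since it is not restored automatically. File name and layout are
# hypothetical, not DeepSpeed-VisualChat's own format.
import json, os, tempfile

def save_global_step(ckpt_dir, step):
    with open(os.path.join(ckpt_dir, "global_step.json"), "w") as f:
        json.dump({"global_step": step}, f)

def load_global_step(ckpt_dir):
    path = os.path.join(ckpt_dir, "global_step.json")
    if not os.path.exists(path):
        return 0  # fresh run: start counting from zero
    with open(path) as f:
        return json.load(f)["global_step"]

with tempfile.TemporaryDirectory() as d:
    assert load_global_step(d) == 0   # no file yet
    save_global_step(d, 1200)
    restored = load_global_step(d)
```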
Compatibility Notes
- Transformers Pinned: Must use exactly transformers==4.33.3; other versions may break model loading or generation
- Supported Vision Encoders: QWen-VL CLIP (2B parameters)
- Supported Language Decoders: LLaMA-2 7B/13B/70B
- Multi-image Support: Up to 8 images per sample (`max_num_image_per_sample`)
- Maximum Sequence Length: Default 4096 tokens (configurable via `--max_seq_len`)
- Dataset Collation: Mixed datasets must use compatible collation functions
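The multi-image cap (`max_num_image_per_sample`, default 8 per the note above) implies that samples with more images must be rejected or truncated during dataset building. A hedged sketch of such a filter (hypothetical helper, not the repo's dataset code):

```python
# Hypothetical filter enforcing the per-sample image cap noted above.
MAX_NUM_IMAGE_PER_SAMPLE = 8

def filter_samples(samples, max_images=MAX_NUM_IMAGE_PER_SAMPLE):
    # Drop conversation samples whose image count exceeds the cap.
    return [s for s in samples if len(s["images"]) <= max_images]

batch = [
    {"id": 0, "images": ["a.jpg"] * 3},
    {"id": 1, "images": ["b.jpg"] * 9},   # over the cap, dropped
]
kept = filter_samples(batch)
```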
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Extract_Qwen_VL
- Implementation:Microsoft_DeepSpeedExamples_VisProjection
- Implementation:Microsoft_DeepSpeedExamples_Create_DSVL_Model
- Implementation:Microsoft_DeepSpeedExamples_Build_Dataset
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_VisualChat
- Implementation:Microsoft_DeepSpeedExamples_Fuse_LoRA