Environment:Microsoft DeepSpeedExamples VisualChat Training Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Multimodal, Computer_Vision, Infrastructure |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Linux environment with DeepSpeed >= 0.10.3, transformers == 4.33.3 (pinned), OpenCLIP, and multi-GPU NVIDIA hardware for training vision-language models combining QWen-VL encoders with LLaMA-2 decoders.
Description
This environment supports the DeepSpeed-VisualChat multimodal training pipeline, which composes a vision encoder (QWen-VL CLIP), a vision projection layer (ViT linear or perceiver), and a language model decoder (LLaMA-2 7B/13B/70B). Training uses ZeRO Stage 2/3 with LoRA for parameter-efficient fine-tuning. The environment requires a specific pinned version of transformers (4.33.3) and multiple dataset-specific dependencies for VQA, captioning, and dialogue datasets. Multi-GPU training is required for the 70B decoder.
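The three-stage composition described above can be sketched in plain Python. All three components here are toy stand-ins with hypothetical names; the real pipeline uses the QWen-VL CLIP encoder, a learned vision projection module, and a LLaMA-2 decoder:

```python
# Minimal sketch of the encoder -> projection -> decoder wiring in
# DeepSpeed-VisualChat. Every function here is a toy stand-in, not the
# repo's actual API.

def vision_encoder(image):
    # Stand-in: map an "image" to a feature vector.
    return [float(x) for x in image]

def vision_projection(features, decoder_dim):
    # Stand-in: project vision features into the decoder's embedding
    # space. The real projection is a learned module.
    scale = decoder_dim / max(len(features), 1)
    return [f * scale for f in features][:decoder_dim]

def language_decoder(vision_tokens, text_tokens):
    # Stand-in: the decoder consumes projected vision tokens
    # interleaved with text tokens.
    return len(vision_tokens) + len(text_tokens)

image = [1.0, 2.0, 3.0, 4.0]
vision_tokens = vision_projection(vision_encoder(image), decoder_dim=4)
output_len = language_decoder(vision_tokens, text_tokens=["<s>", "Hi"])
```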
Usage
Use this environment for any multimodal vision-language training workflow using DeepSpeed-VisualChat. It is the mandatory prerequisite for the Extract_Qwen_VL, VisProjection, Create_DSVL_Model, Build_Dataset, DeepSpeed_Initialize_VisualChat, and Fuse_LoRA implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ubuntu 20.04+ recommended |
| Hardware (7B decoder) | 1x NVIDIA A100-40GB | Single GPU possible with small batch size |
| Hardware (70B decoder) | 8x NVIDIA A100-80GB | Multi-GPU required for large decoder models |
| CPU RAM | 64GB+ | For dataset loading and preprocessing |
| Disk | 100GB+ SSD | For COCO images (18GB+), model weights, and other datasets |
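A rough back-of-envelope check of why the 70B decoder needs 8 GPUs. The assumptions here are not from the source: fp16 weights at 2 bytes per parameter, ZeRO Stage 3 partitioning weights evenly across ranks, and the estimate ignores activations, gradients, optimizer state, and the vision encoder:

```python
# Back-of-envelope memory estimate for a 70B-parameter decoder in fp16.
# Assumed: 2 bytes/parameter, ZeRO-3 partitions weights evenly across
# GPUs; activations, gradients, and optimizer state are excluded.
params = 70e9
bytes_per_param = 2          # fp16
gpus = 8
total_gb = params * bytes_per_param / 1024**3   # ~130 GB of weights
per_gpu_gb = total_gb / gpus                    # ~16 GB per GPU
```

Weights alone fit comfortably in 8x A100-80GB; the remaining budget goes to activations, gradients, optimizer state, and communication buffers, which is why ZeRO Stage 3 and LoRA are still needed on top of the partitioning.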
Dependencies
System Packages
- CUDA Toolkit (11.x or 12.x)
- NCCL (for multi-GPU distributed training)
Python Packages
- `deepspeed` >= 0.10.3
- `transformers` == 4.33.3 (pinned version)
- `datasets` >= 2.8.0
- `sentencepiece` >= 0.1.97
- `protobuf` == 3.20.3
- `accelerate` >= 0.15.0
- `open_clip_torch`
- `einops`
- `einops_exts`
- `transformers_stream_generator`
- `termcolor`
Environment Variables and Credentials
- `CUDA_VISIBLE_DEVICES`: Set to control which GPUs are used (e.g., `0,1,2,3,4,5,6,7`)
Downloading some datasets (COCO, A-OKVQA, etc.) may require registered accounts.
Quick Install
# Install all required packages
pip install "deepspeed>=0.10.3" "transformers==4.33.3" "datasets>=2.8.0" \
"sentencepiece>=0.1.97" "protobuf==3.20.3" "accelerate>=0.15.0" \
open_clip_torch einops einops_exts transformers_stream_generator termcolor
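Because the transformers pin is exact while other pins are minimums, a quick sanity check of installed versions can catch mismatches before training. The helper below is hypothetical (not part of the repo) and uses only string parsing, so it runs without any of the packages installed:

```python
# Hypothetical helper (not part of DeepSpeed-VisualChat): verify installed
# package versions against pins like "transformers==4.33.3" (exact) or
# "deepspeed>=0.10.3" (minimum).

def parse_version(v):
    return tuple(int(p) for p in v.split("."))

def satisfies(installed, spec):
    for op in ("==", ">="):
        if op in spec:
            name, required = spec.split(op)
            break
    else:
        return True  # unpinned package: any version accepted
    if op == "==":
        return installed == required
    # Tuple comparison handles e.g. 0.12.0 >= 0.10.3 correctly,
    # where a plain string comparison could mis-order components.
    return parse_version(installed) >= parse_version(required)

ok_exact = satisfies("4.33.3", "transformers==4.33.3")
bad_exact = satisfies("4.34.0", "transformers==4.33.3")
ok_min = satisfies("0.12.0", "deepspeed>=0.10.3")
```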
Code Evidence
Requirements from `applications/DeepSpeed-VisualChat/requirements.txt`:
datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
open_clip_torch
deepspeed>=0.10.3
einops
einops_exts
transformers==4.33.3
transformers_stream_generator
termcolor
Multi-GPU control from `eval/eval_scripts/run_batch.sh:16`:
#NOTE: to run multi-GPU, you simple do "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;"
Batch size configuration from `training/main.py:85-87`:
'train_micro_batch_size_per_gpu': args.per_device_train_batch_size,
'train_batch_size': args.per_device_train_batch_size * world_size * args.gradient_accumulation_steps,
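The relationship shown above between per-device micro-batch size, world size, and gradient accumulation determines the effective global batch size, and can be checked directly:

```python
# Effective global batch size, mirroring the formula in training/main.py:
# train_batch_size = per_device_batch * world_size * grad_accum_steps.
def effective_batch_size(per_device_train_batch_size, world_size,
                         gradient_accumulation_steps):
    return (per_device_train_batch_size
            * world_size
            * gradient_accumulation_steps)

# Example: 8 GPUs, micro-batch of 2 per GPU, 4 accumulation steps.
global_batch = effective_batch_size(2, 8, 4)
```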
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: transformers version mismatch` | Wrong transformers version | Must use exactly `transformers==4.33.3` |
| `CUDA out of memory` | 70B decoder with insufficient GPUs | Use 8 GPUs with ZeRO Stage 3 and LoRA |
| `FileNotFoundError: dataset path` | Dataset not downloaded to expected location | Follow README dataset preparation instructions exactly |
| `RuntimeError: checkpoint global step not restored` | Known issue with checkpoint loading | Manual handling required; global step must be tracked separately |
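For the checkpoint global-step issue in the last row, one workaround is to persist the step yourself next to the checkpoint. The sketch below uses a hypothetical file name and layout, not the repo's checkpoint format:

```python
# Sketch of tracking the global step alongside a checkpoint directory,
# since it is not restored automatically. File name and layout are
# hypothetical, not DeepSpeed-VisualChat's own format.
import json, os, tempfile

def save_global_step(ckpt_dir, step):
    with open(os.path.join(ckpt_dir, "global_step.json"), "w") as f:
        json.dump({"global_step": step}, f)

def load_global_step(ckpt_dir):
    path = os.path.join(ckpt_dir, "global_step.json")
    if not os.path.exists(path):
        return 0  # fresh run: start counting from zero
    with open(path) as f:
        return json.load(f)["global_step"]

with tempfile.TemporaryDirectory() as d:
    assert load_global_step(d) == 0   # no file yet
    save_global_step(d, 1200)
    restored = load_global_step(d)
```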
Compatibility Notes
- Transformers Pinned: Must use exactly transformers==4.33.3; other versions may break model loading or generation
- Supported Vision Encoders: QWen-VL CLIP (2B parameters)
- Supported Language Decoders: LLaMA-2 7B/13B/70B
- Multi-image Support: Up to 8 images per sample (`max_num_image_per_sample`)
- Maximum Sequence Length: Default 4096 tokens (configurable via `--max_seq_len`)
- Dataset Collation: Mixed datasets must use compatible collation functions
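The multi-image cap (`max_num_image_per_sample`, default 8 per the note above) implies that samples with more images must be rejected or truncated during dataset building. A hedged sketch of such a filter (hypothetical helper, not the repo's dataset code):

```python
# Hypothetical filter enforcing the per-sample image cap noted above.
MAX_NUM_IMAGE_PER_SAMPLE = 8

def filter_samples(samples, max_images=MAX_NUM_IMAGE_PER_SAMPLE):
    # Drop conversation samples whose image count exceeds the cap.
    return [s for s in samples if len(s["images"]) <= max_images]

batch = [
    {"id": 0, "images": ["a.jpg"] * 3},
    {"id": 1, "images": ["b.jpg"] * 9},   # over the cap, dropped
]
kept = filter_samples(batch)
```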
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Extract_Qwen_VL
- Implementation:Microsoft_DeepSpeedExamples_VisProjection
- Implementation:Microsoft_DeepSpeedExamples_Create_DSVL_Model
- Implementation:Microsoft_DeepSpeedExamples_Build_Dataset
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_VisualChat
- Implementation:Microsoft_DeepSpeedExamples_Fuse_LoRA