
Environment:Microsoft DeepSpeedExamples VisualChat Training Environment

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Multimodal, Computer_Vision, Infrastructure
Last Updated: 2026-02-07 13:00 GMT

Overview

Linux environment with DeepSpeed >= 0.10.3, transformers == 4.33.3 (pinned), OpenCLIP, and multi-GPU NVIDIA hardware for training vision-language models combining QWen-VL encoders with LLaMA-2 decoders.

Description

This environment supports the DeepSpeed-VisualChat multimodal training pipeline, which composes a vision encoder (QWen-VL CLIP), a vision projection layer (ViT linear or perceiver), and a language model decoder (LLaMA-2 7B/13B/70B). Training uses ZeRO Stage 2/3 with LoRA for parameter-efficient fine-tuning. The environment requires specific pinned versions of transformers (4.33.3) and multiple dataset-specific dependencies for VQA, captioning, and dialogue datasets. Multi-GPU training is recommended for models with 70B decoders.
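As a rough structural sketch of the three-stage composition described above (all function names and shapes here are illustrative stand-ins, not the actual DeepSpeed-VisualChat API, which builds the model via its own factory code):

```python
# Illustrative composition: vision encoder -> projection -> language decoder.
# Pure-Python stand-ins for QWen-VL CLIP, the linear/perceiver projection,
# and the LLaMA-2 decoder.

def encode_image(image_patches):
    # Stand-in for the QWen-VL CLIP encoder: one feature vector per patch.
    return [[float(p), float(p) * 2.0] for p in image_patches]

def project(features, hidden_size):
    # Stand-in for the vision projection: pad/trim each feature vector
    # to the decoder's hidden size.
    return [(f + [0.0] * hidden_size)[:hidden_size] for f in features]

def decode(vision_tokens, text_tokens):
    # Stand-in for the decoder: here it just reports the combined
    # sequence length it would attend over.
    return len(vision_tokens) + len(text_tokens)

vision = project(encode_image([1, 2, 3]), hidden_size=4)
print(decode(vision, ["<s>", "Describe", "the", "image"]))  # -> 7
```

The point of the sketch is only the data flow: image patches become encoder features, the projection maps them into the decoder's embedding space, and the decoder consumes projected vision tokens interleaved with text tokens.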

Usage

Use this environment for any multimodal vision-language training workflow using DeepSpeed-VisualChat. It is the mandatory prerequisite for the Extract_Qwen_VL, VisProjection, Create_DSVL_Model, Build_Dataset, DeepSpeed_Initialize_VisualChat, and Fuse_LoRA implementations.

System Requirements

Category                 Requirement           Notes
OS                       Linux                 Ubuntu 20.04+ recommended
Hardware (7B decoder)    1x NVIDIA A100-40GB   Single GPU possible with small batch size
Hardware (70B decoder)   8x NVIDIA A100-80GB   Multi-GPU required for large decoder models
CPU RAM                  64GB+                 For dataset loading and preprocessing
Disk                     100GB+ SSD            For COCO images (18GB+), model weights, and other datasets

Dependencies

System Packages

  • CUDA Toolkit (11.x or 12.x)
  • NCCL (for multi-GPU distributed training)

Python Packages

  • `deepspeed` >= 0.10.3
  • `transformers` == 4.33.3 (pinned version)
  • `datasets` >= 2.8.0
  • `sentencepiece` >= 0.1.97
  • `protobuf` == 3.20.3
  • `accelerate` >= 0.15.0
  • `open_clip_torch`
  • `einops`
  • `einops_exts`
  • `transformers_stream_generator`
  • `termcolor`

Credentials

  • `CUDA_VISIBLE_DEVICES`: Set to control which GPUs are used (e.g., `0,1,2,3,4,5,6,7`)

Downloading some datasets (COCO, A-OKVQA, etc.) may require accounts with their respective hosts.
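A typical GPU-selection setup looks like the following (the indices are an example; the count check uses only standard shell utilities):

```shell
# Restrict training to four GPUs (example indices).
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Count how many GPUs the variable exposes.
NUM_GPUS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' ')
echo "$NUM_GPUS"  # prints 4
```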

Quick Install

# Install all required packages
pip install "deepspeed>=0.10.3" "transformers==4.33.3" "datasets>=2.8.0" \
    "sentencepiece>=0.1.97" "protobuf==3.20.3" "accelerate>=0.15.0" \
    open_clip_torch einops einops_exts transformers_stream_generator termcolor
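After installation, the pinned versions can be sanity-checked from Python. The `check_pins` helper below is a small illustrative utility, not part of the repository:

```python
# Verify exact-pinned package versions after installation.
# PINNED holds the versions this environment requires exactly.
from importlib.metadata import version, PackageNotFoundError

PINNED = {"transformers": "4.33.3", "protobuf": "3.20.3"}

def check_pins(pins):
    """Return a list of human-readable problems; empty means all pins match."""
    problems = []
    for pkg, want in pins.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed")
            continue
        if got != want:
            problems.append(f"{pkg}: have {got}, need {want}")
    return problems

for problem in check_pins(PINNED):
    print(problem)
```

An empty output means both pins are satisfied; any printed line names the package to fix.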

Code Evidence

Requirements from `applications/DeepSpeed-VisualChat/requirements.txt`:

datasets>=2.8.0
sentencepiece>=0.1.97
protobuf==3.20.3
accelerate>=0.15.0
open_clip_torch
deepspeed>=0.10.3
einops
einops_exts
transformers==4.33.3
transformers_stream_generator
termcolor

Multi-GPU control from `eval/eval_scripts/run_batch.sh:16`:

#NOTE: to run multi-GPU, you simple do "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7;"

Batch size configuration from `training/main.py:85-87`:

'train_micro_batch_size_per_gpu': args.per_device_train_batch_size,
'train_batch_size': args.per_device_train_batch_size * world_size * args.gradient_accumulation_steps,
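These two config lines mean the effective global batch size is the product of the per-GPU micro batch, the data-parallel world size, and the gradient accumulation steps. A minimal illustration (the example values are assumptions, not defaults from the repository):

```python
# Effective global batch size, mirroring the formula in training/main.py:
# per-GPU micro batch x world size x gradient accumulation steps.
def effective_batch_size(per_device, world_size, grad_accum):
    return per_device * world_size * grad_accum

# Example: micro batch of 2 on 8 GPUs with 4 accumulation steps.
print(effective_batch_size(2, 8, 4))  # -> 64
```

When tuning to fit GPU memory, lower `per_device` and raise `grad_accum` to keep the effective batch size constant.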

Common Errors

  • `ImportError: transformers version mismatch`: wrong transformers version installed. Use exactly `transformers==4.33.3`.
  • `CUDA out of memory`: 70B decoder with insufficient GPUs. Use 8 GPUs with ZeRO Stage 3 and LoRA.
  • `FileNotFoundError: dataset path`: dataset not downloaded to the expected location. Follow the README dataset preparation instructions exactly.
  • `RuntimeError: checkpoint global step not restored`: known issue with checkpoint loading; the global step must be tracked and restored manually.
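For the checkpoint global-step issue, one possible workaround (illustrative; the sidecar file name and layout below are not part of the repository) is to persist the step yourself alongside the checkpoint:

```python
# Illustrative workaround: save/restore the global step in a sidecar JSON file,
# since checkpoint loading in this pipeline may not restore it.
import json
import os

def save_step(ckpt_dir, global_step):
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, "global_step.json"), "w") as f:
        json.dump({"global_step": global_step}, f)

def load_step(ckpt_dir):
    path = os.path.join(ckpt_dir, "global_step.json")
    if not os.path.exists(path):
        return 0  # fresh run: start from step zero
    with open(path) as f:
        return json.load(f)["global_step"]
```

Call `save_step` whenever the checkpoint is written and `load_step` before resuming, so the training loop and LR scheduler can be fast-forwarded to the saved step.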

Compatibility Notes

  • Transformers Pinned: Must use exactly transformers==4.33.3; other versions may break model loading or generation
  • Supported Vision Encoders: QWen-VL CLIP (2B parameters)
  • Supported Language Decoders: LLaMA-2 7B/13B/70B
  • Multi-image Support: Up to 8 images per sample (`max_num_image_per_sample`)
  • Maximum Sequence Length: Default 4096 tokens (configurable via `--max_seq_len`)
  • Dataset Collation: Mixed datasets must use compatible collation functions
