Workflow:Microsoft DeepSpeedExamples VisualChat Multimodal Training

Knowledge Sources	DeepSpeedExamples DeepSpeed Docs
Domains	LLMs, Multimodal, Computer_Vision, Fine_Tuning, Distributed_Training
Last Updated	2026-02-07 13:00 GMT

Overview

End-to-end process for training a multimodal vision-language model that supports multi-round, multi-image interleaved conversations, combining a vision encoder with a large language model decoder using DeepSpeed distributed training.

Description

This workflow trains a DeepSpeed VisualChat (DSVL) model that can process interleaved sequences of images and text in multi-round conversations. The model architecture connects a vision encoder (OpenCLIP or QWen-VL) to a language model decoder (LLaMA-2) via learnable projection layers, enabling visual understanding and language generation in a unified framework.

Goal: A multimodal chat model capable of answering questions about images, describing visual content, and engaging in multi-turn conversations involving multiple images.

Scope: Covers vision encoder selection, projection layer configuration, multi-dataset training with blending, LoRA parameter-efficient fine-tuning, and multi-modal causal attention.

Strategy: Uses a three-component architecture (vision encoder + projection layer + LLM decoder) with LoRA for efficient fine-tuning. Supports 10+ multimodal datasets blended during training. DeepSpeed ZeRO handles distributed training across multiple GPUs.

Usage

Execute this workflow when you need to create a vision-language model that can understand and discuss images in multi-turn conversations. This is appropriate when you have access to multimodal datasets (image-text pairs, VQA, dialogue) and want to combine a pretrained vision encoder with a pretrained language model for visual chat capabilities.

Execution Steps

Step 1: Vision Encoder Setup

Select and configure the vision encoder that will process input images into visual feature representations.

Key considerations:

Choose between OpenCLIP (clip-vit-large-patch14) or QWen-VL vision encoder
For QWen-VL, extract the visual component from the full model using the helper script
Configure image preprocessing (resolution, normalization) matching the selected encoder
The vision encoder produces fixed-size feature vectors for each image
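
The preprocessing step above can be sketched as follows. This is a minimal illustration, not the repository's code: the 224-pixel resolution, the 14-pixel patch size, and the mean/std constants are the published CLIP ViT-L/14 defaults and are assumed here; the helper names are hypothetical. A QWen-VL encoder may use different values.

```python
# Sketch only: these constants are the published CLIP ViT-L/14 defaults,
# assumed here; verify them against the encoder you actually load.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT sees per image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map one 8-bit RGB pixel to per-channel normalized floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# A 224x224 image with 14x14 patches gives a 16x16 grid of patch tokens.
print(num_patch_tokens(224, 14))
print(normalize_pixel((128, 128, 128)))
```

The patch count matters later because every image occupies that many positions (or fewer, if the projection compresses them) in the language model's input sequence.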

Step 2: Projection Layer Configuration

Configure the learnable projection layers that map vision features into the language model's embedding space.

What happens:

Select projection type: linear, MLP, or perceiver-based
The projection layer bridges the dimensionality gap between vision encoder output and LLM input
For perceiver projection, configure the number of learnable query tokens
These projection weights are the primary trainable parameters during the pretraining stage
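
To make the dimensionality gap concrete, the sketch below counts trainable parameters for a linear and a two-layer MLP projection and applies a toy linear projection. The widths 1024 (vision) and 4096 (LLM) are assumptions chosen to resemble a ViT-L encoder and a 7B decoder; the function names are hypothetical and the perceiver variant is not covered.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def mlp_params(d_in, d_hidden, d_out):
    """Parameter count of a two-layer MLP projection (biases included)."""
    return linear_params(d_in, d_hidden) + linear_params(d_hidden, d_out)


def linear_project(features, weight, bias):
    """Project each feature vector: y = W x + b, with W shaped (d_out, d_in)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


VISION_DIM = 1024  # assumed vision encoder width
LLM_DIM = 4096     # assumed LLM embedding width

print("linear:", linear_params(VISION_DIM, LLM_DIM))
print("mlp:", mlp_params(VISION_DIM, LLM_DIM, LLM_DIM))

# Toy shapes: two visual tokens of width 3 projected to width 2.
tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.5, -0.5]
projected = linear_project(tokens, W, b)
print(projected)
```

Even the MLP variant is small next to a multi-billion-parameter decoder, which is why training only the projection in an early stage is comparatively cheap.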

Step 3: Language Model Loading

Load the pretrained language model decoder that will generate text responses conditioned on visual and textual inputs.

What happens:

Load a HuggingFace causal language model (e.g., LLaMA-2-7B or LLaMA-2-70B)
Configure tokenizer with appropriate special tokens for image placeholders
Optionally apply LoRA adapters to language model layers for parameter-efficient training
Set max sequence length (4096 tokens including image token positions)
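
Because image positions count against the 4096-token limit, it helps to budget the sequence before training. The sketch below is illustrative only: the `<image>` placeholder string and the figure of 256 positions per image are assumptions, not values taken from the repository.

```python
MAX_SEQ_LEN = 4096  # limit stated in this workflow


def text_token_budget(num_images, tokens_per_image, max_seq_len=MAX_SEQ_LEN):
    """Tokens left for text after reserving positions for every image."""
    used = num_images * tokens_per_image
    if used > max_seq_len:
        raise ValueError("images alone exceed the maximum sequence length")
    return max_seq_len - used


def expand_image_placeholders(tokens, tokens_per_image, placeholder="<image>"):
    """Replace each placeholder with one slot per visual token."""
    expanded = []
    for tok in tokens:
        if tok == placeholder:
            expanded.extend([placeholder] * tokens_per_image)
        else:
            expanded.append(tok)
    return expanded


# Four images at an assumed 256 positions each.
print(text_token_budget(num_images=4, tokens_per_image=256))
print(expand_image_placeholders(["Describe", "<image>", "briefly"], 3))
```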

Step 4: Multi-Dataset Preparation

Prepare and blend multiple multimodal datasets for training with configurable sample counts per dataset.

What happens:

Select from 10+ supported datasets: LLaVA, COCO Captions, VQA, A-OKVQA, OCR-VQA, MIMIC-IT variants, Sparkles dialogue
Each dataset provides image-text pairs formatted using the DeepSpeed Template (DST)
Configure per-dataset sample counts for training blend
Split data for training and evaluation using configurable ratios
Create distributed data loaders with proper sampling
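
The blending, splitting, and sharding steps above can be sketched in plain Python. This is a simplified stand-in for the repository's data utilities: the function names, the `"all"` sentinel, and the rank-striding shard (a rough analogue of a distributed sampler, without padding) are assumptions for illustration.

```python
import random


def blend_datasets(datasets, sample_counts, seed=0):
    """Draw the requested number of samples per dataset, then shuffle.

    datasets: dict of name -> list of samples
    sample_counts: dict of name -> int, or "all" to keep every sample
    """
    rng = random.Random(seed)
    blended = []
    for name, samples in datasets.items():
        count = sample_counts.get(name, "all")
        if count == "all" or count >= len(samples):
            chosen = list(samples)
        else:
            chosen = rng.sample(samples, count)
        blended.extend((name, s) for s in chosen)
    rng.shuffle(blended)
    return blended


def split_train_eval(samples, train_ratio):
    """Split an already shuffled list into train and eval parts."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


def shard_for_rank(samples, rank, world_size):
    """Give each process a disjoint, strided slice of the data."""
    return samples[rank::world_size]


# Toy data standing in for real image-text samples.
toy = {"llava": list(range(100)), "coco_caption": list(range(50))}
blend = blend_datasets(toy, {"llava": 10, "coco_caption": "all"})
train, evaluation = split_train_eval(blend, 0.5)
print(len(blend), len(train), len(evaluation))
print(len(shard_for_rank(train, rank=0, world_size=4)))
```

Fixing the seed keeps the blend identical across processes, so each rank can compute its own shard without communication.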

Step 5: Distributed Training

Train the multimodal model using DeepSpeed with ZeRO optimization and optional multi-modal causal attention.

What happens:

Initialize DeepSpeed engine with ZeRO Stage 2 or 3 configuration
Configure mixed precision training (fp16 or bf16)
Enable gradient checkpointing for memory efficiency
Optionally enable Multi-Modal Causal Attention (MMCA) for improved cross-modal attention
Train with causal language modeling loss on interleaved image-text sequences
LoRA parameters and projection layers are updated; base model weights can be frozen
Log training metrics via TensorBoard
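
A DeepSpeed configuration covering the ZeRO stage, precision, and TensorBoard choices above might look like the sketch below. The top-level keys are standard DeepSpeed config keys, but the specific values (batch sizes, clipping, paths, job name) and the `build_ds_config` helper are assumptions, not the settings used by the repository's scripts.

```python
import json


def build_ds_config(zero_stage=2, precision="bf16", micro_batch=1,
                    grad_accum=8, tensorboard_dir="./tensorboard"):
    """Build a DeepSpeed config dict; values here are illustrative."""
    if zero_stage not in (2, 3):
        raise ValueError("this workflow uses ZeRO stage 2 or 3")
    if precision not in ("fp16", "bf16"):
        raise ValueError("precision must be 'fp16' or 'bf16'")
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum,
        "zero_optimization": {"stage": zero_stage},
        # Exactly one of the two precision modes is enabled.
        "fp16": {"enabled": precision == "fp16"},
        "bf16": {"enabled": precision == "bf16"},
        "gradient_clipping": 1.0,
        "tensorboard": {
            "enabled": True,
            "output_path": tensorboard_dir,
            "job_name": "visual_chat",  # hypothetical job name
        },
    }


# Gradient checkpointing is usually switched on at the model level,
# so it does not appear in this dict.
config = build_ds_config(zero_stage=3, precision="bf16")
print(json.dumps(config, indent=2))
```

The resulting dict can be written to a JSON file or passed to the DeepSpeed initialization call as its config argument.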

Step 6: Model Fusion and Export

Merge LoRA adapters with base model weights and export the final multimodal model.

What happens:

Merge trained LoRA adapter weights back into the base language model
Save the complete model (vision encoder + projection + language model) as a checkpoint
The exported model can be used with the interactive chat interface for inference
Training can be resumed from a saved checkpoint
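
Merging a LoRA adapter amounts to folding the low-rank update into the base weight. The sketch below assumes the standard LoRA formulation, W' = W + (alpha / r) * B A, on tiny nested-list matrices; the repository's own fusion code and scaling convention may differ, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    weight: (out, in), lora_b: (out, rank), lora_a: (rank, in)
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy rank-1 adapter on a 2x2 identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # (rank=1, in=2)
B = [[1.0], [3.0]]     # (out=2, rank=1)
merged = merge_lora(W, A, B, alpha=2.0, rank=1)

# The merged weight reproduces base output plus scaled adapter output.
x = [[1.0], [1.0]]
merged_out = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, merged_out, separate_out)
```

After merging, inference needs no adapter-specific code path, since the exported weight matrix has the same shape as the original.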
