Workflow:Microsoft DeepSpeedExamples VisualChat Multimodal Training

From Leeroopedia



Knowledge Sources
Domains LLMs, Multimodal, Computer_Vision, Fine_Tuning, Distributed_Training
Last Updated 2026-02-07 13:00 GMT

Overview

End-to-end process for training a multimodal vision-language model that supports multi-round, multi-image interleaved conversations, combining a vision encoder with a large language model decoder using DeepSpeed distributed training.

Description

This workflow trains a DeepSpeed VisualChat (DSVL) model that can process interleaved sequences of images and text in multi-round conversations. The model architecture connects a vision encoder (OpenCLIP or QWen-VL) to a language model decoder (LLaMA-2) via learnable projection layers, enabling visual understanding and language generation in a unified framework.

Goal: A multimodal chat model capable of answering questions about images, describing visual content, and engaging in multi-turn conversations involving multiple images.

Scope: Covers vision encoder selection, projection layer configuration, multi-dataset training with blending, LoRA parameter-efficient fine-tuning, and multi-modal causal attention.

Strategy: Uses a three-component architecture (vision encoder + projection layer + LLM decoder) with LoRA for efficient fine-tuning. Supports 10+ multimodal datasets blended during training. DeepSpeed ZeRO handles distributed training across multiple GPUs.

Usage

Execute this workflow when you need to create a vision-language model that can understand and discuss images in multi-turn conversations. This is appropriate when you have access to multimodal datasets (image-text pairs, VQA, dialogue) and want to combine a pretrained vision encoder with a pretrained language model for visual chat capabilities.

Execution Steps

Step 1: Vision Encoder Setup

Select and configure the vision encoder that will process input images into visual feature representations.

Key considerations:

  • Choose either the OpenCLIP (clip-vit-large-patch14) or the QWen-VL vision encoder
  • For QWen-VL, extract the visual component from the full model using the helper script
  • Configure image preprocessing (resolution, normalization) matching the selected encoder
  • The vision encoder produces fixed-size feature vectors for each image
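The preprocessing considerations above can be sketched in plain Python. This is a minimal illustration, not the workflow's actual preprocessing code: the helper names `num_patches` and `normalize_pixel` are hypothetical, and the 224-pixel resolution is an assumed example value. The mean and standard deviation are the normalization statistics published with CLIP; the encoder you select may expect different ones.

```python
# CLIP's published per-channel normalization statistics (RGB order).
# Assumption: the encoder chosen for this workflow expects these values.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT sees per image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# A patch-14 encoder at an assumed 224x224 resolution sees a 16 x 16 grid.
print(num_patches(224, 14))  # 256
```

Because the patch count depends only on resolution and patch size, every image resized to the same resolution yields the same number of visual features, which is what makes the encoder output fixed-size.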

Step 2: Projection Layer Configuration

Configure the learnable projection layers that map vision features into the language model's embedding space.

What happens:

  • Select projection type: linear, MLP, or perceiver-based
  • The projection layer bridges the dimensionality gap between vision encoder output and LLM input
  • For perceiver projection, configure the number of learnable query tokens
  • These projection weights are the primary trainable parameters during the pretraining stage
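To make the dimensionality bridge concrete, here is a small sketch of the linear option and of how the parameter counts of the linear and MLP options compare. The function names are hypothetical and the code uses plain lists rather than the workflow's tensor library. The widths 1024 (a ViT-L vision encoder) and 4096 (the LLaMA-2-7B hidden size) are used as illustrative values only.

```python
def linear_param_count(vision_dim: int, llm_dim: int) -> int:
    """Weights plus biases of a single linear map vision_dim -> llm_dim."""
    return vision_dim * llm_dim + llm_dim


def mlp_param_count(vision_dim: int, hidden_dim: int, llm_dim: int) -> int:
    """Two stacked linear layers: vision_dim -> hidden_dim -> llm_dim."""
    return (vision_dim * hidden_dim + hidden_dim) + (hidden_dim * llm_dim + llm_dim)


def linear_project(features, weight, bias):
    """Project each vision feature vector into the LLM embedding space.

    features: list of vectors of length vision_dim
    weight:   llm_dim rows, each of length vision_dim
    bias:     vector of length llm_dim
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


# Illustrative sizes: a 1024-wide vision feature mapped to a 4096-wide LLM.
print(linear_param_count(1024, 4096))
```

Even the linear variant is tiny next to the language model, which is why training only the projection is a cheap first stage.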

Step 3: Language Model Loading

Load the pretrained language model decoder that will generate text responses conditioned on visual and textual inputs.

What happens:

  • Load a HuggingFace causal language model (e.g., LLaMA-2-7B or LLaMA-2-70B)
  • Configure tokenizer with appropriate special tokens for image placeholders
  • Optionally apply LoRA adapters to language model layers for parameter-efficient training
  • Set max sequence length (4096 tokens including image token positions)
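Since image positions count against the 4096-token limit, it can help to budget the sequence before tokenizing. The sketch below is an assumption-laden illustration: the `<image>` placeholder string, the figure of 256 positions per image, and both helper names are hypothetical and are not taken from the workflow's code.

```python
def text_token_budget(max_seq_len: int, num_images: int, tokens_per_image: int) -> int:
    """Tokens left for text once every image has claimed its positions."""
    used = num_images * tokens_per_image
    if used > max_seq_len:
        raise ValueError("images alone exceed the maximum sequence length")
    return max_seq_len - used


def expand_image_placeholders(tokens, placeholder, tokens_per_image):
    """Replace each placeholder with one slot per visual feature position."""
    expanded = []
    for tok in tokens:
        if tok == placeholder:
            expanded.extend([placeholder] * tokens_per_image)
        else:
            expanded.append(tok)
    return expanded


# Two images at an assumed 256 positions each, within the 4096-token limit.
print(text_token_budget(4096, 2, 256))  # 3584
```

A multi-round conversation with many images can exhaust this budget quickly, so the per-image position count chosen for the projection layer directly limits how much dialogue fits.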

Step 4: Multi-Dataset Preparation

Prepare and blend multiple multimodal datasets for training with configurable sample counts per dataset.

What happens:

  • Select from 10+ supported datasets: LLaVA, COCO Captions, VQA, A-OKVQA, OCR-VQA, MIMIC-IT variants, Sparkles dialogue
  • Each dataset provides image-text pairs formatted using the DeepSpeed Template (DST)
  • Configure per-dataset sample counts for the training blend
  • Split data for training and evaluation using configurable ratios
  • Create distributed data loaders with proper sampling
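The blending and splitting steps above can be sketched as follows. This is a simplified stand-in for the workflow's data utilities, using only the standard library: `blend_datasets` and `split_train_eval` are hypothetical names, datasets are modeled as plain lists, and capping a requested count at the dataset size is an assumed policy.

```python
import random


def blend_datasets(datasets, sample_counts, seed=0):
    """Draw the requested number of examples from each dataset, then shuffle.

    datasets:      mapping of dataset name -> list of examples
    sample_counts: mapping of dataset name -> number of examples to draw
    """
    rng = random.Random(seed)  # fixed seed keeps the blend reproducible
    blended = []
    for name, count in sample_counts.items():
        examples = datasets[name]
        # Assumed policy: sample without replacement, capped at dataset size.
        picked = rng.sample(examples, min(count, len(examples)))
        blended.extend((name, ex) for ex in picked)
    rng.shuffle(blended)
    return blended


def split_train_eval(examples, train_ratio):
    """Split an already-shuffled list into train and eval portions."""
    cut = int(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]


# Toy stand-ins for real datasets; names and sizes are illustrative only.
toy = {"captions": list(range(10)), "vqa": list(range(5))}
blend = blend_datasets(toy, {"captions": 4, "vqa": 8})
train, evaluation = split_train_eval(blend, 0.9)
print(len(train), len(evaluation))
```

Tagging each example with its source dataset name makes it easy to check afterwards that the blend matches the configured counts.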

Step 5: Distributed Training

Train the multimodal model using DeepSpeed with ZeRO optimization and optional multi-modal causal attention.

What happens:

  • Initialize DeepSpeed engine with ZeRO Stage 2 or 3 configuration
  • Configure mixed precision training (fp16 or bf16)
  • Enable gradient checkpointing for memory efficiency
  • Optionally enable Multi-Modal Causal Attention (MMCA) for improved cross-modal attention
  • Train with causal language modeling loss on interleaved image-text sequences
  • LoRA parameters and projection layers are updated; base model weights can be frozen
  • Log training metrics via TensorBoard
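A DeepSpeed run is driven by a configuration dictionary (or JSON file). The sketch below builds one covering the ZeRO stage, mixed precision, and TensorBoard settings listed above. The function name `make_ds_config`, the default values, and the job name are assumptions for illustration; the workflow's own configuration helper may set additional fields. Gradient checkpointing and MMCA are enabled on the model rather than in this dictionary, so they are omitted here.

```python
def make_ds_config(zero_stage=2, precision="bf16", micro_batch_size=1,
                   grad_accum_steps=1, tb_dir="./tensorboard"):
    """Build a minimal DeepSpeed config dict (illustrative defaults)."""
    if zero_stage not in (2, 3):
        raise ValueError("this workflow uses ZeRO stage 2 or 3")
    if precision not in ("fp16", "bf16"):
        raise ValueError("precision must be 'fp16' or 'bf16'")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "zero_optimization": {"stage": zero_stage},
        # Exactly one of the two mixed-precision modes is switched on.
        "fp16": {"enabled": precision == "fp16"},
        "bf16": {"enabled": precision == "bf16"},
        "tensorboard": {
            "enabled": True,
            "output_path": tb_dir,
            "job_name": "visualchat",  # hypothetical job name
        },
    }


config = make_ds_config(zero_stage=3, precision="fp16")
print(config["zero_optimization"], config["fp16"], config["bf16"])
```

The resulting dictionary is the kind of object passed as the `config` argument of `deepspeed.initialize`, alongside the model and its trainable parameters.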

Step 6: Model Fusion and Export

Merge LoRA adapters with base model weights and export the final multimodal model.

What happens:

  • Merge trained LoRA adapter weights back into the base language model
  • Save the complete model (vision encoder + projection + language model) as a checkpoint
  • The exported model can be used with the interactive chat interface for inference
  • Checkpoint resume is supported for continued training
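Merging a LoRA adapter means folding its low-rank update into the frozen weight so that inference needs no extra matrices. The sketch below follows the standard LoRA formulation, W' = W + (alpha / r) · B A, on plain nested lists. It is an illustration of the arithmetic only: the helper names are hypothetical, and the workflow's actual merge routine may organize or scale the adapter differently.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: out_dim x in_dim  (frozen base weight)
    lora_b: out_dim x rank
    lora_a: rank x in_dim
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny rank-1 example with made-up numbers.
base = [[1.0, 0.0], [0.0, 1.0]]
merged = merge_lora(base, lora_a=[[0.5, 0.5]], lora_b=[[1.0], [2.0]], alpha=2, rank=1)
print(merged)  # [[2.0, 1.0], [2.0, 3.0]]
```

After the merge, the adapter matrices can be discarded, and the merged weight has the same shape as the original, so the exported checkpoint loads like an ordinary language model.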
