Principle: Microsoft DeepSpeedExamples Multimodal Model Composition
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Multimodal_Model_Composition |
| Sources | Paper: LLaVA (https://arxiv.org/abs/2304.08485), Paper: DeepSpeed-VisualChat (https://arxiv.org/abs/2309.14327) |
| Domains | Multimodal, Model_Architecture, NLP |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
An architecture pattern that composes a vision encoder, projection layer, and language decoder into a unified multimodal model for visual question answering and chat.
Description
The DeepSpeed-VisualChat model is a three-stage composed architecture that processes interleaved image and text inputs to produce natural language responses. Rather than training a monolithic multimodal model from scratch, this approach composes three pre-trained or purpose-built components:
Stage 1: Vision Encoder
The vision encoder processes raw images into dense feature sequences:
- Accepts images of fixed resolution (e.g., 224x224 for CLIP or 448x448 for Qwen-VL)
- Outputs a tensor of shape `[num_images, num_patches, vis_dim]`
- The encoder is frozen during training (no gradient computation)
Supported encoders:
- Standard CLIP models (e.g., `openai/clip-vit-large-patch14`) loaded via `CLIPVisionModel`
- Qwen-VL's modified CLIP (a ViT-bigG variant with output dimension 4096) loaded as a standalone `VisionTransformer`
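The frozen-encoder contract above can be sketched with a toy stand-in. A real setup would load e.g. `CLIPVisionModel`; the `ToyVisionEncoder` module here, its per-patch projection, and its dimensions are illustrative only, chosen to reproduce the `[num_images, num_patches, vis_dim]` output shape (257 = 256 patches + 1 CLS token for CLIP ViT-L/14):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a CLIP-style vision encoder: it crudely slices
# each image into num_patches "patch" vectors and projects them to vis_dim.
class ToyVisionEncoder(nn.Module):
    def __init__(self, num_patches=257, vis_dim=1024):
        super().__init__()
        self.num_patches = num_patches
        self.patch_proj = nn.Linear(196, vis_dim)

    def forward(self, images):
        # images: [num_images, 3, 224, 224]
        n = images.shape[0]
        flat = images.reshape(n, -1)[:, : self.num_patches * 196]
        return self.patch_proj(flat.reshape(n, self.num_patches, 196))

encoder = ToyVisionEncoder()
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)            # frozen: excluded from the optimizer

with torch.no_grad():                  # no activations stored for backprop
    feats = encoder(torch.randn(2, 3, 224, 224))
print(feats.shape)  # torch.Size([2, 257, 1024])
```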
Stage 2: Projection Layer
Maps visual features from the vision encoder's dimension to the language decoder's embedding dimension:
- Three options: `baseline` (Linear + LayerNorm), `vit` (`CLIPEncoderLayer` + Linear + LayerNorm), `perceiver` (cross-attention with learned queries)
- The projection layer is always trainable
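As a minimal sketch, the `baseline` option is just a Linear layer followed by LayerNorm, mapping the encoder's output width to the decoder's embedding width. The dimensions below are illustrative, not the repository's defaults:

```python
import torch
import torch.nn as nn

# "baseline" projection: Linear + LayerNorm. vis_dim/lang_dim are
# example values (CLIP ViT-L/14 width -> a 4096-dim decoder embedding).
vis_dim, lang_dim = 1024, 4096

baseline_proj = nn.Sequential(
    nn.Linear(vis_dim, lang_dim),
    nn.LayerNorm(lang_dim),
)

vis_feats = torch.randn(2, 257, vis_dim)   # [num_images, num_patches, vis_dim]
projected = baseline_proj(vis_feats)       # [num_images, num_patches, lang_dim]

# Unlike the frozen encoder, the projection is always trainable.
assert all(p.requires_grad for p in baseline_proj.parameters())
```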
Stage 3: Language Decoder
A causal language model that processes the concatenated visual and text token embeddings:
- Currently supports LLaMA-2 family models
- Produces autoregressive text generation conditioned on visual context
- Can be fine-tuned with LoRA adapters for parameter efficiency
Concatenation and Interleaving
The core innovation is the interleaving of visual and text tokens within a single sequence:
```
Input sequence: [system_prompt] [### Image 1:] [vis_tokens_1] [### Question:] [text_tokens] [### Answer:]
```
For multi-image inputs:
```
Input sequence: [### Image 1:] [vis_tokens_1] [### Image 2:] [vis_tokens_2] [### Question:] [text] [### Answer:]
```
The `<image>` placeholder tokens in the text are replaced with the actual projected visual feature tensors at runtime. The concatenation process:
- Text tokens are embedded via the language model's embedding layer
- `<image>` token positions are identified in the input IDs
- Projected visual features are inserted at those positions, replacing the placeholder
- The resulting mixed embedding sequence is padded to a uniform length
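The steps above can be sketched in plain Python. Embeddings are represented as stand-in strings for readability, and the names (`IMAGE_TOKEN_ID`, `merge_text_and_vision`, `pad_right`) are hypothetical, not the repository's identifiers:

```python
IMAGE_TOKEN_ID = 32000   # hypothetical id of the <image> placeholder token
PAD_EMBED = "pad"        # stand-in for the padding token's embedding

def merge_text_and_vision(input_ids, text_embeds, vis_embeds_per_image):
    """Replace each <image> placeholder with that image's projected features."""
    merged, img_idx = [], 0
    for tok, emb in zip(input_ids, text_embeds):
        if tok == IMAGE_TOKEN_ID:
            merged.extend(vis_embeds_per_image[img_idx])  # insert visual tokens
            img_idx += 1
        else:
            merged.append(emb)                            # keep text embedding
    return merged

def pad_right(seq, multiple=8):
    """Right-pad the mixed sequence to a length divisible by `multiple`."""
    while len(seq) % multiple:
        seq.append(PAD_EMBED)
    return seq

ids    = [1, 9, IMAGE_TOKEN_ID, 7, 2]
embeds = ["t0", "t1", "<img>", "t2", "t3"]   # one embedding per input id
vision = [["v0", "v1", "v2"]]                # one image, three visual tokens

seq = pad_right(merge_text_and_vision(ids, embeds, vision))
print(seq)  # ['t0', 't1', 'v0', 'v1', 'v2', 't2', 't3', 'pad']
```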
Theoretical Basis
Visual Token Injection
The key theoretical insight is that projected visual features can be treated as additional "tokens" in the language model's input:
```
hidden_states = concat(
    text_embed(tokens_before_image),
    projection(vis_encoder(image)),
    text_embed(tokens_after_image)
)
```
This works because:
- The projection layer maps visual features to the same dimensional space as text embeddings
- The language model's self-attention mechanism can attend to both visual and text tokens
- Causal masking ensures proper autoregressive generation
Multi-Modal Causal Attention (MMCA)
DeepSpeed-VisualChat introduces MMCA, a modified attention mechanism that distinguishes between visual and text tokens in the attention computation:
```
attention_mask values:
  0 = padding (ignored)
  1 = text token (standard causal attention)
  2 = image token (visual attention pattern)
```
When `enable_mmca_attention` is set, the attention mechanism applies different masking patterns for image-to-text and text-to-image attention, similar to cross-attention but within a unified self-attention framework.
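Deriving the three-valued mask from token ids can be sketched as follows; the token ids themselves (`IMAGE_TOKEN_ID`, `PAD_TOKEN_ID`) are hypothetical values, not the repository's:

```python
IMAGE_TOKEN_ID = 32000   # hypothetical placeholder-token id
PAD_TOKEN_ID = 0         # hypothetical padding-token id

def build_mmca_mask(input_ids):
    """0 = padding (ignored), 1 = text token, 2 = image token."""
    mask = []
    for tok in input_ids:
        if tok == PAD_TOKEN_ID:
            mask.append(0)
        elif tok == IMAGE_TOKEN_ID:
            mask.append(2)
        else:
            mask.append(1)
    return mask

ids = [1, 5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 9, 0, 0]
print(build_mmca_mask(ids))  # [1, 1, 2, 2, 1, 0, 0]
```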
Loss Computation
The model computes cross-entropy loss only on the answer tokens, not on the instruction or image tokens:
```
labels = [-100, -100, ..., -100, answer_token_1, answer_token_2, ..., eos]
         |--- instruction ----|  |---------- answer region -----------|

loss = CrossEntropyLoss(logits[labels != -100], labels[labels != -100])
```
The -100 label value (matching PyTorch's CrossEntropyLoss ignore index) is used to mask out instruction tokens, image tokens, and padding from the loss computation.
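A small check that the `-100` convention is equivalent to explicitly filtering down to the answer region (PyTorch's `F.cross_entropy` uses `ignore_index=-100` by default); the tensor sizes here are toy values:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, seq_len = 10, 6
logits = torch.randn(seq_len, vocab)

# Instruction/image positions carry -100; only the last three answer
# tokens contribute to the loss.
labels = torch.tensor([-100, -100, -100, 4, 7, 2])

# Masked positions are skipped via the ignore index.
loss = F.cross_entropy(logits, labels, ignore_index=-100)

# Equivalent explicit masking, as in the formula above.
keep = labels != -100
loss_manual = F.cross_entropy(logits[keep], labels[keep])
assert torch.allclose(loss, loss_manual)
```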
Trainable vs. Frozen Components
| Component | Trainable? | Rationale |
|---|---|---|
| Vision Encoder | No (frozen) | Pre-trained features are sufficient; large parameter count |
| Projection Layer | Yes | Must learn to bridge the specific encoder-decoder pair |
| Language Embedding | Yes | Extended vocabulary for special tokens (`<image>`, etc.) |
| Language Decoder (base) | No (frozen) | Pre-trained language capabilities preserved |
| Language Decoder (LoRA) | Yes (if enabled) | Small adapter weights for task-specific fine-tuning |
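The freeze/train split in the table can be sketched with a toy module dictionary; the component names and layer sizes here are illustrative, not the repository's module names:

```python
import torch.nn as nn

# Toy composed model mirroring the table's four components.
model = nn.ModuleDict({
    "vis_encoder":  nn.Linear(8, 8),      # frozen
    "projection":   nn.Linear(8, 8),      # trainable
    "lang_embed":   nn.Embedding(16, 8),  # trainable (resized vocabulary)
    "lang_decoder": nn.Linear(8, 8),      # frozen base (LoRA not shown)
})

# Freeze the vision encoder and the base language decoder.
for name in ("vis_encoder", "lang_decoder"):
    for p in model[name].parameters():
        p.requires_grad_(False)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['projection.weight', 'projection.bias', 'lang_embed.weight']
```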
Key Considerations
- Memory efficiency -- The vision encoder runs in `torch.no_grad()` mode when frozen, saving significant GPU memory by not storing activations for backpropagation.
- Token limit -- The maximum sequence length (`max_seq_len`, default 4096) must accommodate both visual tokens and text tokens. With CLIP ViT-L/14, each image contributes ~257 tokens; multiple images can quickly exhaust the context window.
- Vocabulary extension -- Special tokens (`<image>`, `<im_patch>`, `<im_start>`, `<im_end>`) are added to the tokenizer, and the language model's embedding layer is resized accordingly.
- Padding strategy -- Variable-length sequences (from different numbers of images) are padded using the padding token embedding, with padding on the right side and divisible-by-8 alignment for hardware efficiency.
- Gradient checkpointing -- Both the vision encoder and language decoder support gradient checkpointing to reduce memory usage during training at the cost of recomputation.
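A back-of-envelope budget for the token-limit consideration: with ~257 visual tokens per image and an assumed (hypothetical) 500-token text allowance, a 4096-token window holds only a handful of images:

```python
# Rough token budget; the 500-token text allowance is an assumed figure.
max_seq_len = 4096        # default context window
tokens_per_image = 257    # CLIP ViT-L/14: 256 patches + 1 CLS token
text_budget = 500         # assumed prompt + answer length

max_images = (max_seq_len - text_budget) // tokens_per_image
print(max_images)  # 13
```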
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Create_DSVL_Model -- The concrete model implementation
- Principle:Microsoft_DeepSpeedExamples_Vision_Encoder_Extraction -- How the vision encoder is obtained
- Principle:Microsoft_DeepSpeedExamples_Vision_Language_Projection -- How visual features are projected
- Principle:Microsoft_DeepSpeedExamples_Multi_Dataset_VQA_Preparation -- How training data is prepared for this model
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Distributed_Training -- How this model is trained with DeepSpeed