Principle: Microsoft DeepSpeedExamples Vision Encoder Extraction
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Vision_Encoder_Extraction |
| Sources | Paper: CLIP (https://arxiv.org/abs/2103.00020), Paper: Qwen-VL (https://arxiv.org/abs/2308.12966) |
| Domains | Computer_Vision, Multimodal, Transfer_Learning |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
A technique for extracting pre-trained vision encoder weights from multimodal models to serve as standalone visual feature extractors.
Description
Multimodal models such as Qwen-VL-Chat combine a powerful vision encoder (a modified CLIP ViT-bigG variant) with a large language model within a single architecture. For downstream tasks or modular pipelines like DeepSpeed-VisualChat, it is often desirable to extract the vision encoder as an independent component that can be paired with different language decoders.
The extraction process involves several steps:
- Load the full multimodal model -- The entire Qwen-VL-Chat model is loaded into memory, including all language decoder weights, the vision encoder, and projection layers.
- Filter the state dictionary -- The model's `state_dict()` is iterated, and only keys containing the substring `'visual'` are retained. This isolates the vision encoder's parameters from the language model parameters.
- Strip the namespace prefix -- Keys matching the pattern `transformer.visual.*` have the `transformer.visual.` prefix removed, producing clean weight names compatible with a standalone ViT model (e.g., `transformer.visual.conv1.weight` becomes `conv1.weight`).
- Exclude the projection layer -- The `transformer.visual.proj` weight is explicitly excluded from the saved state dictionary. This projection layer was trained specifically for Qwen-VL's internal multimodal alignment and would interfere with a new projection layer designed for the target language decoder.
The result is a set of pure vision encoder weights that can be loaded into a standalone VisionTransformer module.
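The filter/strip/exclude steps above can be sketched on a toy state dictionary. The key names mirror Qwen-VL's layout; the values are placeholder strings standing in for real tensors:

```python
def extract_visual_weights(state_dict):
    """Keep only vision-encoder weights, strip the namespace prefix,
    and drop the multimodal projection matrix."""
    return {
        k.replace('transformer.visual.', ''): v
        for k, v in state_dict.items()
        if 'visual' in k and 'transformer.visual.proj' not in k
    }

# Toy stand-in for Qwen-VL-Chat's full state_dict()
full = {
    'transformer.visual.conv1.weight': 'patch-embed kernel',
    'transformer.visual.transformer.resblocks.0.attn.in_proj_weight': 'attn',
    'transformer.visual.proj': 'multimodal projection (excluded)',
    'transformer.h.0.mlp.w1.weight': 'language-decoder weight (excluded)',
}
extracted = extract_visual_weights(full)
# Only the two encoder weights survive, with clean names like 'conv1.weight'.
```

The language-decoder key is dropped because it lacks the `visual` substring, and the projection matrix is dropped by the explicit exclusion clause.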
Why extract rather than train from scratch?
Pre-trained vision encoders carry rich visual representations learned from billions of image-text pairs (e.g., CLIP's 400M or LAION-2B datasets). Extracting these weights provides:
- Transfer learning -- The encoder already understands objects, spatial relationships, textures, and high-level scene semantics.
- Computational savings -- Training a ViT-bigG from scratch requires enormous compute; extraction reuses existing investment.
- Flexibility -- The extracted encoder can be paired with any language decoder (LLaMA-2-7B, LLaMA-2-13B, OPT, etc.) via a new projection layer.
Theoretical Basis
Vision Transformer Feature Extraction
A Vision Transformer (ViT) partitions an input image into fixed-size patches and processes them through a transformer encoder:
F = ViT(image) in R^(num_patches x hidden_dim)
For example, with a 448x448 image and 14x14 patches:
- Number of patches = (448/14)^2 = 1024
- Hidden dimension = 1664 (for ViT-bigG), but Qwen-VL uses a final output dimension of 4096 via an internal attention pooling layer
The resulting feature tensor F serves as a sequence of "visual tokens" that can be consumed by a language model in the same way as text token embeddings.
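The patch arithmetic above can be checked with a one-line helper (the function name is illustrative, not part of any library):

```python
def visual_token_count(image_size, patch_size):
    # A ViT tiles the image into non-overlapping patch_size x patch_size patches;
    # each patch becomes one "visual token" in the output sequence.
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    return (image_size // patch_size) ** 2

tokens = visual_token_count(448, 14)  # -> 1024, matching the example above
```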
CLIP Contrastive Pre-training
The CLIP framework (Radford et al., 2021) trains the vision encoder alongside a text encoder using a contrastive loss:
L_CLIP = -log( exp(sim(v_i, t_i) / tau) / sum_j exp(sim(v_i, t_j) / tau) )
where:
- `v_i` is the image embedding from the vision encoder
- `t_i` is the text embedding from the text encoder
- `tau` is a learned temperature parameter
- `sim()` denotes cosine similarity
This training objective ensures the vision encoder produces features that are semantically meaningful and aligned with natural language concepts.
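A minimal NumPy sketch of the image-to-text direction of this loss (CLIP also uses the symmetric text-to-image term; the function name is illustrative):

```python
import numpy as np

def clip_image_to_text_loss(image_emb, text_emb, tau=0.07):
    # L2-normalize so the dot product equals cosine similarity sim(v, t)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / tau                     # (N, N): sim(v_i, t_j) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()          # positives sit on the diagonal
```

Matched image-text pairs on the diagonal drive the loss down; misaligned batches drive it up, which is what forces the encoder's features to line up with language.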
Weight Filtering Logic
The extraction logic can be expressed as a dictionary comprehension:
```python
save_dict = {
    k.replace('transformer.visual.', ''): v
    for k, v in model.state_dict().items()
    if 'visual' in k and 'transformer.visual.proj' not in k
}
```
The exclusion of transformer.visual.proj is critical because this is a learned projection matrix that maps from the vision hidden dimension to Qwen-VL's specific joint embedding space. A new projection layer will be trained in DeepSpeed-VisualChat to bridge to the target language decoder's embedding space.
Key Considerations
- Model compatibility -- The extracted weights must match the architecture of the standalone `VisionTransformer` class in terms of layer count, hidden dimensions, attention heads, and intermediate sizes.
- Configuration alignment -- DeepSpeed-VisualChat uses a "fake config" from `laion/CLIP-ViT-bigG-14-laion2B-39B-b160k` to specify the architecture parameters (`patch_size`, `hidden_size=1664`, `num_hidden_layers`, `num_attention_heads`, `intermediate_size`), then overrides `hidden_size` to 4096 after loading to match Qwen-VL's output dimension.
- Device management -- The full Qwen-VL model may require significant GPU memory for loading. The extraction script uses `device_map="cuda"` and loads the model in eval mode.
- Strict loading -- The extracted weights are loaded with `strict=True` to ensure all parameters match exactly, catching any mismatch early.
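The strict-loading behavior can be demonstrated on a tiny stand-in module (assumes PyTorch is installed; `TinyEncoder` is illustrative, not the real `VisionTransformer`):

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=2)

encoder = TinyEncoder()

# A complete state dict loads cleanly under strict=True.
good = {'conv1.weight': torch.zeros(8, 3, 2, 2), 'conv1.bias': torch.zeros(8)}
encoder.load_state_dict(good, strict=True)

# A state dict with a missing key raises immediately instead of
# silently leaving that parameter at its random initialization.
bad = {'conv1.weight': torch.zeros(8, 3, 2, 2)}  # conv1.bias is missing
try:
    encoder.load_state_dict(bad, strict=True)
    caught_mismatch = False
except RuntimeError:
    caught_mismatch = True
```

This is why `strict=True` is the safer default for extracted weights: any architectural drift between the source model and the standalone module surfaces at load time, not at inference time.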
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Extract_Qwen_VL -- The concrete script that performs the extraction
- Principle:Microsoft_DeepSpeedExamples_Vision_Language_Projection -- The projection layer that bridges the extracted encoder to a language model
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Model_Composition -- The full model composition that uses the extracted encoder