Implementation: Microsoft DeepSpeedExamples Extract Qwen VL
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | Extract_Qwen_VL |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| File | applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py |
| Lines | 1-14 |
| Language | Python |
| Status | Active |
Overview
Script for extracting the Qwen-VL vision encoder weights for standalone use in DeepSpeed-VisualChat.
Code Reference
The extraction script is a concise, self-contained Python file that loads the Qwen-VL-Chat model and saves only the vision encoder parameters:
Full Source (Lines 1-14)
```python
from transformers import AutoModelForCausalLM
import torch

PATH = "Qwen/Qwen-VL-Chat"

model = AutoModelForCausalLM.from_pretrained(PATH, device_map="cuda", trust_remote_code=True).eval()

state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        if 'transformer.visual.proj' not in k:  # we don't need the proj layer
            save_dict[k.replace('transformer.visual.', '')] = v

torch.save(save_dict, './qwen_clip/pytorch_model.bin')
```
Extraction Pattern
The script follows a four-step extraction pattern:
Step 1: Load the Full Model
```python
model = AutoModelForCausalLM.from_pretrained(
    PATH,
    device_map="cuda",
    trust_remote_code=True
).eval()
```
- `trust_remote_code=True` is required because Qwen-VL uses custom modeling code not yet merged into the Hugging Face transformers library.
- `device_map="cuda"` loads the model onto GPU memory.
- `.eval()` sets the model to evaluation mode (disables dropout, etc.).
Step 2: Filter State Dict for Vision Weights
```python
state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        ...
```
The condition `'visual' in k` selects only keys belonging to the vision encoder submodule. In the Qwen-VL architecture, all vision encoder parameters live under the `transformer.visual.` namespace.
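The effect of this filter can be illustrated with a few hypothetical key names (the real script iterates over the full Qwen-VL state dict):

```python
# Hypothetical key names illustrating the 'visual' substring filter;
# the real script iterates over model.state_dict() of Qwen/Qwen-VL-Chat.
sample_keys = [
    "transformer.wte.weight",           # language-model token embedding
    "transformer.visual.conv1.weight",  # vision patch embedding
    "transformer.visual.proj",          # vision projection (dropped in Step 3)
    "lm_head.weight",                   # language-model output head
]

vision_keys = [k for k in sample_keys if "visual" in k]
print(vision_keys)
# → ['transformer.visual.conv1.weight', 'transformer.visual.proj']
```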
Step 3: Strip Prefix and Exclude Projection
```python
if 'transformer.visual.proj' not in k:
    save_dict[k.replace('transformer.visual.', '')] = v
```
- Prefix stripping -- Removes `transformer.visual.` so keys become compatible with a standalone `VisionTransformer` (e.g., `transformer.visual.conv1.weight` becomes `conv1.weight`).
- Projection exclusion -- The `transformer.visual.proj` layer is an attention pooling projection specific to Qwen-VL's multimodal alignment. It is excluded because DeepSpeed-VisualChat uses its own projection layer (ViT linear or Perceiver cross-attention).
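Both rules can be sketched end to end on placeholder values (key names follow the Qwen-VL layout; the string values are stand-ins for real tensors):

```python
# Placeholder state dict: key names follow the Qwen-VL layout, values are
# stand-ins for the real tensors.
state_dict = {
    "transformer.visual.conv1.weight": "patch-embed",
    "transformer.visual.proj": "projection",
    "transformer.visual.ln_post.weight": "layer-norm",
    "lm_head.weight": "lm-head",
}

save_dict = {}
for k, v in state_dict.items():
    if "visual" in k and "transformer.visual.proj" not in k:
        save_dict[k.replace("transformer.visual.", "")] = v

print(sorted(save_dict))
# → ['conv1.weight', 'ln_post.weight']  (proj excluded, prefix stripped)
```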
Step 4: Save the Extracted Weights
```python
torch.save(save_dict, './qwen_clip/pytorch_model.bin')
```
The extracted weights are saved in PyTorch's standard `.bin` format to the `qwen_clip/` directory.
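Note that `torch.save` does not create missing parent directories, so the output directory must exist beforehand. A minimal sketch of ensuring this in Python (the path comes from the script; the `mkdir -p qwen_clip` shell step in the Usage Example serves the same purpose):

```python
from pathlib import Path

# torch.save fails if the parent directory is missing, so create it up front.
# Path taken from the extraction script itself.
out_dir = Path("./qwen_clip")
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "pytorch_model.bin"  # destination passed to torch.save
print(out_path)
```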
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Hugging Face model path | `"Qwen/Qwen-VL-Chat"` -- the full multimodal model |
| Output | torch state dict file | `./qwen_clip/pytorch_model.bin` -- vision encoder weights only |
Output State Dict Structure
The saved dictionary contains keys such as:
| Key Pattern | Shape (approximate) | Description |
|---|---|---|
| `conv1.weight` | `[1664, 3, 14, 14]` | Patch embedding convolution |
| `positional_embedding` | `[1025, 1664]` | Positional embeddings (1024 patches + 1 CLS) |
| `transformer.resblocks.*.ln_*.weight` | `[1664]` | Layer norm weights per block |
| `transformer.resblocks.*.attn.*` | varies | Self-attention parameters |
| `transformer.resblocks.*.mlp.*` | varies | MLP parameters |
| `ln_pre.weight` | `[1664]` | Pre-transformer layer norm |
| `ln_post.weight` | `[1664]` | Post-transformer layer norm |
The excluded key is:
- `attn_pool.*` (derived from `transformer.visual.proj`) -- the attention pooling projection layer
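A quick sanity check of an extracted checkpoint can be written against these patterns. This is a sketch: here `keys` is a hypothetical list standing in for `torch.load('./qwen_clip/pytorch_model.bin').keys()`.

```python
# `keys` stands in for torch.load('./qwen_clip/pytorch_model.bin').keys();
# entries follow the key patterns in the table above.
keys = [
    "conv1.weight",
    "positional_embedding",
    "transformer.resblocks.0.ln_1.weight",
    "transformer.resblocks.0.attn.in_proj_weight",
    "ln_pre.weight",
    "ln_post.weight",
]

# No saved key should retain the stripped prefix or the excluded pooling projection.
ok = all("transformer.visual." not in k and "attn_pool" not in k for k in keys)
print(ok)  # → True
```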
Usage Example
Running the Extraction Script
```bash
# Ensure the output directory exists
mkdir -p qwen_clip

# Run the extraction
python applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py
```
Loading the Extracted Encoder in DeepSpeed-VisualChat
After extraction, the weights are loaded in `create_dsvl_model_and_transforms`:

```python
vis_encoder = VisionTransformer(
    image_size=448,
    patch_size=vis_config.patch_size,
    width=vis_config.hidden_size,
    layers=vis_config.num_hidden_layers,
    heads=vis_config.num_attention_heads,
    mlp_size=vis_config.intermediate_size,
    output_dim=4096,
)

vis_encoder.load_state_dict(
    torch.load(os.path.join(args.vision_model_name_or_path, 'pytorch_model.bin'),
               map_location='cpu'),
    strict=True
)
```
Dependencies
- `transformers` -- Hugging Face Transformers library (with remote code execution support)
- `torch` -- PyTorch for model loading and state dict manipulation
Related Pages
- Principle:Microsoft_DeepSpeedExamples_Vision_Encoder_Extraction -- The theoretical basis for vision encoder extraction
- Implementation:Microsoft_DeepSpeedExamples_VisProjection -- The projection layer applied to the extracted encoder's output
- Implementation:Microsoft_DeepSpeedExamples_Create_DSVL_Model -- The model composition that loads the extracted encoder
- Environment:Microsoft_DeepSpeedExamples_VisualChat_Training_Environment