Implementation: Microsoft DeepSpeedExamples Extract Qwen VL
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | Extract_Qwen_VL |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| File | applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py |
| Lines | 1-14 |
| Language | Python |
| Status | Active |
Overview
Script for extracting the Qwen-VL vision encoder weights for standalone use in DeepSpeed-VisualChat.
Code Reference
The extraction script is a concise, self-contained Python file that loads the Qwen-VL-Chat model and saves only the vision encoder parameters:
Full Source (Lines 1-14)
```python
from transformers import AutoModelForCausalLM
import torch

PATH = "Qwen/Qwen-VL-Chat"

model = AutoModelForCausalLM.from_pretrained(PATH, device_map="cuda", trust_remote_code=True).eval()

state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        if 'transformer.visual.proj' not in k:  # we don't need the proj layer
            save_dict[k.replace('transformer.visual.', '')] = v

torch.save(save_dict, './qwen_clip/pytorch_model.bin')
```
Extraction Pattern
The script follows a four-step extraction pattern:
Step 1: Load the Full Model
```python
model = AutoModelForCausalLM.from_pretrained(
    PATH,
    device_map="cuda",
    trust_remote_code=True
).eval()
```
- `trust_remote_code=True` is required because Qwen-VL uses custom modeling code not yet merged into the Hugging Face transformers library.
- `device_map="cuda"` loads the model onto GPU memory.
- `.eval()` sets the model to evaluation mode (disables dropout, etc.).
Step 2: Filter State Dict for Vision Weights
```python
state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        ...
```
The condition `'visual' in k` selects only keys belonging to the vision encoder submodule. In the Qwen-VL architecture, all vision encoder parameters live under the `transformer.visual.` namespace.
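The effect of this filter can be illustrated with a few hypothetical key names (the real script iterates over the full Qwen-VL state dict):

```python
# Hypothetical key names illustrating the 'visual' substring filter;
# the real script iterates over model.state_dict() of Qwen/Qwen-VL-Chat.
sample_keys = [
    "transformer.wte.weight",           # language-model token embedding
    "transformer.visual.conv1.weight",  # vision patch embedding
    "transformer.visual.proj",          # vision projection (dropped in Step 3)
    "lm_head.weight",                   # language-model output head
]

vision_keys = [k for k in sample_keys if "visual" in k]
print(vision_keys)
# → ['transformer.visual.conv1.weight', 'transformer.visual.proj']
```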
Step 3: Strip Prefix and Exclude Projection
```python
if 'transformer.visual.proj' not in k:
    save_dict[k.replace('transformer.visual.', '')] = v
```
- Prefix stripping -- Removes `transformer.visual.` so keys become compatible with a standalone `VisionTransformer` (e.g., `transformer.visual.conv1.weight` becomes `conv1.weight`).
- Projection exclusion -- The `transformer.visual.proj` layer is an attention pooling projection specific to Qwen-VL's multimodal alignment. It is excluded because DeepSpeed-VisualChat uses its own projection layer (ViT linear or Perceiver cross-attention).
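Both rules can be sketched end to end on placeholder values (key names follow the Qwen-VL layout; the string values are stand-ins for real tensors):

```python
# Placeholder state dict: key names follow the Qwen-VL layout, values are
# stand-ins for the real tensors.
state_dict = {
    "transformer.visual.conv1.weight": "patch-embed",
    "transformer.visual.proj": "projection",
    "transformer.visual.ln_post.weight": "layer-norm",
    "lm_head.weight": "lm-head",
}

save_dict = {}
for k, v in state_dict.items():
    if "visual" in k and "transformer.visual.proj" not in k:
        save_dict[k.replace("transformer.visual.", "")] = v

print(sorted(save_dict))
# → ['conv1.weight', 'ln_post.weight']  (proj excluded, prefix stripped)
```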
Step 4: Save the Extracted Weights
```python
torch.save(save_dict, './qwen_clip/pytorch_model.bin')
```
The extracted weights are saved in PyTorch's standard `.bin` format to the `qwen_clip/` directory.
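Note that `torch.save` does not create missing parent directories, so the output directory must exist beforehand. A minimal sketch of ensuring this in Python (the path comes from the script; the `mkdir -p qwen_clip` shell step in the Usage Example serves the same purpose):

```python
from pathlib import Path

# torch.save fails if the parent directory is missing, so create it up front.
# Path taken from the extraction script itself.
out_dir = Path("./qwen_clip")
out_dir.mkdir(parents=True, exist_ok=True)

out_path = out_dir / "pytorch_model.bin"  # destination passed to torch.save
print(out_path)
```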
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | Hugging Face model path | `"Qwen/Qwen-VL-Chat"` -- the full multimodal model |
| Output | torch state dict file | `./qwen_clip/pytorch_model.bin` -- vision encoder weights only |
Output State Dict Structure
The saved dictionary contains keys such as:
| Key Pattern | Shape (approximate) | Description |
|---|---|---|
| `conv1.weight` | `[1664, 3, 14, 14]` | Patch embedding convolution |
| `positional_embedding` | `[1025, 1664]` | Positional embeddings (1024 patches + 1 CLS) |
| `transformer.resblocks.*.ln_*.weight` | `[1664]` | Layer norm weights per block |
| `transformer.resblocks.*.attn.*` | varies | Self-attention parameters |
| `transformer.resblocks.*.mlp.*` | varies | MLP parameters |
| `ln_pre.weight` | `[1664]` | Pre-transformer layer norm |
| `ln_post.weight` | `[1664]` | Post-transformer layer norm |
The excluded key is:
- `attn_pool.*` (derived from `transformer.visual.proj`) -- the attention pooling projection layer
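A quick sanity check of an extracted checkpoint can be written against these patterns. This is a sketch: here `keys` is a hypothetical list standing in for `torch.load('./qwen_clip/pytorch_model.bin').keys()`.

```python
# `keys` stands in for torch.load('./qwen_clip/pytorch_model.bin').keys();
# entries follow the key patterns in the table above.
keys = [
    "conv1.weight",
    "positional_embedding",
    "transformer.resblocks.0.ln_1.weight",
    "transformer.resblocks.0.attn.in_proj_weight",
    "ln_pre.weight",
    "ln_post.weight",
]

# No saved key should retain the stripped prefix or the excluded pooling projection.
ok = all("transformer.visual." not in k and "attn_pool" not in k for k in keys)
print(ok)  # → True
```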
Usage Example
Running the Extraction Script
```bash
# Ensure the output directory exists
mkdir -p qwen_clip

# Run the extraction
python applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py
```
Loading the Extracted Encoder in DeepSpeed-VisualChat
After extraction, the weights are loaded in `create_dsvl_model_and_transforms`:

```python
vis_encoder = VisionTransformer(
    image_size=448,
    patch_size=vis_config.patch_size,
    width=vis_config.hidden_size,
    layers=vis_config.num_hidden_layers,
    heads=vis_config.num_attention_heads,
    mlp_size=vis_config.intermediate_size,
    output_dim=4096,
)

vis_encoder.load_state_dict(
    torch.load(os.path.join(args.vision_model_name_or_path, 'pytorch_model.bin'),
               map_location='cpu'),
    strict=True
)
```
Dependencies
- `transformers` -- Hugging Face Transformers library (with remote code execution support)
- `torch` -- PyTorch for model loading and state dict manipulation
Related Pages
- Principle:Microsoft_DeepSpeedExamples_Vision_Encoder_Extraction -- The theoretical basis for vision encoder extraction
- Implementation:Microsoft_DeepSpeedExamples_VisProjection -- The projection layer applied to the extracted encoder's output
- Implementation:Microsoft_DeepSpeedExamples_Create_DSVL_Model -- The model composition that loads the extracted encoder
- Environment:Microsoft_DeepSpeedExamples_VisualChat_Training_Environment