
Implementation:Microsoft DeepSpeedExamples Extract Qwen VL

From Leeroopedia



Metadata

| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | Extract_Qwen_VL |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| File | applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py |
| Lines | 1-14 |
| Language | Python |
| Status | Active |

Overview

This script extracts the vision encoder weights from the Qwen-VL-Chat model so the encoder can be used standalone in DeepSpeed-VisualChat.

Code Reference

The extraction script is a concise, self-contained Python file that loads the Qwen-VL-Chat model and saves only the vision encoder parameters:

Full Source (Lines 1-14)

from transformers import AutoModelForCausalLM
import torch

PATH = "Qwen/Qwen-VL-Chat"

model = AutoModelForCausalLM.from_pretrained(PATH, device_map="cuda", trust_remote_code=True).eval()

state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        if 'transformer.visual.proj' not in k:  # we don't need the proj layer
            save_dict[k.replace('transformer.visual.', '')] = v
torch.save(save_dict, './qwen_clip/pytorch_model.bin')

Extraction Pattern

The script follows a four-step extraction pattern:

Step 1: Load the Full Model

model = AutoModelForCausalLM.from_pretrained(
    PATH,
    device_map="cuda",
    trust_remote_code=True
).eval()
  • trust_remote_code=True is required because Qwen-VL uses custom modeling code not yet merged into the Hugging Face transformers library.
  • device_map="cuda" loads the model onto GPU memory.
  • .eval() sets the model to evaluation mode (disables dropout, etc.).

Step 2: Filter State Dict for Vision Weights

state_dict = model.state_dict()
save_dict = {}
for k, v in state_dict.items():
    if 'visual' in k:
        ...

The condition 'visual' in k selects only keys belonging to the vision encoder submodule. In the Qwen-VL architecture, all vision encoder parameters live under the transformer.visual. namespace.
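The selection step can be illustrated on a miniature mock state dict. The key names below are illustrative stand-ins, not the real Qwen-VL parameter list:

```python
# Hypothetical state-dict keys mimicking Qwen-VL's naming scheme.
state_dict = {
    "transformer.wte.weight": "language embedding",
    "transformer.visual.conv1.weight": "patch embedding",
    "transformer.visual.ln_pre.weight": "pre-norm",
    "lm_head.weight": "output head",
}

# Keep only keys that belong to the vision encoder submodule.
vision_keys = [k for k in state_dict if "visual" in k]
print(vision_keys)
```

Substring matching on 'visual' is safe here only because no language-model parameter in Qwen-VL contains that token; a stricter filter would test `k.startswith("transformer.visual.")`.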

Step 3: Strip Prefix and Exclude Projection

if 'transformer.visual.proj' not in k:
    save_dict[k.replace('transformer.visual.', '')] = v
  • Prefix stripping -- Removes transformer.visual. so keys become compatible with a standalone VisionTransformer (e.g., transformer.visual.conv1.weight becomes conv1.weight).
  • Projection exclusion -- transformer.visual.proj is the final output projection that Qwen-VL applies (after its attention pooling stage) to align vision features with its language model. It is excluded because DeepSpeed-VisualChat supplies its own projection layer (ViT linear or Perceiver cross-attention).
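Steps 2 and 3 together amount to a single filter-and-rename pass over the state dict. A self-contained sketch, again with made-up key names standing in for the real parameters:

```python
# Made-up keys standing in for the real Qwen-VL state dict.
state_dict = {
    "transformer.wte.weight": 0,
    "transformer.visual.conv1.weight": 1,
    "transformer.visual.proj": 2,  # excluded: DeepSpeed-VisualChat has its own projection
    "transformer.visual.ln_post.weight": 3,
}

save_dict = {}
for k, v in state_dict.items():
    if "visual" in k and "transformer.visual.proj" not in k:
        # Strip the prefix so keys match a standalone VisionTransformer.
        save_dict[k.replace("transformer.visual.", "")] = v

print(sorted(save_dict))  # ['conv1.weight', 'ln_post.weight']
```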

Step 4: Save the Extracted Weights

torch.save(save_dict, './qwen_clip/pytorch_model.bin')

The extracted weights are saved in PyTorch's standard .bin format to the qwen_clip/ directory.
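A minimal save/load round trip (using a temporary directory rather than the script's hard-coded qwen_clip/ path, and toy tensors rather than real encoder weights) confirms that the filtered dict survives serialization intact:

```python
import os
import tempfile

import torch

# Toy stand-ins for the extracted vision-encoder tensors.
save_dict = {"conv1.weight": torch.zeros(4, 3, 2, 2), "ln_post.weight": torch.ones(4)}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "pytorch_model.bin")
    torch.save(save_dict, path)
    # map_location='cpu' mirrors how DeepSpeed-VisualChat later loads the file.
    loaded = torch.load(path, map_location="cpu")

print(sorted(loaded))  # same keys come back out
```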

I/O Contract

| Direction | Type | Description |
|---|---|---|
| Input | Hugging Face model path | "Qwen/Qwen-VL-Chat" -- the full multimodal model |
| Output | torch state dict file | ./qwen_clip/pytorch_model.bin -- vision encoder weights only |

Output State Dict Structure

The saved dictionary contains keys such as:

| Key Pattern | Shape (approximate) | Description |
|---|---|---|
| conv1.weight | [1664, 3, 14, 14] | Patch embedding convolution |
| positional_embedding | [1025, 1664] | Positional embeddings (1024 patches + 1 CLS) |
| transformer.resblocks.*.ln_*.weight | [1664] | Layer norm weights per block |
| transformer.resblocks.*.attn.* | varies | Self-attention parameters |
| transformer.resblocks.*.mlp.* | varies | MLP parameters |
| ln_pre.weight | [1664] | Pre-transformer layer norm |
| ln_post.weight | [1664] | Post-transformer layer norm |
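The positional embedding row count in the table follows directly from the image and patch sizes. A quick check of that arithmetic, assuming the 448x448 input, 14x14 patches, and single CLS token stated above:

```python
image_size, patch_size = 448, 14
patches_per_side = image_size // patch_size  # 32 patches along each side
num_patches = patches_per_side ** 2          # 32 * 32 = 1024 patches
num_positions = num_patches + 1              # +1 CLS token
print(num_positions)  # 1025
```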

The excluded key is:

  • proj (from transformer.visual.proj) -- Qwen-VL's final output projection, which DeepSpeed-VisualChat replaces with its own projection layer

Usage Example

Running the Extraction Script

# Ensure the output directory exists
mkdir -p qwen_clip

# Run the extraction
python applications/DeepSpeed-VisualChat/helper/extract_qwen_vl.py

Loading the Extracted Encoder in DeepSpeed-VisualChat

After extraction, the weights are loaded in create_dsvl_model_and_transforms:

vis_encoder = VisionTransformer(
    image_size=448,
    patch_size=vis_config.patch_size,
    width=vis_config.hidden_size,
    layers=vis_config.num_hidden_layers,
    heads=vis_config.num_attention_heads,
    mlp_size=vis_config.intermediate_size,
    output_dim=4096,
)
vis_encoder.load_state_dict(
    torch.load(os.path.join(args.vision_model_name_or_path, 'pytorch_model.bin'),
               map_location='cpu'),
    strict=True
)
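strict=True makes the load fail loudly if the extracted file is missing any parameter the encoder expects, which catches an incomplete extraction immediately. A toy illustration of that behavior with a small nn.Linear (not the actual VisionTransformer):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
full = layer.state_dict()

# A complete state dict loads cleanly under strict=True.
layer.load_state_dict(full, strict=True)

# Dropping a key simulates an incomplete extraction: strict=True raises.
incomplete = {k: v for k, v in full.items() if k != "bias"}
try:
    layer.load_state_dict(incomplete, strict=True)
    failed_loudly = False
except RuntimeError:
    failed_loudly = True

print(failed_loudly)  # True
```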

Dependencies

  • transformers -- Hugging Face Transformers library (with remote code execution support)
  • torch -- PyTorch for model loading and state dict manipulation
