# Heuristic: OpenGVLab InternVL Pixel Shuffle Downsampling
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Optimization, Architecture |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Pixel shuffle downsampling with `scale_factor=0.5` reduces ViT output tokens by 4x while preserving information in the channel dimension, and `ps_version=v2` fixes a transposition bug in v1.
## Description
After the InternViT vision encoder produces patch embeddings, a pixel shuffle operation reorganizes spatial dimensions into channels, effectively reducing the token count by a factor of `1/scale_factor^2`. With the default `scale_factor=0.5`, this means a 4x reduction: a 32x32 grid of 1024-dim tokens becomes a 16x16 grid of 4096-dim tokens. This is critical for keeping the total token count manageable when processing multiple high-resolution tiles. Important: `ps_version=v1` has a known bug where height and width are not swapped back after the shuffle, producing transposed spatial features. Always use `ps_version=v2`.
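The shape arithmetic can be sketched standalone. The following is a NumPy mock of the same view/permute sequence (an illustrative re-implementation, not the actual model code), showing a 32x32 grid of 1024-dim tokens becoming a 16x16 grid of 4096-dim tokens:

```python
import numpy as np

def pixel_shuffle(x, scale_factor=0.5):
    """NumPy sketch of InternVL's pixel shuffle (v2 behavior)."""
    n, w, h, c = x.shape
    # Fold half of H into the channel dimension: (N, W, H/2, 2C)
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    # Swap W and H axes so W can be folded next: (N, H/2, W, 2C)
    x = x.transpose(0, 2, 1, 3)
    # Fold half of W into channels: (N, H/2, W/2, 4C)
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor),
                  int(c / (scale_factor ** 2)))
    # v2: swap H and W back to their original order
    return x.transpose(0, 2, 1, 3)

x = np.zeros((1, 32, 32, 1024), dtype=np.float32)
print(pixel_shuffle(x).shape)  # (1, 16, 16, 4096): 256 tokens, 4096-dim each
```

The token count drops from 32 * 32 = 1024 to 16 * 16 = 256, while the total number of values (tokens * channels) is unchanged.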
## Usage
This heuristic is applied automatically during feature extraction in `InternVLChatModel.extract_feature()`. The `ps_version` parameter defaults to `v2` in all current training configurations. Only relevant if you are loading older checkpoints trained with `ps_version=v1`.
## The Insight (Rule of Thumb)
- Action: Use `ps_version='v2'` and `downsample_ratio=0.5` (defaults).
- Value: 4x token reduction per tile (e.g., 1024 tokens to 256 tokens per 448x448 tile).
- Trade-off: Spatial resolution is halved in each dimension, but information is preserved in the expanded channel dimension via the MLP projector.
## Reasoning
Without downsampling, each 448x448 image tile at patch_size=14 produces (448/14)^2 = 1024 tokens. With 12 tiles + thumbnail, that would be 13,312 tokens per image, which is prohibitively expensive for the language model. The pixel shuffle reduces this to 256 tokens per tile (3,328 total), making multimodal training feasible.
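The token budget above can be checked with a few lines of arithmetic (values taken directly from the text; `tiles = 13` assumes the 12-tile + thumbnail configuration described here):

```python
# Token budget per image, with and without pixel shuffle downsampling.
image_size, patch_size, ratio = 448, 14, 0.5
tokens_raw = (image_size // patch_size) ** 2   # 1024 tokens per tile
tokens_ds = int(tokens_raw * ratio ** 2)       # 256 tokens per tile
tiles = 12 + 1                                 # 12 tiles + 1 thumbnail
print(tokens_raw * tiles)  # 13312 tokens without downsampling
print(tokens_ds * tiles)   # 3328 tokens with downsampling
```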
The v1 bug is documented via a `warnings.warn()` call in the source code. The transpose error means that spatial relationships along the height axis are swapped with the width axis, leading to degraded spatial understanding.
## Code Evidence
From `modeling_internvl_chat.py:257-271`:
```python
def pixel_shuffle(self, x, scale_factor=0.5):
    n, w, h, c = x.size()
    x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    x = x.view(n, int(h * scale_factor), int(w * scale_factor),
               int(c / (scale_factor * scale_factor)))
    if self.ps_version == 'v1':
        warnings.warn("In ps_version 'v1', the height and width have "
                      "not been swapped back, which results in a "
                      "transposed image.")
    else:
        x = x.permute(0, 2, 1, 3).contiguous()
    return x
```
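To make the v1 bug concrete, a small NumPy mock of the same operation (a hypothetical helper for illustration, not the model code) shows that the v1 output is exactly the height/width transpose of the v2 output:

```python
import numpy as np

def pixel_shuffle(x, scale_factor=0.5, ps_version='v2'):
    """NumPy sketch of InternVL's pixel shuffle, with the v1/v2 branch."""
    n, w, h, c = x.shape
    x = x.reshape(n, w, int(h * scale_factor), int(c / scale_factor))
    x = x.transpose(0, 2, 1, 3)
    x = x.reshape(n, int(h * scale_factor), int(w * scale_factor),
                  int(c / (scale_factor ** 2)))
    if ps_version == 'v2':
        x = x.transpose(0, 2, 1, 3)  # v1 skips this swap-back: the bug
    return x

x = np.arange(4 * 4 * 16, dtype=np.float32).reshape(1, 4, 4, 16)
v1 = pixel_shuffle(x, ps_version='v1')
v2 = pixel_shuffle(x, ps_version='v2')
# v1's H and W axes are swapped relative to v2's
assert np.array_equal(v2, v1.transpose(0, 2, 1, 3))
```

This is why checkpoints trained with v1 cannot simply be run with v2: the spatial layout the projector and language model learned is transposed.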
Token count formula from `modeling_internvl_chat.py:57`:
```python
self.num_image_token = int((image_size // patch_size) ** 2 *
                           (config.downsample_ratio ** 2))
# (448 // 14) ** 2 * (0.5) ** 2 = 1024 * 0.25 = 256 tokens per tile
```