Heuristic:OpenGVLab InternVL Dynamic Resolution Tiling
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Optimization, Preprocessing |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Aspect-ratio-aware image tiling strategy that dynamically selects the optimal tile grid (1-12 tiles of 448x448) based on image dimensions, with an optional thumbnail tile for global context.
Description
InternVL uses a dynamic resolution preprocessing strategy that avoids the information loss from naive center-crop or fixed-resolution resizing. Instead, it enumerates all valid tile grid configurations (e.g., 1x1, 1x2, 2x1, 2x2, ..., up to `max_num` tiles), picks the grid whose aspect ratio is closest to that of the original image, then resizes the image to that grid and splits it into 448x448 tiles. An optional thumbnail (a downscaled copy of the full image) is appended to provide global context that individual tiles lack.
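The grid selection described above can be sketched in a few lines. This is a simplified illustration, not the repository's code: `enumerate_grids` and `closest_grid` are hypothetical names, and the tie-breaking rule (fewest tiles wins) is an assumption. The repository's `find_closest_aspect_ratio` also receives the image width, height, and tile size, so it may resolve ties differently.

```python
def enumerate_grids(min_num=1, max_num=12):
    """Return every (cols, rows) grid whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count, then by column count, so iteration order is deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def closest_grid(width, height, grids):
    """Pick the grid whose cols/rows ratio is nearest to width/height.

    Assumption (not taken from the source): on a tie, the earliest grid in
    `grids` wins, which means the one with the fewest tiles.
    """
    aspect = width / height
    return min(grids, key=lambda g: abs(aspect - g[0] / g[1]))


grids = enumerate_grids(1, 12)
cols, rows = closest_grid(1920, 1080, grids)
# The image would then be resized to (cols * 448, rows * 448) before splitting.
print(len(grids), (cols, rows), (cols * 448, rows * 448))
```

Under this sketch's tie-breaking rule, a 1920x1080 image maps to a 2x1 grid, because 2:1 is the nearest available ratio to 16:9 and the smallest grid with that ratio is chosen.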
Usage
Apply this heuristic when processing images for any InternVL workflow. It is enabled by default in all training scripts via `--dynamic_image_size True`. Configure `--min_dynamic_patch` (default: 1) and `--max_dynamic_patch` (default: 12) to control the tile count range. Enable `--use_thumbnail True` to add the global context thumbnail.
The Insight (Rule of Thumb)
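- Action: Use `dynamic_preprocess()` with `min_num=1`, `max_num=12`, `image_size=448`, `use_thumbnail=True`. Pass these values explicitly: the function signature quoted under Code Evidence defaults to `max_num=6` and `use_thumbnail=False`.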
- Value: Tile count range of 1-12 (default). Each tile is 448x448 pixels.
- Trade-off: More tiles preserve more detail but increase token count (and thus memory/compute). The thumbnail adds 1 extra tile, and only when the image is split into more than one tile, but provides crucial global context.
Reasoning
Fixed-resolution preprocessing (e.g., center-crop to 448x448) discards spatial information, especially for high-resolution or wide/tall images. Dynamic tiling approximately preserves aspect ratio and retains more of the original resolution while keeping individual tile sizes compatible with the ViT input. Because the image is resized (not padded) to the chosen grid, the closest-aspect-ratio selection minimizes the distortion that resizing introduces. The thumbnail provides the model with a global view that individual tiles cannot, which is important for understanding spatial relationships in the full image.
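To make the splitting step concrete, the sketch below computes the crop boxes for a resized image. It is an illustration under stated assumptions: `tile_boxes` is a hypothetical helper, and the row-major tile order is assumed, not confirmed by the source. Each box uses the `(left, upper, right, lower)` layout that Pillow's `Image.crop` accepts.

```python
def tile_boxes(cols, rows, image_size=448):
    """Return (left, upper, right, lower) crop boxes for a cols x rows grid.

    Assumption: tiles are listed in row-major order (left to right, then
    top to bottom). The image is expected to have been resized to
    (cols * image_size, rows * image_size) beforehand.
    """
    boxes = []
    for index in range(cols * rows):
        col = index % cols   # horizontal position of this tile in the grid
        row = index // cols  # vertical position of this tile in the grid
        boxes.append((
            col * image_size,
            row * image_size,
            (col + 1) * image_size,
            (row + 1) * image_size,
        ))
    return boxes


# A 2x1 grid covers an 896x448 image with two side-by-side tiles.
for box in tile_boxes(2, 1):
    print(box)
```

With Pillow available, each box could be passed to `resized_img.crop(box)` to obtain one 448x448 tile.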
The number of image tokens per tile is `(448 / 14)^2 * (0.5)^2 = 256`, where 14 is the ViT patch size and 0.5 is the downsample ratio. At the maximum of 12 tiles plus 1 thumbnail, an image yields 13 tiles, or `13 * 256 = 3328` tokens.
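The arithmetic above can be checked with a short sketch. `tokens_per_tile` and `image_tokens` are hypothetical helpers written for this page. The default values (patch size 14, downsample ratio 0.5) are taken from the formula above, and the rule that a single-tile image gets no thumbnail mirrors the `len(processed_images) != 1` check in the quoted `dynamic_preprocess` excerpt.

```python
def tokens_per_tile(image_size=448, patch_size=14, downsample_ratio=0.5):
    """Tokens produced by one tile: patches per side squared, then downsampled."""
    return int((image_size // patch_size) ** 2 * (downsample_ratio ** 2))


def image_tokens(num_tiles, use_thumbnail=True, **tile_kwargs):
    """Total image tokens for `num_tiles` tiles plus an optional thumbnail.

    The thumbnail is counted only when the image was split into more than
    one tile, matching the condition in the quoted excerpt.
    """
    extra = 1 if use_thumbnail and num_tiles != 1 else 0
    return (num_tiles + extra) * tokens_per_tile(**tile_kwargs)


for tiles in (1, 6, 12):
    print(tiles, image_tokens(tiles))
```

The per-image token budget grows linearly with the tile count, so lowering `--max_dynamic_patch` is the direct lever for reducing sequence length.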
Code Evidence
From `dataset.py:830-866`:
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448,
                       use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # resize and split into tiles
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    resized_img = image.resize((target_width, target_height))
    # ... split into blocks ...
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
Token count calculation from `modeling_internvl_chat.py:57`:
self.num_image_token = int((image_size // patch_size) ** 2 *
                           (config.downsample_ratio ** 2))