Heuristic:OpenGVLab InternVL Dynamic Resolution Tiling

Knowledge Sources	OpenGVLab/InternVL InternVL image preprocessing
Domains	Computer_Vision, Optimization, Preprocessing
Last Updated	2026-02-07 14:00 GMT

Overview

An aspect-ratio-aware image tiling strategy that selects the tile grid (1 to 12 tiles of 448x448 pixels) whose shape best matches the image's aspect ratio, with an optional thumbnail tile for global context.

Description

InternVL uses a dynamic resolution preprocessing strategy that avoids the information loss of naive center-cropping or fixed-resolution resizing. It enumerates all valid tile grid configurations (e.g., 1x1, 1x2, 2x1, 2x2, ..., up to `max_num` tiles), picks the grid whose aspect ratio is closest to that of the original image, then resizes the image to that grid and splits it into 448x448 tiles. An optional thumbnail (the full image resized to a single 448x448 tile) is appended to provide global context that individual tiles lack.
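The enumeration and selection steps described above can be sketched in a few lines. This is a simplified illustration, not the repository's code: `enumerate_grids` and `pick_grid` are hypothetical names, and the tie-breaking rule (prefer the grid with fewer tiles) is an assumption; the repository's `find_closest_aspect_ratio` may resolve ties differently.

```python
def enumerate_grids(min_num=1, max_num=12):
    """Return every (cols, rows) grid with min_num <= cols * rows <= max_num,
    sorted by tile count (hypothetical helper for illustration)."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(width, height, grids):
    """Pick the grid whose cols/rows ratio is closest to width/height.

    Assumption: on a tie, min() keeps the first candidate, which is the
    grid with the fewest tiles because `grids` is sorted by tile count.
    """
    aspect_ratio = width / height
    return min(grids, key=lambda g: abs(aspect_ratio - g[0] / g[1]))


grids = enumerate_grids(1, 12)
print(len(grids))                     # number of candidate grids
print(pick_grid(1344, 448, grids))    # a 3:1 image
print(pick_grid(1920, 1080, grids))   # a 16:9 image
```

Under these assumptions a 3:1 image maps to a 3x1 grid, and a 16:9 image maps to a 2:1 grid because no grid of at most 12 tiles is closer to 16:9.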

Usage

Apply this heuristic when processing images for any InternVL workflow. The training scripts enable it by passing `--dynamic_image_size True`. Set `--min_dynamic_patch` (default: 1) and `--max_dynamic_patch` (default: 12) to control the tile count range, and pass `--use_thumbnail True` to append the global context thumbnail.

The Insight (Rule of Thumb)

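Action: Use `dynamic_preprocess()` with `min_num=1`, `max_num=12`, `image_size=448`, `use_thumbnail=True`. Pass `max_num` and `use_thumbnail` explicitly, because the function signature quoted under Code Evidence defaults to `max_num=6` and `use_thumbnail=False`.
Value: Tile count range of 1-12 (the default of the `--min_dynamic_patch` and `--max_dynamic_patch` flags). Each tile is 448x448 pixels.
Trade-off: More tiles preserve more detail but increase token count (and thus memory and compute). The thumbnail adds 1 extra tile, and only when the grid has more than one tile, but provides global context.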

Reasoning

Fixed-resolution preprocessing (e.g., center-crop to 448x448) discards spatial information, especially for high-resolution or wide/tall images. Dynamic tiling approximately preserves the aspect ratio and retains more resolution while keeping each tile compatible with the ViT input size. Selecting the grid with the closest aspect ratio minimizes the distortion introduced when the image is resized to fit the grid (the resize stretches the image; no padding is added). The thumbnail gives the model a global view that individual tiles cannot, which is important for understanding spatial relationships across the full image.
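To make the distortion argument concrete, the horizontal-to-vertical stretch caused by resizing an image onto a grid can be computed directly. The helper name `stretch_factor` is hypothetical and used only for illustration; a value of 1.0 means the aspect ratio is preserved exactly.

```python
def stretch_factor(width, height, cols, rows):
    """Ratio of horizontal to vertical scaling when a width x height image
    is resized onto a cols x rows grid of square tiles (1.0 = undistorted)."""
    return (cols / rows) / (width / height)


# A 1920x1080 image squeezed into a single square tile versus a 2x1 grid.
print(stretch_factor(1920, 1080, 1, 1))
print(stretch_factor(1920, 1080, 2, 1))
```

For this 16:9 image, the single square tile compresses the width to roughly 0.56 of its relative size, while the 2x1 grid stretches it by only about 1.13.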

Each tile yields `(448 / 14)^2 * (0.5)^2 = 256` image tokens, where 14 is the ViT patch size and 0.5 is the downsample ratio. At the maximum of 12 tiles plus 1 thumbnail, an image occupies 13 tiles, or 13 * 256 = 3328 tokens.
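The arithmetic can be checked with a short sketch. The function names are hypothetical; the per-tile expression follows the `num_image_token` computation quoted under Code Evidence, and the thumbnail rule (no thumbnail for a single-tile grid) follows the `dynamic_preprocess` excerpt.

```python
def tokens_per_tile(image_size=448, patch_size=14, downsample_ratio=0.5):
    """Image tokens produced by one tile."""
    return int((image_size // patch_size) ** 2 * downsample_ratio ** 2)


def tokens_per_image(num_tiles, use_thumbnail=True):
    """Total image tokens for a grid of num_tiles tiles.

    The thumbnail is only counted when the grid has more than one tile.
    """
    extra = 1 if use_thumbnail and num_tiles != 1 else 0
    return (num_tiles + extra) * tokens_per_tile()


for n in (1, 6, 12):
    print(n, tokens_per_image(n))
```

Lowering `max_num` from 12 to 6 roughly halves the worst-case token count per image, which is the main lever for trading detail against memory.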

Code Evidence

From `dataset.py:830-866`:

def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448,
                       use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) grid with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # resize and split into tiles
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    resized_img = image.resize((target_width, target_height))
    # ... split into blocks ...

    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
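The split step is elided in the excerpt above. As a rough sketch of what such a step could look like, the crop boxes for a grid can be computed as follows, assuming tiles are taken in row-major order. `tile_boxes` and `tiles_returned` are hypothetical helpers, not functions from the repository; each box is in the `(left, upper, right, lower)` form that Pillow's `Image.crop` accepts.

```python
def tile_boxes(cols, rows, image_size=448):
    """Return (left, upper, right, lower) crop boxes for a cols x rows grid,
    in row-major order (assumed ordering, for illustration)."""
    boxes = []
    for index in range(cols * rows):
        col, row = index % cols, index // cols
        boxes.append((
            col * image_size,
            row * image_size,
            (col + 1) * image_size,
            (row + 1) * image_size,
        ))
    return boxes


def tiles_returned(cols, rows, use_thumbnail=True):
    """Number of images returned, following the excerpt's thumbnail rule:
    the thumbnail is appended only when the grid has more than one tile."""
    blocks = cols * rows
    return blocks + 1 if use_thumbnail and blocks != 1 else blocks


print(tile_boxes(2, 1))
print(tiles_returned(4, 3))
```

With a resized image in hand, each box could be passed to `resized_img.crop(box)` to obtain one tile.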

Token count calculation from `modeling_internvl_chat.py:57`:

self.num_image_token = int((image_size // patch_size) ** 2 *
                           (config.downsample_ratio ** 2))
