Principle:OpenGVLab InternVL Dynamic Resolution Preprocessing

Knowledge Sources	InternVL 2.5, InternVL 1.5, InternVL
Domains	Computer_Vision, Preprocessing, Vision_Language
Last Updated	2026-02-07 00:00 GMT

Overview

A resolution-preserving image preprocessing technique that partitions input images into aspect-ratio-aware tiles, so that images of arbitrary size can be handled with minimal distortion.

Description

Dynamic resolution preprocessing addresses a fundamental limitation of fixed-resolution vision transformers: forcing all images to a single resolution causes information loss for high-resolution inputs and wastes computation on low-resolution ones. Instead of resizing all images to a fixed size (e.g., 448x448), this technique:

Analyzes the input image's aspect ratio
Selects an optimal tiling configuration from a predefined set of candidate layouts
Splits the image into tiles that best match its original aspect ratio
Optionally adds a global thumbnail for coarse-grained context

This lets the model process images close to their native resolution while the vision encoder still receives fixed-size tiles. With 448x448 tiles, for example, a 1024x768 image might be split into a 2x2 grid, while a panorama 448 pixels tall and 1792 pixels wide might use a 1x4 layout (one row of four tiles).
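
As a rough illustration of the two examples above, the sketch below computes the canvas that a chosen layout implies and the per-axis scale factors applied to the image. The function name `canvas_for_layout` and the 448-pixel default tile size are assumptions made for this example, not names taken from the InternVL code base.

```python
def canvas_for_layout(width, height, rows, cols, tile_size=448):
    """Return the resized canvas (width, height) and the per-axis scale factors."""
    canvas_w = cols * tile_size
    canvas_h = rows * tile_size
    scale = (canvas_w / width, canvas_h / height)
    return (canvas_w, canvas_h), scale


# A 1024x768 image on a 2x2 grid is resized to an 896x896 canvas.
print(canvas_for_layout(1024, 768, rows=2, cols=2))
# A panorama 1792 wide and 448 tall fits a 1x4 grid without any rescaling.
print(canvas_for_layout(1792, 448, rows=1, cols=4))
```

Scale factors of 1.0 on both axes mean the layout matches the image exactly. The further apart the two factors are, the more the image is stretched, which is what the best-fit selection tries to keep small.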

Usage

Use this principle when processing images for a vision-language model that supports multi-tile input. It is the appropriate strategy when:

Input images have diverse aspect ratios and resolutions
Fine-grained visual detail (OCR, charts, diagrams) must be preserved
The vision encoder expects fixed-size tile inputs
A balance between visual fidelity and computational cost is needed

Theoretical Basis

The dynamic resolution algorithm works in three phases:

Phase 1: Candidate enumeration

# Generate all valid (rows, cols) layouts whose tile count lies in [min_num, max_num]
min_num, max_num = 1, 12  # example tile budget (assumed values)
candidates = [(r, c) for r in range(1, max_num + 1) for c in range(1, max_num + 1)
              if min_num <= r * c <= max_num]
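
The enumeration can be wrapped in a small function, sketched here under the hypothetical name `enumerate_layouts`. It bounds the inner loop by `max_num // rows`, which yields the same set of layouts as the comprehension above with fewer iterations, and it sorts the result by tile count purely for readability.

```python
def enumerate_layouts(min_num: int, max_num: int) -> list[tuple[int, int]]:
    """Return every (rows, cols) grid with min_num <= rows * cols <= max_num."""
    layouts = [
        (rows, cols)
        for rows in range(1, max_num + 1)
        for cols in range(1, max_num // rows + 1)
        if rows * cols >= min_num
    ]
    # Sort by tile count first, then by (rows, cols) for a stable, readable order.
    return sorted(layouts, key=lambda rc: (rc[0] * rc[1], rc))


print(enumerate_layouts(1, 4))
print(len(enumerate_layouts(1, 12)))  # 35 candidate layouts
```

The candidate set grows slowly with the tile budget, so scoring every layout in the next phase stays cheap.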

Phase 2: Best-fit selection

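import math

# Score each candidate by aspect-ratio mismatch plus area mismatch (lower is better).
# width, height: size of the input image in pixels.
image_aspect, image_area, tile_area = width / height, width * height, tile_size ** 2
best_score, best_layout = math.inf, None
for rows, cols in candidates:
    candidate_aspect = (cols * tile_size) / (rows * tile_size)
    aspect_error = abs(math.log(candidate_aspect) - math.log(image_aspect))
    canvas_area = rows * cols * tile_area
    area_ratio = min(image_area, canvas_area) / max(image_area, canvas_area)
    score = aspect_error + (1 - area_ratio)
    if score < best_score:  # keep the configuration with the minimum score
        best_score, best_layout = score, (rows, cols)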
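
A self-contained sketch of the selection step follows. It implements the scoring described in this section; the released InternVL code may rank candidates differently. The function name `select_layout` and the defaults (`min_num=1`, `max_num=12`, `tile_size=448`) are assumptions chosen for illustration.

```python
import math


def select_layout(width, height, min_num=1, max_num=12, tile_size=448):
    """Return the (rows, cols) grid with the lowest combined score."""
    image_aspect = width / height
    image_area = width * height
    tile_area = tile_size * tile_size
    best_score, best_layout = math.inf, None
    for rows in range(1, max_num + 1):
        for cols in range(1, max_num // rows + 1):
            if rows * cols < min_num:
                continue
            # Log-ratio distance treats "twice as wide" and "twice as tall" symmetrically.
            aspect_error = abs(math.log(cols / rows) - math.log(image_aspect))
            canvas_area = rows * cols * tile_area
            area_ratio = min(image_area, canvas_area) / max(image_area, canvas_area)
            score = aspect_error + (1.0 - area_ratio)
            if score < best_score:
                best_score, best_layout = score, (rows, cols)
    return best_layout


print(select_layout(1024, 768))  # (2, 2)
print(select_layout(1792, 448))  # (1, 4): one row, four columns
```

For the 1024x768 image, a 3x4 grid matches the aspect ratio exactly but would upscale the image roughly threefold in area, so the area term steers the choice toward the 2x2 grid instead.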

Phase 3: Tile extraction

# Resize to (width, height) = (cols * tile_size, rows * tile_size), then split row by row
resized = image.resize((cols * tile_size, rows * tile_size))
tiles = [resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
         for r in range(rows) for c in range(cols)]
# A thumbnail of a single-tile layout would duplicate that tile, so skip it in that case
if use_thumbnail and len(tiles) > 1:
    tiles.append(image.resize((tile_size, tile_size)))
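
The tile geometry can be checked without an imaging library. The sketch below computes the crop boxes in the `(left, upper, right, lower)` convention that the `crop` call above uses, and counts how many inputs reach the vision encoder. The names `tile_boxes` and `num_encoder_inputs` are hypothetical, and the count assumes the thumbnail is skipped for a single-tile layout, where it would duplicate the only tile.

```python
def tile_boxes(rows, cols, tile_size=448):
    """Return crop boxes (left, upper, right, lower) in row-major order."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


def num_encoder_inputs(rows, cols, use_thumbnail=True):
    """Count tiles, plus one thumbnail when there is more than one tile."""
    tiles = rows * cols
    return tiles + 1 if use_thumbnail and tiles > 1 else tiles


print(tile_boxes(2, 2))
print(num_encoder_inputs(2, 2), num_encoder_inputs(1, 1))
```

Because every box has the same side length, the encoder sees a uniform batch of tiles regardless of the original image size; only the batch length varies.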

Related Pages

Implemented By

Implementation:OpenGVLab_InternVL_Dynamic_Preprocess

Uses Heuristic

Heuristic:OpenGVLab_InternVL_Dynamic_Resolution_Tiling
