Principle:OpenGVLab InternVL Dynamic Resolution Preprocessing

From Leeroopedia


Knowledge Sources
Domains Computer_Vision, Preprocessing, Vision_Language
Last Updated 2026-02-07 00:00 GMT

Overview

An image preprocessing technique that partitions each input image into aspect-ratio-aware, fixed-size tiles, so that arbitrary image sizes can be handled with little loss of resolution and minimal aspect-ratio distortion.

Description

Dynamic resolution preprocessing addresses a fundamental limitation of fixed-resolution vision transformers: forcing all images to a single resolution causes information loss for high-resolution inputs and wastes computation on low-resolution ones. Instead of resizing all images to a fixed size (e.g., 448x448), this technique:

  1. Analyzes the input image's aspect ratio
  2. Selects an optimal tiling configuration from a predefined set of candidate layouts
  3. Splits the image into tiles that best match its original aspect ratio
  4. Optionally adds a global thumbnail for coarse-grained context

This lets the model process images close to their native resolution while keeping a fixed per-tile input size for the vision encoder. For example, a 1024x768 image might be split into a 2x2 grid of tiles, while a panorama 448 pixels tall and 1792 pixels wide might use a 1x4 grid (one row of four tiles).

Usage

Use this principle when processing images for a vision-language model that supports multi-tile input. It is the appropriate strategy when:

  • Input images have diverse aspect ratios and resolutions
  • Fine-grained visual detail (OCR, charts, diagrams) must be preserved
  • The vision encoder expects fixed-size tile inputs
  • A balance between visual fidelity and computational cost is needed

Theoretical Basis

The dynamic resolution algorithm works in three phases:

Phase 1: Candidate enumeration

# Generate all valid (rows, cols) configurations within min_num..max_num tiles
candidates = [(r, c) for r in range(1, max_num+1) for c in range(1, max_num+1)
              if min_num <= r * c <= max_num]
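As a quick check of the enumeration above, the following sketch wraps the comprehension in a helper and counts the layouts for two tile budgets. The helper name `enumerate_grids` and the sort by tile count are illustrative assumptions, not part of the source.

```python
def enumerate_grids(min_num: int, max_num: int) -> list[tuple[int, int]]:
    """Return every (rows, cols) grid whose tile count lies in [min_num, max_num]."""
    grids = [
        (r, c)
        for r in range(1, max_num + 1)
        for c in range(1, max_num + 1)
        if min_num <= r * c <= max_num
    ]
    # Assumption: order by tile count so smaller grids come first on ties.
    return sorted(grids, key=lambda rc: rc[0] * rc[1])


if __name__ == "__main__":
    print(len(enumerate_grids(1, 6)))   # 14 layouts
    print(len(enumerate_grids(1, 12)))  # 35 layouts
```

The candidate set grows slowly with the budget (14 layouts for up to 6 tiles, 35 for up to 12), so scoring every candidate per image is cheap.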

Phase 2: Best-fit selection

# Score each candidate by aspect-ratio match and area utilization (lower is better)
from math import log

image_aspect, image_area = width / height, width * height  # input image size in pixels
tile_area = tile_size * tile_size

def score(rows, cols):
    aspect_error = abs(log(cols / rows) - log(image_aspect))  # tile_size cancels out
    grid_area = rows * cols * tile_area
    area_ratio = min(image_area, grid_area) / max(image_area, grid_area)
    return aspect_error + (1 - area_ratio)

# Select the configuration with the minimum score
rows, cols = min(candidates, key=lambda rc: score(*rc))
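To make the selection rule concrete, here is a minimal, self-contained sketch that applies the score described above to the two example images from the Description. The function name `select_grid` and the default values (`tile_size=448`, `max_num=12`) are assumptions chosen for illustration; this follows the scoring as written on this page and is not necessarily identical to any particular library implementation.

```python
from math import log


def select_grid(width, height, tile_size=448, min_num=1, max_num=12):
    """Pick the (rows, cols) grid with the lowest aspect + area score."""
    image_aspect = width / height
    image_area = width * height
    tile_area = tile_size * tile_size

    def score(grid):
        rows, cols = grid
        # Log-ratio error treats "too wide" and "too tall" symmetrically.
        aspect_error = abs(log(cols / rows) - log(image_aspect))
        grid_area = rows * cols * tile_area
        area_ratio = min(image_area, grid_area) / max(image_area, grid_area)
        return aspect_error + (1 - area_ratio)

    candidates = [
        (r, c)
        for r in range(1, max_num + 1)
        for c in range(1, max_num + 1)
        if min_num <= r * c <= max_num
    ]
    return min(candidates, key=score)


if __name__ == "__main__":
    print(select_grid(1024, 768))  # (2, 2)
    print(select_grid(1792, 448))  # (1, 4): one row, four columns
```

Under this scoring, the area term penalizes heavy upscaling: a 3x4 grid matches the 4:3 aspect ratio of the 1024x768 image exactly, but it covers roughly three times the image's pixel area, so the 2x2 grid scores better.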

Phase 3: Tile extraction

# Resize image to the grid size (PIL takes (width, height) = (cols * tile_size, rows * tile_size)) and split into tiles
resized = image.resize((cols * tile_size, rows * tile_size))
tiles = [resized.crop((c*tile_size, r*tile_size, (c+1)*tile_size, (r+1)*tile_size))
         for r in range(rows) for c in range(cols)]
if use_thumbnail:
    tiles.append(image.resize((tile_size, tile_size)))
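The crop geometry can be checked without an image library. The sketch below computes the same crop boxes in row-major order, using the (left, upper, right, lower) box convention that `crop` expects, and counts how many images reach the encoder. The helper names `tile_boxes` and `num_encoder_inputs` are hypothetical and introduced only for illustration.

```python
def tile_boxes(rows, cols, tile_size=448):
    """Crop boxes (left, upper, right, lower) for a rows x cols grid, row-major."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


def num_encoder_inputs(rows, cols, use_thumbnail=True):
    """Grid tiles plus one optional thumbnail, mirroring the extraction step above."""
    return rows * cols + (1 if use_thumbnail else 0)


if __name__ == "__main__":
    for box in tile_boxes(2, 2):
        print(box)
    print(num_encoder_inputs(2, 2))  # 5: four tiles plus the thumbnail
```

Tiles are emitted left to right within a row and then top to bottom, so the tile at grid position (r, c) sits at index `r * cols + c`, and the thumbnail, when present, is always last.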

Related Pages

Implemented By

Uses Heuristic
