Principle:Mit han lab Llm awq Dynamic Image Video Preprocessing

Knowledge Sources	InternVL
Domains	Vision, Preprocessing
Last Updated	2026-02-15 00:00 GMT

Overview

Dynamically tile images into aspect-ratio-aware patches and uniformly sample video frames so that both can be processed by a vision transformer.

Description

Dynamic image preprocessing adapts to varying image aspect ratios by selecting, from a set of allowed configurations, the tile grid (e.g., 2x3 or 1x4) whose aspect ratio is closest to the image's, resizing the image to fit that grid, and splitting it into equal-sized patches. This preserves more spatial information than naive center-cropping, which discards content, or stretching to a square, which distorts it. For videos, frames are sampled at uniform temporal intervals and each sampled frame is preprocessed individually. All patches are normalized with the ImageNet per-channel mean and standard deviation.
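The two steps that apply to every input, uniform temporal sampling and ImageNet normalization, can be sketched in plain Python. This is a minimal illustration, not the implementation used by InternVL or llm-awq: the function names `sample_frame_indices` and `normalize_pixel` are hypothetical, and sampling at the midpoint of each temporal segment is an assumed convention. The mean and standard deviation values are the standard ImageNet statistics.

```python
from typing import List, Tuple

# Standard ImageNet per-channel statistics (RGB order, inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def sample_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick num_frames indices spread uniformly over a video.

    Assumption: the video is split into num_frames equal segments and the
    frame at the midpoint of each segment is taken.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Fewer frames than requested: use every frame once.
        return list(range(total_frames))
    segment = total_frames / num_frames
    return [int((i + 0.5) * segment) for i in range(num_frames)]


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Scale an 8-bit RGB pixel to [0, 1], then apply ImageNet normalization."""
    r, g, b = (
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
    return (r, g, b)
```

In practice the normalization is applied to whole tensors rather than single pixels; the per-pixel form is used here only to keep the sketch free of dependencies.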

Usage

Apply this principle when preparing media inputs for vision transformers that process fixed-size patch sequences, especially when input images have diverse aspect ratios.

Theoretical Basis

Given an image of width w and height h with aspect ratio r = w/h, find the grid (m, n), with m columns and n rows, that minimizes |m/n - r| subject to m*n <= max_patches. Resize the image to width m * patch_size and height n * patch_size, then crop it into m*n non-overlapping patches of size (patch_size, patch_size).
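The grid search and cropping geometry can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `candidate_grids`, `closest_grid`, and `tile_boxes` are hypothetical, m is taken to be the number of columns and n the number of rows, and ties in |m/n - r| are broken in favour of the grid with fewer tiles, which is a simplification that may differ from the tie-breaking rule in the referenced implementation.

```python
from typing import List, Tuple

Grid = Tuple[int, int]           # (m columns, n rows)
Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def candidate_grids(max_patches: int) -> List[Grid]:
    """Enumerate every (m, n) with m * n <= max_patches, fewest tiles first."""
    grids = [
        (m, n)
        for m in range(1, max_patches + 1)
        for n in range(1, max_patches + 1)
        if m * n <= max_patches
    ]
    # sorted() is stable, so min() below resolves ties toward fewer tiles.
    return sorted(grids, key=lambda g: g[0] * g[1])


def closest_grid(width: int, height: int, max_patches: int) -> Grid:
    """Return the grid whose aspect ratio m/n is closest to width/height."""
    r = width / height
    return min(candidate_grids(max_patches), key=lambda g: abs(g[0] / g[1] - r))


def tile_boxes(grid: Grid, patch_size: int) -> List[Box]:
    """Crop boxes, in row-major order, for an image already resized to
    (m * patch_size) wide by (n * patch_size) high."""
    m, n = grid
    return [
        (col * patch_size, row * patch_size,
         (col + 1) * patch_size, (row + 1) * patch_size)
        for row in range(n)
        for col in range(m)
    ]
```

For example, an 800x600 image with max_patches = 6 has r = 4/3, and the closest allowed ratio is 3/2, so the image would be resized to 3 * patch_size by 2 * patch_size and cut into six tiles. The boxes use the (left, upper, right, lower) convention, so they could be passed to an image library's crop routine.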

Related Pages
