Implementation:Mit han lab Llm awq InternVL Media Processing

Knowledge Sources	Mit_han_lab_Llm_awq
Domains	Vision, Preprocessing
Last Updated	2026-02-15 00:00 GMT

Overview

Concrete tool for loading and preprocessing images and videos into the tiled patch format expected by the InternViT vision encoder.

Description

Provides the utility functions behind the InternVL3 media pipeline. find_closest_aspect_ratio selects a tile grid (e.g., 2x3) from the allowed configurations by minimizing the difference between the grid's aspect ratio and the image's, taking image area into account. dynamic_preprocess uses that grid to resize the image and crop it into equal-sized square tiles, optionally appending a thumbnail of the full image. build_transform creates the per-tile transform: a bicubic resize to input_size followed by ImageNet mean/std normalization. load_image chains these steps and returns the tiles as one stacked tensor. get_index generates uniformly spaced frame indices, and load_video uses the decord library to read those frames from a video file, applying the same dynamic preprocessing to each frame.
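
The grid selection described above can be sketched in plain Python. This is an illustrative sketch, not the repository's code: candidate_grids and closest_grid are hypothetical names, and the tie-break rule (prefer the larger grid when the image area exceeds half the grid's pixel area) is an assumption about what "considering area" means in practice.

```python
from typing import List, Tuple


def candidate_grids(min_num: int = 1, max_num: int = 12) -> List[Tuple[int, int]]:
    """Enumerate (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = [
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    ]
    # Smaller grids first, so larger ones only win on a strictly better
    # ratio or via the area tie-break below.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def closest_grid(
    width: int,
    height: int,
    grids: List[Tuple[int, int]],
    image_size: int = 448,
) -> Tuple[int, int]:
    """Pick the grid whose cols/rows ratio is closest to width/height."""
    aspect = width / height
    area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # Assumed tie-break: a big image justifies more tiles.
            if area > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best


grids = candidate_grids()
print(closest_grid(800, 600, grids))    # a 4:3 image
print(closest_grid(500, 500, grids))    # a small square image
print(closest_grid(2000, 2000, grids))  # a large square image
```

With these assumptions, a small square image stays as a single tile while a large square image is split into a 3x3 grid, because the area tie-break only promotes a larger grid when the source has enough pixels to fill it.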

Usage

Import load_image and load_video when preparing media inputs for InternVL3 inference. These functions handle the complete pipeline from file path to tensor.

Code Reference

Source Location

Repository: Mit_han_lab_Llm_awq
File: tinychat/models/internvl/media.py
Lines: 1-113

Signature

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size) -> torchvision.transforms.Compose:
    """Create ImageNet normalization pipeline with bicubic resize."""

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> Tuple[int, int]:
    """Find aspect ratio closest to target considering area."""

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False) -> List[PIL.Image]:
    """Split image into tiles based on aspect ratio."""

def load_image(image_file, input_size=448, max_num=12) -> torch.Tensor:
    """Load image, apply dynamic tiling, and return stacked tensor."""

def get_index(bound, fps, max_frame, first_idx=0, num_segments=32) -> np.ndarray:
    """Generate frame indices for uniform video sampling."""

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32) -> Tuple[List[torch.Tensor], List[int]]:
    """Load video, sample frames, and apply dynamic preprocessing."""

Import

from tinychat.models.internvl.media import load_image, load_video

I/O Contract

Inputs

Name	Type	Required	Description
image_file	str	Yes (for load_image)	Path to image file
video_path	str	Yes (for load_video)	Path to video file
input_size	int	No	Target patch resolution (default: 448)
max_num	int	No	Maximum number of tiles/patches (default: 12 for images, 1 for video frames)
num_segments	int	No	Number of video frames to sample (default: 32)
use_thumbnail	bool	No	Whether to append a thumbnail of the full image; a dynamic_preprocess parameter (default: False)

Outputs

Name	Type	Description
load_image returns	torch.Tensor	Stacked patch tensor of shape (num_patches, 3, input_size, input_size)
load_video returns	Tuple[List[torch.Tensor], List[int]]	List of frame tensors and per-frame patch counts
dynamic_preprocess returns	List[PIL.Image]	List of cropped image patches
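
To picture what the patch list from dynamic_preprocess contains, the sketch below computes the crop boxes for a chosen grid. It assumes the image has already been resized to (cols * image_size, rows * image_size) and that tiles are taken in row-major order; tile_boxes is a hypothetical helper, not a function from media.py.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), as PIL's Image.crop expects


def tile_boxes(cols: int, rows: int, image_size: int = 448) -> List[Box]:
    """Return one square crop box per tile, in row-major order."""
    boxes = []
    for i in range(cols * rows):
        col, row = i % cols, i // cols
        left, top = col * image_size, row * image_size
        boxes.append((left, top, left + image_size, top + image_size))
    return boxes


# A 2x3 grid (2 columns, 3 rows) yields six 448x448 tiles.
for box in tile_boxes(2, 3):
    print(box)
```

Each box could be passed to PIL's Image.crop on the resized image to obtain one patch.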

Usage Examples

Load and Process Image

from tinychat.models.internvl.media import load_image

# Load image with dynamic tiling (up to 12 patches)
pixel_values = load_image("photo.jpg", input_size=448, max_num=12)
# pixel_values shape: (num_patches, 3, 448, 448)
pixel_values = pixel_values.cuda().half()
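
For intuition about the values inside pixel_values, the following sketch applies ImageNet normalization to a single RGB pixel by hand. It assumes the transform first scales 8-bit channels to [0, 1] (as torchvision's ToTensor does) and then subtracts the mean and divides by the standard deviation per channel; normalize_rgb is a hypothetical helper used only for illustration.

```python
from typing import Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_rgb(pixel: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Scale an 8-bit RGB pixel to [0, 1], then apply (x - mean) / std per channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )


print(normalize_rgb((255, 255, 255)))  # white pixel: all channels positive
print(normalize_rgb((0, 0, 0)))        # black pixel: all channels negative
```

Under this assumption the tensor values are roughly centered on zero rather than lying in [0, 255], which is why the output is not directly viewable as an image without undoing the normalization.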

Load and Process Video

from tinychat.models.internvl.media import load_video

# Load video with 32 uniformly sampled frames
pixel_values_list, num_patches_list = load_video(
    "video.mp4", input_size=448, max_num=1, num_segments=32
)
# Each element of pixel_values_list: (num_patches, 3, 448, 448)
# num_patches_list holds the matching per-frame patch counts
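
Uniform frame sampling, as used in the video example above, can be sketched as follows. This is a minimal sketch under stated assumptions: the clip is divided into num_segments equal segments and the frame at the midpoint of each segment is taken. sample_frame_indices is a hypothetical stand-in; the repository's get_index may round or clamp differently.

```python
from typing import List, Optional, Tuple


def sample_frame_indices(
    max_frame: int,
    fps: float,
    num_segments: int = 32,
    bound: Optional[Tuple[float, float]] = None,
    first_idx: int = 0,
) -> List[int]:
    """Pick one frame index from the middle of each of num_segments equal segments."""
    if bound is not None:
        # bound is (start_seconds, end_seconds); convert to frame indices.
        start_idx = max(first_idx, round(bound[0] * fps))
        end_idx = min(round(bound[1] * fps), max_frame)
    else:
        start_idx, end_idx = first_idx, max_frame
    seg_size = (end_idx - start_idx) / num_segments
    return [int(start_idx + seg_size * (i + 0.5)) for i in range(num_segments)]


# Whole clip: last frame index 100, 4 samples.
print(sample_frame_indices(max_frame=100, fps=25.0, num_segments=4))
# Only seconds 1.0 to 3.0 of the same clip.
print(sample_frame_indices(max_frame=100, fps=25.0, num_segments=4, bound=(1.0, 3.0)))
```

Sampling segment midpoints rather than segment starts avoids always picking the very first frame, which is often a black or title frame.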

Related Pages

Principle:Mit_han_lab_Llm_awq_Dynamic_Image_Video_Preprocessing
