

Implementation:Mit han lab Llm awq InternVL Media Processing

From Leeroopedia
Knowledge Sources
Domains Vision, Preprocessing
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for loading and preprocessing images and videos into the tiled patch format expected by the InternViT vision encoder.

Description

Provides utility functions for the InternVL3 media pipeline. find_closest_aspect_ratio selects the tile grid (e.g., 2x3) from the allowed configurations whose aspect ratio is closest to that of the input image, taking image area into account. dynamic_preprocess resizes the image to fit that grid, crops it into equal-sized tiles, and optionally appends a thumbnail of the full image. build_transform creates the tensor transform: a bicubic resize followed by normalization with the ImageNet mean and standard deviation. load_image combines these steps and returns the tiles as one stacked tensor. get_index generates uniformly spaced frame indices, and load_video uses the decord library to read those frames from a video file, applying the same dynamic preprocessing to each frame.
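
The grid selection and cropping described above can be sketched in plain Python. This is a minimal illustration, not the module's actual code: the helper names candidate_grids, closest_grid, and tile_boxes are hypothetical, and the area-based tie-break (prefer the larger grid when the image covers more than half of that grid's pixel area) is an assumption about what "considering area" means.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (cols, rows)


def candidate_grids(min_num: int = 1, max_num: int = 12) -> List[Grid]:
    """Enumerate (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count, then by column count, so iteration order is deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g[0]))


def closest_grid(width: int, height: int, grids: List[Grid], image_size: int = 448) -> Grid:
    """Pick the grid whose cols/rows ratio is closest to width/height."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # Assumed tie-break: take the larger grid only if the image is big
            # enough to fill at least half of that grid's pixel area.
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best


def tile_boxes(cols: int, rows: int, image_size: int = 448) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, upper, right, lower) for an image resized to the grid."""
    return [
        (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
        for r in range(rows)
        for c in range(cols)
    ]


grids = candidate_grids(1, 12)
cols, rows = closest_grid(1344, 896, grids)  # a 3:2 landscape image
boxes = tile_boxes(cols, rows)
```

Each box has the (left, upper, right, lower) layout that PIL's Image.crop accepts, so the boxes could be applied to an image resized to (cols * image_size, rows * image_size). The optional thumbnail is not modeled here.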

Usage

Import load_image and load_video when preparing media inputs for InternVL3 inference. These functions handle the complete pipeline from file path to tensor.

Code Reference

Source Location

Signature

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size) -> torchvision.transforms.Compose:
    """Create ImageNet normalization pipeline with bicubic resize."""

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> Tuple[int, int]:
    """Find aspect ratio closest to target considering area."""

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False) -> List[PIL.Image]:
    """Split image into tiles based on aspect ratio."""

def load_image(image_file, input_size=448, max_num=12) -> torch.Tensor:
    """Load image, apply dynamic tiling, and return stacked tensor."""

def get_index(bound, fps, max_frame, first_idx=0, num_segments=32) -> np.ndarray:
    """Generate frame indices for uniform video sampling."""

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32) -> Tuple[List[torch.Tensor], List[int]]:
    """Load video, sample frames, and apply dynamic preprocessing."""

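Uniform frame sampling of the kind get_index performs can be sketched as follows. This is an illustrative approximation rather than the function's real body: the helper name sample_frame_indices is hypothetical, bound is assumed to be a (start_seconds, end_seconds) pair, and taking the midpoint of each segment is one common convention, not necessarily the one used in the module.

```python
from typing import List, Optional, Tuple


def sample_frame_indices(
    max_frame: int,
    fps: float,
    num_segments: int = 32,
    bound: Optional[Tuple[float, float]] = None,
    first_idx: int = 0,
) -> List[int]:
    """Return one frame index from the middle of each of num_segments equal segments."""
    if bound is not None:
        # Assumption: bound is given in seconds and converted to frame indices via fps.
        start_idx = max(first_idx, round(bound[0] * fps))
        end_idx = min(round(bound[1] * fps), max_frame)
    else:
        start_idx, end_idx = first_idx, max_frame
    seg_size = (end_idx - start_idx) / num_segments
    # int() truncates, so each index is the frame at or just before the segment midpoint.
    return [int(start_idx + seg_size * (i + 0.5)) for i in range(num_segments)]


whole_clip = sample_frame_indices(max_frame=319, fps=30.0, num_segments=4)
sub_clip = sample_frame_indices(max_frame=319, fps=30.0, num_segments=4, bound=(2.0, 4.0))
```

Sampling segment midpoints rather than segment starts avoids always picking the very first frame and spreads the samples evenly across the clip.
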
Import

from tinychat.models.internvl.media import load_image, load_video

I/O Contract

Inputs

Name | Type | Required | Description
image_file | str | Yes (for load_image) | Path to image file
video_path | str | Yes (for load_video) | Path to video file
input_size | int | No | Target patch resolution in pixels (default: 448)
max_num | int | No | Maximum number of tiles per image or video frame (default: 12 for load_image, 1 for load_video)
num_segments | int | No | Number of video frames to sample (default: 32)
use_thumbnail | bool | No | dynamic_preprocess only: whether to append a thumbnail of the full image (default: False)

Outputs

Name | Type | Description
load_image returns | torch.Tensor | Stacked patch tensor of shape (num_patches, 3, input_size, input_size)
load_video returns | Tuple[List[torch.Tensor], List[int]] | List of frame tensors and per-frame patch counts
dynamic_preprocess returns | List[PIL.Image] | List of cropped image patches

Usage Examples

Load and Process Image

from tinychat.models.internvl.media import load_image

# Load image with dynamic tiling (up to 12 patches)
pixel_values = load_image("photo.jpg", input_size=448, max_num=12)
# pixel_values shape: (num_patches, 3, 448, 448)
# Move to the GPU and cast to float16 (assumes a CUDA device is available)
pixel_values = pixel_values.cuda().half()

Load and Process Video

from tinychat.models.internvl.media import load_video

# Load video with 32 uniformly sampled frames
pixel_values_list, num_patches_list = load_video(
    "video.mp4", input_size=448, max_num=1, num_segments=32
)
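# pixel_values_list: one tensor per frame, each of shape (num_patches, 3, 448, 448)
# num_patches_list: the number of patches in each frame tensor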
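
If the per-frame tensors are later concatenated along the first dimension (for example with torch.cat), the per-frame patch counts are what let you recover which rows belong to which frame. The sketch below shows that bookkeeping in plain Python; frame_slices is a hypothetical helper, not part of the module, and the sample counts are made up for illustration.

```python
from itertools import accumulate
from typing import List, Tuple


def frame_slices(num_patches_list: List[int]) -> List[Tuple[int, int]]:
    """Map per-frame patch counts to (start, end) row ranges in a concatenated tensor."""
    ends = list(accumulate(num_patches_list))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Illustrative counts: two single-patch frames followed by a three-patch frame.
ranges = frame_slices([1, 1, 3])
```

With max_num=1 every frame yields a single patch, so the ranges reduce to (0, 1), (1, 2), and so on; the helper only matters when frames are tiled into several patches.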
