Implementation:Mit han lab Llm awq InternVL Media Processing
| Knowledge Sources | |
|---|---|
| Domains | Vision, Preprocessing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for loading and preprocessing images and videos into the tiled patch format expected by the InternViT vision encoder.
Description
Provides utility functions for the InternVL3 media pipeline:
- find_closest_aspect_ratio selects the tile grid whose aspect ratio differs least from that of the input image.
- dynamic_preprocess uses that selection to pick a grid layout (e.g., 2x3) from the allowed configurations, then crops the resized image into equal-sized tiles, optionally appending a thumbnail of the full image.
- build_transform creates a standard preprocessing pipeline: bicubic resize followed by ImageNet mean/std normalization.
- load_image loads an image file, applies dynamic tiling, and returns the tiles as a single stacked tensor.
- get_index generates the frame indices used for uniform video sampling.
- load_video uses the decord library to extract uniformly sampled frames from a video file and applies the same dynamic preprocessing to each frame.
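The grid selection and tiling described above can be sketched in plain Python. This is a minimal illustration modeled on the publicly documented InternVL preprocessing recipe, not a copy of media.py: the helper names candidate_grids, closest_grid, and tile_boxes are hypothetical, and the area-based tie-break between equally close grids is an assumption.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Visit smaller grids first so that ties initially favor fewer tiles.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def closest_grid(width, height, grids, image_size=448):
    """Pick the grid whose aspect ratio is nearest to width / height."""
    aspect = width / height
    area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size**2 * cols * rows:
            # Assumed tie-break: a sufficiently large image takes the bigger grid.
            best = (cols, rows)
    return best


def tile_boxes(cols, rows, image_size=448):
    """Crop boxes (left, top, right, bottom) on the resized image, row-major."""
    boxes = []
    for i in range(cols * rows):
        col, row = i % cols, i // cols
        boxes.append((col * image_size, row * image_size,
                      (col + 1) * image_size, (row + 1) * image_size))
    return boxes


grids = candidate_grids(max_num=12)
print(closest_grid(800, 600, grids))  # (4, 3): an exact 4:3 match
print(tile_boxes(2, 1))               # two side-by-side 448x448 boxes
```

In this sketch the image would first be resized to (cols * image_size, rows * image_size) and then cropped with the returned boxes, which is why every tile has the same size.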
Usage
Import load_image and load_video when preparing media inputs for InternVL3 inference. These functions handle the complete pipeline from file path to tensor.
Code Reference
Source Location
- Repository: Mit_han_lab_Llm_awq
- File: tinychat/models/internvl/media.py
- Lines: 1-113
Signature
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size) -> torchvision.transforms.Compose:
    """Create ImageNet normalization pipeline with bicubic resize."""

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size) -> Tuple[int, int]:
    """Find aspect ratio closest to target considering area."""

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False) -> List[PIL.Image]:
    """Split image into tiles based on aspect ratio."""

def load_image(image_file, input_size=448, max_num=12) -> torch.Tensor:
    """Load image, apply dynamic tiling, and return stacked tensor."""

def get_index(bound, fps, max_frame, first_idx=0, num_segments=32) -> np.ndarray:
    """Generate frame indices for uniform video sampling."""

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32) -> Tuple[List[torch.Tensor], List[int]]:
    """Load video, sample frames, and apply dynamic preprocessing."""
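To make the uniform sampling behind get_index concrete, the following sketch picks one frame near the middle of each of num_segments equal-length segments. It is an illustration under stated assumptions rather than the repository's code: uniform_frame_indices is a hypothetical name, bound is assumed to be a (start_seconds, end_seconds) pair or None, max_frame is assumed to be the index of the last frame, and a plain list is returned instead of an np.ndarray.

```python
def uniform_frame_indices(bound, fps, max_frame, first_idx=0, num_segments=32):
    """Return one frame index per segment, taken near each segment's midpoint."""
    if bound:
        # Convert the time window (in seconds) to frame indices and clamp it.
        start_idx = max(first_idx, round(bound[0] * fps))
        end_idx = min(round(bound[1] * fps), max_frame)
    else:
        # No window given: sample across the whole clip.
        start_idx, end_idx = first_idx, max_frame
    seg_size = (end_idx - start_idx) / num_segments
    return [
        int(start_idx + seg_size / 2 + round(seg_size * i))
        for i in range(num_segments)
    ]


print(uniform_frame_indices(None, fps=25.0, max_frame=100, num_segments=4))
# [12, 37, 62, 87]
print(uniform_frame_indices((1.0, 2.0), fps=10.0, max_frame=100, num_segments=2))
# [12, 17]
```

Sampling segment midpoints rather than segment starts keeps the selected frames away from the very first and last frames of the clip.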
Import
from tinychat.models.internvl.media import load_image, load_video
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image_file | str | Yes (for load_image) | Path to image file |
| video_path | str | Yes (for load_video) | Path to video file |
| input_size | int | No | Target patch resolution (default: 448) |
| max_num | int | No | Maximum number of tiles/patches (default: 12 for images, 1 for video frames) |
| num_segments | int | No | Number of video frames to sample (default: 32) |
| use_thumbnail | bool | No | Whether to append a thumbnail of the full image (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| load_image returns | torch.Tensor | Stacked patch tensor of shape (num_patches, 3, input_size, input_size) |
| load_video returns | Tuple[List[torch.Tensor], List[int]] | List of frame tensors and per-frame patch counts |
| dynamic_preprocess returns | List[PIL.Image] | List of cropped image patches |
Usage Examples
Load and Process Image
from tinychat.models.internvl.media import load_image
# Load image with dynamic tiling (up to 12 patches)
pixel_values = load_image("photo.jpg", input_size=448, max_num=12)
# pixel_values shape: (num_patches, 3, 448, 448)
pixel_values = pixel_values.cuda().half()
Load and Process Video
from tinychat.models.internvl.media import load_video
# Load video with 32 uniformly sampled frames
pixel_values_list, num_patches_list = load_video(
    "video.mp4", input_size=448, max_num=1, num_segments=32
)
# Each element: (num_patches, 3, 448, 448)
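If the per-frame tensors are later concatenated along the patch dimension (for example with torch.cat), the counts in num_patches_list are enough to recover which rows belong to which frame. A minimal sketch in plain Python, where frame_slices is a hypothetical helper and not part of media.py:

```python
from itertools import accumulate


def frame_slices(num_patches_list):
    """Map per-frame patch counts to (start, end) row ranges, end exclusive."""
    ends = list(accumulate(num_patches_list))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# With max_num=1 every frame contributes exactly one patch.
print(frame_slices([1, 1, 1]))  # [(0, 1), (1, 2), (2, 3)]
# With larger max_num the counts can differ from frame to frame.
print(frame_slices([3, 1, 2]))  # [(0, 3), (3, 4), (4, 6)]
```

Each (start, end) pair can then be used to slice the concatenated tensor, e.g. rows start:end for one frame.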