Implementation:Mlc ai Mlc llm Image Processing

Overview

The Image Processing module implements GPU-accelerated image preprocessing operations as TVM TIR (Tensor IR) scheduled kernels. It is located at python/mlc_llm/model/vision/image_processing.py (286 lines).

The ImageProcessor class provides methods for image resizing, center cropping, rescaling, normalization, and padding. Resizing delegates to op.interpolate, while the other operations are implemented as pre-scheduled TIR primitive functions that bind loops to GPU block and thread indices for parallel execution. Together they preprocess images before they are fed into vision encoders such as CLIP.

Source File

File: python/mlc_llm/model/vision/image_processing.py
Lines: 286
Module: mlc_llm.model.vision.image_processing

Dependencies

Import	Purpose
`tvm.s_tir`	TVM scheduled TIR for applying GPU schedules to primitive functions
`tvm.tir`	TVM TIR for symbolic computation (Select, div, ceil, sqrt, etc.)
`tvm.relax.frontend.nn`	Neural network module base class and tensor types
`tvm.relax.frontend.nn.op`	Operations including `interpolate` and `tensor_ir_op`
`tvm.script.tir`	TIR script decorator for defining primitive functions

Helper Function: _var

def _var(dtype, size=1):
    return T.alloc_buffer((size,), dtype, scope="local")

Allocates a small local buffer used as a thread-local variable in TIR kernels.

Class: ImageProcessor

All methods assume images in NCHW layout (batch, channels, height, width) with 3 color channels.

Method: apply_schedule

def apply_schedule(self, sch, block, bdx=32, tile=[32, 32]):
    loop_x, loop_y = sch.get_loops(block)[-2:]
    xo, xi = sch.split(loop_x, factors=[tile[0], None])
    yo, yi = sch.split(loop_y, factors=[tile[1], None])
    sch.reorder(xo, yo, xi, yi)
    t = sch.fuse(xo, yo)
    ty, tx = sch.split(t, factors=[None, bdx])
    sch.bind(ty, "threadIdx.y")
    sch.bind(tx, "threadIdx.x")

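Applies a standard GPU tiling schedule to a TIR block. The last two loops (typically height and width) are each split into an outer loop of extent tile[0] or tile[1] (32 by default) and an inner loop whose extent is inferred. The loops are reordered so that both outer loops come first, the outer loops are fused, and the fused loop is split again into threadIdx.y and threadIdx.x, with threadIdx.x given extent bdx (32 by default). The inner loops remain serial, so each thread iterates over its own patch of the spatial domain. This schedule is reused by all the image processing kernels.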
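
To make the resulting loop structure concrete, the sketch below works out the loop extents in plain Python, without TVM. It assumes that a split with factors=[k, None] yields an outer loop of extent k and an inner loop of extent ceil(n / k). The helper name schedule_extents and the keys of the returned dictionary are hypothetical and not part of the module.

```python
import math


def schedule_extents(height, width, bdx=32, tile=(32, 32)):
    """Illustrative only: mirror the split/fuse/split steps of apply_schedule."""
    xo, yo = tile                     # outer extents from factors=[tile[i], None]
    xi = math.ceil(height / xo)       # inferred inner extent over height
    yi = math.ceil(width / yo)        # inferred inner extent over width
    fused = xo * yo                   # sch.fuse(xo, yo)
    ty = math.ceil(fused / bdx)       # sch.split(t, factors=[None, bdx])
    tx = bdx
    return {
        "threadIdx.y": ty,
        "threadIdx.x": tx,
        "rows_per_thread": xi,
        "cols_per_thread": yi,
    }


extents = schedule_extents(336, 336)
```

Under these assumptions, the default arguments give a 32 by 32 arrangement of threads for a 336 by 336 image, with each thread covering a patch of up to 11 by 11 pixels.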

Method: resize

def resize(self, image: Tensor, params):

Resizes an image using bilinear interpolation via op.interpolate. Supports three parameter modes:

Mode	Parameters	Behavior
Explicit size	`height`, `width`	Resizes to exact dimensions
Shortest edge	`shortest_edge`	Scales the image so the shorter dimension matches the target, maintaining aspect ratio
HD transform	`hd_transform`	Computes optimal scale factor for high-definition processing using `hd_num` (default 4) and `pad_num` (default 336) parameters

The HD transform mode computes the scaling as:

ratio = max(w, h) / min(w, h)
scale = ceil(sqrt(hd_num * ratio))
# Adjust scale if it would exceed hd_num tiles
if (scale * ceil(scale / ratio)) > hd_num:
    scale = scale - 1
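
The scale computation above can be tried out in plain Python. This is a minimal sketch that uses Python's math module in place of the symbolic TIR expressions; the function name hd_scale is hypothetical, and the step that turns the scale into a final output size is not shown because it is not described here.

```python
import math


def hd_scale(width, height, hd_num=4):
    """Follow the HD transform scale formula for one image size."""
    ratio = max(width, height) / min(width, height)
    scale = math.ceil(math.sqrt(hd_num * ratio))
    # Step the scale down once if the tile count would exceed hd_num.
    if scale * math.ceil(scale / ratio) > hd_num:
        scale -= 1
    return scale


wide = hd_scale(1344, 336)     # ratio 4.0
square = hd_scale(672, 672)    # ratio 1.0
two_to_one = hd_scale(1000, 500)  # ratio 2.0, triggers the adjustment
```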

Method: crop

def crop(self, image: Tensor, crop_size):

Performs a center crop on the image. The crop boundaries are computed as:

top = (orig_height - crop_height) // 2
bottom = orig_height - top
left = (orig_width - crop_width) // 2
right = orig_width - left

The crop is implemented as a TIR primitive function with GPU thread bindings on batch and channel dimensions (blockIdx.x and blockIdx.y), while the spatial dimensions are processed using the tiled schedule.
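
The boundary arithmetic can be checked on a small example. The sketch below applies the formulas above to a single height-by-width plane stored as a list of rows; the names center_crop_box and center_crop are hypothetical, and plain list slicing stands in for the GPU kernel.

```python
def center_crop_box(orig_height, orig_width, crop_height, crop_width):
    """Compute (top, bottom, left, right) using the formulas from the text."""
    top = (orig_height - crop_height) // 2
    bottom = orig_height - top
    left = (orig_width - crop_width) // 2
    right = orig_width - left
    return top, bottom, left, right


def center_crop(plane, crop_height, crop_width):
    """Crop one plane given as a list of equally long rows."""
    top, bottom, left, right = center_crop_box(
        len(plane), len(plane[0]), crop_height, crop_width
    )
    return [row[left:right] for row in plane[top:bottom]]


# A 6 x 6 plane whose entries encode their own position.
plane = [[r * 6 + c for c in range(6)] for r in range(6)]
cropped = center_crop(plane, 4, 4)
```

Taken at face value, these formulas give bottom - top equal to crop_height only when orig_height - crop_height is even; for an odd difference the span is one row larger. How the kernel treats that case is not covered here.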

Method: rescale

def rescale(self, image: Tensor, rescale_factor=1/255.0, o_dtype="float32"):

Multiplies each pixel value by a rescale factor (default 1/255.0 to convert from [0, 255] to [0, 1]) and casts to the output dtype. Implemented as a GPU TIR kernel:

out_buf[n_idx, c_idx, h_idx, w_idx] = (
    T.cast(image_buf[n_idx, c_idx, h_idx, w_idx], o_dtype)
    * rescale_factor
)
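
For reference, the same element-wise arithmetic can be written in plain Python. This is a sketch on nested lists in NCHW order, not the TIR kernel itself; the function name rescale_reference is hypothetical, and the cast to the output dtype is approximated with float().

```python
def rescale_reference(image, rescale_factor=1 / 255.0):
    """Cast each pixel to float and multiply by rescale_factor (NCHW nested lists)."""
    return [
        [
            [[float(px) * rescale_factor for px in row] for row in channel]
            for channel in img
        ]
        for img in image
    ]


# One image, three channels, one row of three pixels.
image = [[[[0, 128, 255]] for _ in range(3)]]
scaled = rescale_reference(image)
```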

Method: normalize

def normalize(self, image: Tensor, o_dtype="float32"):

Applies per-channel mean subtraction and standard deviation normalization using the standard CLIP normalization constants (these differ slightly from the commonly used ImageNet constants):

Channel	Mean	Std Dev
R (channel 0)	0.48145466	0.26862954
G (channel 1)	0.4578275	0.26130258
B (channel 2)	0.40821073	0.27577711

The formula applied per pixel is:

output[n, c, h, w] = (input[n, c, h, w] - mean[c]) / stddev[c]

The mean and standard deviation values are initialized inside the TIR kernel using a T.init() block and stored in thread-local buffers.
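
The per-pixel formula can be reproduced in plain Python using the constants from the table. This is a sketch on nested lists in NCHW order rather than the TIR kernel; the names CLIP_MEAN, CLIP_STD, and normalize_reference are hypothetical.

```python
# Per-channel constants copied from the table above (R, G, B).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_reference(image):
    """Apply (x - mean[c]) / stddev[c] to every pixel of an NCHW nested list."""
    return [
        [
            [[(px - CLIP_MEAN[c]) / CLIP_STD[c] for px in row] for row in channel]
            for c, channel in enumerate(img)
        ]
        for img in image
    ]


# One image, three channels, one row holding the channel mean and 1.0.
image = [[[[CLIP_MEAN[c], 1.0]] for c in range(3)]]
normalized = normalize_reference(image)
```

A pixel equal to its channel mean maps to 0, and values above the mean map to positive outputs, so the input is expected to be rescaled to [0, 1] beforehand.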

Method: pad

def pad(self, image: Tensor, dtype="uint8"):

Pads the image vertically (top and bottom) so that the height becomes a multiple of 336 pixels. The padding value is 255 (white). Horizontal padding is set to zero (no padding).

h = image.shape[2]
tar = tir.truncdiv(h + 335, 336) * 336
t = tir.div(tar - h, 2)        # top padding
b = tar - h - t                  # bottom padding

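The padding is split between top and bottom: the top receives the truncated half of the extra rows and the bottom receives the remainder, so the bottom gets one more row when the total is odd. Padded regions are filled with the value 255, while non-padded regions copy from the original image.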
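
The padding arithmetic is easy to verify with ordinary integers. The sketch below mirrors the expressions above using Python integer division; the function name vertical_padding and its multiple parameter are hypothetical (the text describes a fixed value of 336).

```python
def vertical_padding(height, multiple=336):
    """Return (target_height, top, bottom) for padding height up to a multiple."""
    target = (height + multiple - 1) // multiple * multiple
    top = (target - height) // 2       # truncated half goes on top
    bottom = target - height - top     # remainder goes on the bottom
    return target, top, bottom


even_case = vertical_padding(500)   # 172 extra rows
odd_case = vertical_padding(337)    # 335 extra rows
exact_case = vertical_padding(336)  # already a multiple
```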

Method: preprocess

def preprocess(self, pixel_values):
    return pixel_values

A pass-through method that returns the input unchanged. This serves as a hook for subclasses to override with model-specific preprocessing pipelines.

GPU Kernel Design Pattern

All TIR kernels in this module share a common structure:

Thread binding: Batch dimension bound to blockIdx.x, channel dimension bound to blockIdx.y.
Spatial processing: Height and width dimensions processed in a grid loop within a schedulable block (T.sblock).
Tiled scheduling: The apply_schedule method tiles the spatial dimensions and binds them to thread indices.
Attributes: All kernels are marked with tir.is_scheduled = 1 (pre-scheduled, no further auto-scheduling) and tir.noalias = True.
