Implementation:Mlc ai Mlc llm Image Processing
Overview
The Image Processing module implements GPU-accelerated image preprocessing operations as TVM TIR (Tensor IR) scheduled kernels. It is located at python/mlc_llm/model/vision/image_processing.py (286 lines).
The ImageProcessor class provides methods for image resizing, center cropping, rescaling, normalization, and padding -- all implemented as GPU-scheduled TIR primitive functions. These operations run on the GPU using thread binding for parallel execution and are designed to preprocess images before feeding them into vision encoders like CLIP.
Source File
- File: python/mlc_llm/model/vision/image_processing.py
- Lines: 286
- Module: mlc_llm.model.vision.image_processing
Dependencies
| Import | Purpose |
|---|---|
| tvm.s_tir | TVM scheduled TIR for applying GPU schedules to primitive functions |
| tvm.tir | TVM TIR for symbolic computation (Select, div, ceil, sqrt, etc.) |
| tvm.relax.frontend.nn | Neural network module base class and tensor types |
| tvm.relax.frontend.nn.op | Operations including interpolate and tensor_ir_op |
| tvm.script.tir | TIR script decorator for defining primitive functions |
Helper Function: _var
def _var(dtype, size=1):
    return T.alloc_buffer((size,), dtype, scope="local")
Allocates a small local buffer used as a thread-local variable in TIR kernels.
Class: ImageProcessor
All methods assume images in NCHW layout (batch, channels, height, width) with 3 color channels.
Method: apply_schedule
def apply_schedule(self, sch, block, bdx=32, tile=[32, 32]):
    loop_x, loop_y = sch.get_loops(block)[-2:]
    xo, xi = sch.split(loop_x, factors=[tile[0], None])
    yo, yi = sch.split(loop_y, factors=[tile[1], None])
    sch.reorder(xo, yo, xi, yi)
    t = sch.fuse(xo, yo)
    ty, tx = sch.split(t, factors=[None, bdx])
    sch.bind(ty, "threadIdx.y")
    sch.bind(tx, "threadIdx.x")
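Applies a standard GPU tiling schedule to a TIR block. The last two loops (typically height and width) are each split into an outer loop of extent 32 (the tile factor) and an inner loop whose extent is inferred. The loops are reordered so that both outer loops come first, the outer loops are fused, and the fused loop is split again by bdx so that the resulting pair binds to threadIdx.y and threadIdx.x. This schedule is reused by all the image processing kernels.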
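The loop extents produced by this schedule can be worked out by hand. The following is a minimal sketch in plain Python, not TVM code: `schedule_extents` is a hypothetical helper, and it assumes that a `None` factor in `sch.split` is inferred by ceiling division of the loop extent.

```python
import math


def schedule_extents(height, width, bdx=32, tile=(32, 32)):
    """Mirror the split/fuse/split arithmetic of apply_schedule (illustrative only)."""
    # sch.split(loop, factors=[tile, None]): outer extent is fixed, inner is inferred.
    xo, xi = tile[0], math.ceil(height / tile[0])
    yo, yi = tile[1], math.ceil(width / tile[1])
    # sch.fuse(xo, yo) followed by sch.split(t, factors=[None, bdx]).
    fused = xo * yo
    ty, tx = math.ceil(fused / bdx), bdx
    # xi and yi stay as serial loops executed by each thread.
    return {"threadIdx.y": ty, "threadIdx.x": tx, "per_thread": (xi, yi)}


print(schedule_extents(336, 336))
```

Under these assumptions, the default arguments give a 32 x 32 thread arrangement regardless of image size, and larger images increase the amount of serial work per thread rather than the number of threads.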
Method: resize
def resize(self, image: Tensor, params):
Resizes an image using bilinear interpolation via op.interpolate. Supports three parameter modes:
| Mode | Parameters | Behavior |
|---|---|---|
| Explicit size | height, width | Resizes to exact dimensions |
| Shortest edge | shortest_edge | Scales the image so the shorter dimension matches the target, maintaining aspect ratio |
| HD transform | hd_transform | Computes optimal scale factor for high-definition processing using hd_num (default 4) and pad_num (default 336) parameters |
The HD transform mode computes the scaling as:
ratio = max(w, h) / min(w, h)
scale = ceil(sqrt(hd_num * ratio))
# Adjust scale if it would exceed hd_num tiles
if (scale * ceil(scale / ratio)) > hd_num:
    scale = scale - 1
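To make the arithmetic concrete, here is a plain-Python sketch of the same calculation. `hd_scale` is a hypothetical name, and the sketch follows the single `if` adjustment shown above rather than the TIR expressions used in the module.

```python
import math


def hd_scale(width, height, hd_num=4):
    """Scale factor for the HD transform, following the pseudocode above."""
    ratio = max(width, height) / min(width, height)
    scale = math.ceil(math.sqrt(hd_num * ratio))
    # Step back by one if the implied tile count would exceed hd_num.
    if scale * math.ceil(scale / ratio) > hd_num:
        scale -= 1
    return scale


for size in [(336, 336), (672, 336), (1344, 336)]:
    print(size, hd_scale(*size))
```

For a 2:1 image the initial estimate is ceil(sqrt(8)) = 3, which would imply 3 * 2 = 6 tiles, so the adjustment reduces the scale to 2.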
Method: crop
def crop(self, image: Tensor, crop_size):
Performs a center crop on the image. The crop boundaries are computed as:
top = (orig_height - crop_height) // 2
bottom = orig_height - top
left = (orig_width - crop_width) // 2
right = orig_width - left
The crop is implemented as a TIR primitive function with GPU thread bindings on batch and channel dimensions (blockIdx.x and blockIdx.y), while the spatial dimensions are processed using the tiled schedule.
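The boundary formulas can be checked with a small CPU-side sketch. The helpers below are hypothetical and operate on a single channel stored as a list of rows; they only illustrate the index arithmetic, not the GPU kernel.

```python
def center_crop_bounds(orig_height, orig_width, crop_height, crop_width):
    """Return (top, bottom, left, right) using the formulas above."""
    top = (orig_height - crop_height) // 2
    bottom = orig_height - top
    left = (orig_width - crop_width) // 2
    right = orig_width - left
    return top, bottom, left, right


def center_crop(rows, crop_height, crop_width):
    """Crop one channel given as a list of equally long rows."""
    top, bottom, left, right = center_crop_bounds(
        len(rows), len(rows[0]), crop_height, crop_width
    )
    return [row[left:right] for row in rows[top:bottom]]


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
print(center_crop(grid, 2, 2))
```

One consequence of these formulas is that bottom - top equals crop_height only when orig_height - crop_height is even. With an odd difference (for example, 481 rows cropped to 336), the window is one row taller than requested.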
Method: rescale
def rescale(self, image: Tensor, rescale_factor=1/255.0, o_dtype="float32"):
Multiplies each pixel value by a rescale factor (default 1/255.0 to convert from [0, 255] to [0, 1]) and casts to the output dtype. Implemented as a GPU TIR kernel:
out_buf[n_idx, c_idx, h_idx, w_idx] = (
    T.cast(image_buf[n_idx, c_idx, h_idx, w_idx], o_dtype)
    * rescale_factor
)
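A CPU reference for the same element-wise operation, written over nested lists in NCHW order, might look like the sketch below. `rescale_nchw` is a hypothetical helper, and Python's `float` stands in for the output dtype. The cast happens before the multiplication, as in the kernel, so integer pixel values are not truncated.

```python
def rescale_nchw(image, rescale_factor=1 / 255.0):
    """Cast every pixel to float, then multiply by rescale_factor."""
    return [
        [
            [[float(px) * rescale_factor for px in row] for row in channel]
            for channel in sample
        ]
        for sample in image
    ]


# One sample, one channel, 2x2 pixels.
pixels = [[[[0, 255], [51, 102]]]]
print(rescale_nchw(pixels))
```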
Method: normalize
def normalize(self, image: Tensor, o_dtype="float32"):
Applies per-channel mean subtraction and standard deviation normalization using the CLIP normalization constants (the OpenAI CLIP mean and standard deviation, which differ slightly from the commonly used ImageNet values):
| Channel | Mean | Std Dev |
|---|---|---|
| R (channel 0) | 0.48145466 | 0.26862954 |
| G (channel 1) | 0.4578275 | 0.26130258 |
| B (channel 2) | 0.40821073 | 0.27577711 |
The formula applied per pixel is:
output[n, c, h, w] = (input[n, c, h, w] - mean[c]) / stddev[c]
The mean and standard deviation values are initialized inside the TIR kernel using a T.init() block and stored in thread-local buffers.
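As a quick check of the formula, the sketch below applies it to single pixel values in plain Python. The constant names and `normalize_pixel` are hypothetical; the numbers are the ones from the table above, and the input is assumed to be already rescaled to [0, 1].

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(value, channel):
    """(value - mean[c]) / stddev[c] for one pixel of the given channel."""
    return (value - CLIP_MEAN[channel]) / CLIP_STD[channel]


# A white pixel (1.0 after rescaling) in each of the R, G, B channels.
print([normalize_pixel(1.0, c) for c in range(3)])
```

A pixel equal to the channel mean maps to 0, and values above or below the mean map to positive or negative outputs measured in units of the channel's standard deviation.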
Method: pad
def pad(self, image: Tensor, dtype="uint8"):
Pads the image vertically (top and bottom) so that the height becomes a multiple of 336 pixels. The padding value is 255 (white). Horizontal padding is set to zero (no padding).
h = image.shape[2]
tar = tir.truncdiv(h + 335, 336) * 336
t = tir.div(tar - h, 2) # top padding
b = tar - h - t # bottom padding
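The padding is split between top and bottom: the top receives half of the total, rounded down, and the bottom receives the remainder, so the bottom gets one extra row when the total is odd. Padded regions are filled with the value 255, while non-padded regions copy from the original image.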
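The padding arithmetic can be reproduced on the CPU with ordinary integer division, which matches truncating division for the non-negative heights involved here. The helpers below are hypothetical and pad a single channel stored as a list of rows.

```python
def pad_amounts(height, multiple=336):
    """Return (top, bottom) padding that rounds height up to a multiple."""
    target = (height + multiple - 1) // multiple * multiple
    top = (target - height) // 2
    bottom = target - height - top
    return top, bottom


def pad_channel(rows, multiple=336, fill=255):
    """Pad one channel vertically with rows of the fill value."""
    top, bottom = pad_amounts(len(rows), multiple)
    blank = [fill] * len(rows[0])
    return (
        [list(blank) for _ in range(top)]
        + [list(row) for row in rows]
        + [list(blank) for _ in range(bottom)]
    )


print(pad_amounts(500), pad_amounts(501), pad_amounts(336))
```

A height that is already a multiple of 336 is left unpadded, since (h + 335) // 336 * 336 then equals h.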
Method: preprocess
def preprocess(self, pixel_values):
    return pixel_values
A pass-through method that returns the input unchanged. This serves as a hook for subclasses to override with model-specific preprocessing pipelines.
GPU Kernel Design Pattern
All TIR kernels in this module share a common structure:
- Thread binding: Batch dimension bound to blockIdx.x, channel dimension bound to blockIdx.y.
- Spatial processing: Height and width dimensions processed in a grid loop within a schedulable block (T.sblock).
- Tiled scheduling: The apply_schedule method tiles the spatial dimensions and binds them to thread indices.
- Attributes: All kernels are marked with tir.is_scheduled = 1 (pre-scheduled, no further auto-scheduling) and tir.noalias = True.
Categories
- Vision Preprocessing
- GPU Kernels
- TVM TIR
- Image Processing
- CLIP Pipeline