Principle:Mit han lab Llm awq Device Warmup and Kernel Tuning

Overview

A runtime optimization step that warms up the GPU and auto-tunes quantized GEMM kernel parameters to reduce inference latency.

Description

Before running inference, two optimizations are applied:

Device Warmup

Device warmup runs dummy matrix multiplications to initialize GPU caches, memory allocators, and CUDA contexts, avoiding cold-start latency during actual inference. Without warmup, the first few inference calls exhibit significantly higher latency due to:

CUDA context initialization
GPU memory allocator warm-up
L2 cache population
cuBLAS handle creation
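
The warmup described above can be sketched as follows. This is a minimal illustration that assumes PyTorch is installed; the function name, matrix size, iteration count, and the CPU fallback are assumptions made for the example, not confirmed details of the repository's implementation.

```python
import torch


def device_warmup(device: str = "cuda", size: int = 4096, iters: int = 100) -> str:
    """Run dummy matrix multiplications so one-time GPU setup happens up front.

    Returns the device type that was actually used ("cuda" or "cpu").
    """
    # Fall back to CPU so the sketch still runs on machines without a GPU.
    use_cuda = device.startswith("cuda") and torch.cuda.is_available()
    dev = torch.device(device if use_cuda else "cpu")
    # Half precision is typical for GPU inference; float32 is the safe choice on CPU.
    dtype = torch.float16 if use_cuda else torch.float32

    a = torch.randn(size, size, device=dev, dtype=dtype)
    for _ in range(iters):
        torch.mm(a, a)  # result is discarded; only the side effects matter

    if use_cuda:
        # CUDA kernels launch asynchronously; wait until all warmup work finishes.
        torch.cuda.synchronize(dev)
    return dev.type
```

Calling `device_warmup("cuda")` once after the model is loaded is enough; the call is cheap relative to model loading, and a smaller `size` or `iters` can be passed when testing on CPU.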

Kernel Tuning

Kernel tuning finds the best split_k_iters value for each unique (in_features, out_features) WQLinear shape. It benchmarks the candidate values 1, 2, 4, 8, 16, and 32 and selects the one with the lowest median latency.
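
The selection rule can be sketched in plain Python. The helper names (`time_call`, `tune_split_k`, `fake_measure`) and the synthetic latencies are hypothetical and exist only to make the example deterministic; a real tuner would time the quantized GEMM kernel itself and synchronize the GPU before reading the clock.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List, Tuple

Shape = Tuple[int, int]  # (in_features, out_features)
CANDIDATES: Tuple[int, ...] = (1, 2, 4, 8, 16, 32)


def time_call(fn: Callable[[], object], repeats: int = 20) -> List[float]:
    """Return wall-clock latencies in seconds for `repeats` calls of `fn`.

    For GPU work, `fn` should synchronize the device before returning.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def tune_split_k(
    shapes: Iterable[Shape],
    measure: Callable[[Shape, int], List[float]],
    candidates: Tuple[int, ...] = CANDIDATES,
) -> Dict[Shape, int]:
    """Pick, per unique shape, the candidate with the lowest median latency."""
    best: Dict[Shape, int] = {}
    for shape in dict.fromkeys(shapes):  # drop duplicate shapes, keep order
        medians = {k: statistics.median(measure(shape, k)) for k in candidates}
        best[shape] = min(medians, key=medians.get)
    return best


def fake_measure(shape: Shape, split_k: int) -> List[float]:
    """Synthetic latencies (demo only): fastest when split_k == in_features // 1024."""
    base = abs(split_k - shape[0] // 1024) + 1.0
    # The last sample is an outlier; the median ignores it.
    return [base, base + 0.1, base + 5.0]


layer_shapes = [(4096, 4096), (8192, 4096), (4096, 4096)]
tuned = tune_split_k(layer_shapes, fake_measure)
```

Using the median rather than the mean keeps a single slow run, such as the outlier in `fake_measure`, from changing which candidate wins.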

The split_k_iters parameter controls how the matrix multiplication is partitioned across GPU thread blocks. The optimal value depends on the specific matrix dimensions and the GPU architecture. By profiling each unique layer shape, the tuner ensures that every quantized linear layer uses its best configuration.
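
To make the partitioning concrete, here is a small CPU-only sketch that assumes split_k_iters follows the common split-K scheme: the reduction (K) dimension is cut into slices, each slice produces a partial output, and the partials are summed. The function names are illustrative, and the real kernel performs this work in parallel on the GPU.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference product of an (M x K) and a (K x N) matrix."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def split_k_matmul(a: Matrix, b: Matrix, split_k_iters: int) -> Matrix:
    """Compute the same product by splitting the K dimension into slices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    chunk = -(-inner // split_k_iters)  # ceiling division: slice length along K

    partials = []
    for s in range(split_k_iters):
        lo, hi = s * chunk, min((s + 1) * chunk, inner)
        # Each slice would map to an independent group of thread blocks on a GPU.
        # If lo >= hi the slice is empty and contributes zeros.
        partials.append(
            [
                [sum(a[i][k] * b[k][j] for k in range(lo, hi)) for j in range(cols)]
                for i in range(rows)
            ]
        )

    # Final reduction: add the partial outputs element by element.
    return [
        [sum(p[i][j] for p in partials) for j in range(cols)] for i in range(rows)
    ]


a = [[i + k for k in range(8)] for i in range(2)]      # 2 x 8
b = [[k * j + 1 for j in range(3)] for k in range(8)]  # 8 x 3
reference = matmul(a, b)
```

More slices expose more parallel work but add reduction overhead, which is why the best value varies with matrix shape and hardware and has to be measured.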

Usage

After model loading and before the inference loop:

Call device warmup to initialize GPU state
Run kernel tuning across all WQLinear layers in the model
Begin the inference/chat loop with optimized kernel parameters
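
The ordering of these steps can be sketched as below. `QuantLinearStub` is a hypothetical stand-in for a WQLinear layer, its default split_k_iters is an assumption, and `warmup` and `tune_shape` are placeholder callables supplied by the caller. The sketch shows only the sequencing and that each unique shape is tuned once.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, int]  # (in_features, out_features)


@dataclass
class QuantLinearStub:
    """Hypothetical stand-in for a quantized linear (WQLinear) layer."""
    in_features: int
    out_features: int
    split_k_iters: int = 8  # assumed untuned default


def prepare_for_inference(
    layers: List[QuantLinearStub],
    warmup: Callable[[], None],
    tune_shape: Callable[[Shape], int],
) -> Dict[Shape, int]:
    """Warm up the device, then tune each unique layer shape exactly once."""
    warmup()  # step 1: initialize GPU state before any timing happens

    tuned: Dict[Shape, int] = {}
    for layer in layers:  # step 2: tune, reusing results for repeated shapes
        shape = (layer.in_features, layer.out_features)
        if shape not in tuned:
            tuned[shape] = tune_shape(shape)
        layer.split_k_iters = tuned[shape]
    return tuned  # step 3 (the inference loop) starts after this returns


calls: List[object] = []


def demo_warmup() -> None:
    calls.append("warmup")


def demo_tune(shape: Shape) -> int:
    calls.append(shape)
    return 4 if shape[0] <= 4096 else 16  # placeholder result, not a measurement


layers = [
    QuantLinearStub(4096, 4096),
    QuantLinearStub(11008, 4096),
    QuantLinearStub(4096, 4096),
]
tuned = prepare_for_inference(layers, demo_warmup, demo_tune)
```

Running warmup first matters because tuning relies on latency measurements, and measurements taken on a cold device would be skewed by one-time initialization costs.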

These steps typically add a few seconds to model loading time but can improve per-token generation speed by 10-30%.

Related Pages

Knowledge Sources

Repo: llm-awq, https://github.com/mit-han-lab/llm-awq

Domains

Inference
Optimization
