Principle:Mit han lab Llm awq Device Warmup and Kernel Tuning
Overview
A runtime optimization step that warms up the GPU and auto-tunes quantized GEMM kernel parameters before inference begins.
Description
Before running inference, two optimizations are applied:
Device Warmup
Device warmup runs dummy matrix multiplications to initialize GPU caches, memory allocators, and CUDA contexts, avoiding cold-start latency during actual inference. Without warmup, the first few inference calls exhibit significantly higher latency due to:
- CUDA context initialization
- GPU memory allocator warm-up
- L2 cache population
- cuBLAS handle creation
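The cold-start effect described above can be illustrated without a GPU. The following is a minimal sketch in plain Python, not the llm-awq implementation: LazyDevice is a hypothetical stand-in whose first call pays a one-time setup cost, and warm_up is an illustrative helper that issues throwaway calls so that cost is paid before real work begins. On a real GPU the throwaway work would be dummy matrix multiplications on the target device.

```python
import time


class LazyDevice:
    """Hypothetical stand-in for a GPU: the first call pays a one-time setup cost."""

    def __init__(self, setup_seconds=0.05):
        self.setup_seconds = setup_seconds
        self.initialized = False

    def matmul(self, n=32):
        if not self.initialized:
            # Simulates one-time work such as context and handle creation.
            time.sleep(self.setup_seconds)
            self.initialized = True
        # Small dummy workload standing in for a real matrix multiplication.
        return sum(i * i for i in range(n * n))


def warm_up(device, iters=10):
    """Run throwaway work so one-time costs are paid before real inference."""
    for _ in range(iters):
        device.matmul()


def timed_call(device):
    """Return the wall-clock duration of a single call, in seconds."""
    start = time.perf_counter()
    device.matmul()
    return time.perf_counter() - start


cold = LazyDevice()
cold_first = timed_call(cold)   # includes the simulated setup cost

warm = LazyDevice()
warm_up(warm)
warm_first = timed_call(warm)   # setup was already paid during warmup

print(f"first call without warmup: {cold_first * 1e3:.2f} ms")
print(f"first call after warmup:   {warm_first * 1e3:.2f} ms")
```

The point of the sketch is only the ordering: the expensive first call happens inside warm_up, so the first call that matters runs at steady-state speed.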
Kernel Tuning
Kernel tuning finds the best split_k_iters value for each unique (in_features, out_features) WQLinear shape by benchmarking a fixed set of candidates (1, 2, 4, 8, 16, 32) and selecting the one with the lowest median latency.
The split_k_iters parameter controls how the matrix multiplication is partitioned across GPU thread blocks; as the name suggests, it sets how many pieces the reduction (K) dimension is split into. The best value depends on the matrix dimensions and the GPU architecture, so the tuner profiles each unique layer shape and gives every quantized linear layer the fastest of the tested configurations for its shape.
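The selection rule above (benchmark each candidate, keep the one with the lowest median latency) can be sketched as follows. This is an illustrative sketch, not the repository's tuner: pick_split_k_iters, median_latency, and make_kernel are hypothetical names, and the usage example replaces real timing with a made-up cost model so the result is deterministic.

```python
import statistics
import time

CANDIDATES = (1, 2, 4, 8, 16, 32)


def median_latency(run, repeats=20):
    """Median wall-clock time of `run()` over `repeats` calls, in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pick_split_k_iters(make_kernel, candidates=CANDIDATES, measure=median_latency):
    """Return (best_value, latencies); best_value has the lowest measured latency.

    `make_kernel(split_k_iters)` must return a zero-argument callable that
    runs the layer once with that setting.
    """
    latencies = {k: measure(make_kernel(k)) for k in candidates}
    best = min(latencies, key=latencies.get)
    return best, latencies


# Usage with a made-up cost model (an assumption for illustration only):
# the "kernel" returns its own pretend latency, which is lowest at 8.
def fake_kernel(k):
    return lambda: abs(k - 8) + 1.0


def fake_measure(kernel):
    return kernel()


best, table = pick_split_k_iters(fake_kernel, measure=fake_measure)
print(best)  # prints 8
```

When timing real GPU kernels, the measurement function also has to wait for the device to finish before reading the clock, because kernel launches are asynchronous. The median is used rather than the mean so that occasional slow outliers do not skew the choice.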
Usage
After loading the model and before starting the inference loop, perform these steps in order:
- Call device warmup to initialize GPU state
- Run kernel tuning across all WQLinear layers in the model
- Begin the inference/chat loop with optimized kernel parameters
These steps typically add a few seconds to model loading time but can improve per-token generation speed by 10-30%.
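The ordering of the steps above, and the idea of tuning each unique layer shape only once, can be sketched as follows. Everything here is hypothetical scaffolding: FakeQuantLinear stands in for a quantized linear layer (its default split_k_iters of 8 is an arbitrary placeholder), and the warmup, tuning, and inference callables are stubs that only record when they were called.

```python
from dataclasses import dataclass


@dataclass
class FakeQuantLinear:
    """Hypothetical stand-in for a quantized linear layer."""
    in_features: int
    out_features: int
    split_k_iters: int = 8  # arbitrary placeholder default


def tune_layers(layers, tune_shape):
    """Tune each unique (in_features, out_features) shape once and apply the
    result to every layer with that shape. Returns the per-shape choices."""
    chosen = {}
    for layer in layers:
        shape = (layer.in_features, layer.out_features)
        if shape not in chosen:
            chosen[shape] = tune_shape(shape)  # benchmark only on first sight
        layer.split_k_iters = chosen[shape]
    return chosen


def prepare_and_run(layers, warm_up, tune_shape, inference_loop):
    """Order of operations: warmup, then tuning, then inference."""
    warm_up()
    chosen = tune_layers(layers, tune_shape)
    inference_loop(layers)
    return chosen


# Usage with stubs that record the order of events.
events = []


def fake_warm_up():
    events.append("warmup")


def fake_tune_shape(shape):
    events.append(("tune", shape))
    return 4 if shape[1] > 4096 else 16  # made-up choices, not measurements


def fake_loop(layers):
    events.append("inference")


layers = [
    FakeQuantLinear(4096, 4096),
    FakeQuantLinear(4096, 11008),
    FakeQuantLinear(4096, 4096),  # same shape as the first layer
]
chosen = prepare_and_run(layers, fake_warm_up, fake_tune_shape, fake_loop)
print(events)
```

Caching the result per shape keeps tuning time proportional to the number of distinct shapes rather than the number of layers, which matters because transformer blocks repeat the same few shapes many times.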
Related Pages
- Implementation:Mit_han_lab_Llm_awq_Device_warmup
- Heuristic:Mit_han_lab_Llm_awq_Kernel_Selection_Thresholds
Knowledge Sources
- Repo: llm-awq (https://github.com/mit-han-lab/llm-awq)
Domains
- Inference
- Optimization