Principle:Mit han lab Llm awq Device Warmup and Kernel Tuning

From Leeroopedia

Overview

A runtime optimization step that warms up the GPU and auto-tunes quantized GEMM kernel parameters so that inference runs at its best achievable speed from the first request.

Description

Before running inference, two optimizations are applied:

Device Warmup

Device warmup runs dummy matrix multiplications to initialize GPU caches, memory allocators, and CUDA contexts, avoiding cold-start latency during actual inference. Without warmup, the first few inference calls exhibit significantly higher latency due to:

  • CUDA context initialization
  • GPU memory allocator warm-up
  • L2 cache population
  • cuBLAS handle creation
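The warmup step described above can be sketched as follows. This is a minimal illustration using PyTorch, not the llm-awq implementation: the function name `device_warmup`, the matrix size, and the iteration count are assumptions chosen for the example. The function does nothing and returns 0 when PyTorch or a CUDA device is unavailable.

```python
import importlib.util


def device_warmup(device="cuda", size=1024, iters=10):
    """Run dummy matmuls on the GPU and return how many were executed.

    Hypothetical helper: the name, size, and iters are illustrative choices.
    """
    # Skip quietly when PyTorch is not installed or no CUDA device is present.
    if importlib.util.find_spec("torch") is None:
        return 0
    import torch

    if not torch.cuda.is_available():
        return 0

    with torch.no_grad():
        # The first allocation and matmul trigger CUDA context setup,
        # allocator pool growth, and cuBLAS handle creation.
        a = torch.randn(size, size, device=device, dtype=torch.float16)
        b = torch.randn(size, size, device=device, dtype=torch.float16)
        for _ in range(iters):
            torch.mm(a, b)
    # Kernel launches are asynchronous; wait until all of them have finished.
    torch.cuda.synchronize()
    return iters
```

Calling `device_warmup()` once after the model is loaded should be enough; the results of the multiplications are discarded on purpose.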

Kernel Tuning

Kernel tuning finds the best split_k_iters value for each unique (in_features, out_features) WQLinear shape by benchmarking a fixed set of candidates (1, 2, 4, 8, 16, 32) and selecting the one with the lowest median latency.
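The selection rule can be sketched in a few lines. The helper names `time_samples` and `tune_split_k` are hypothetical, and the kernel is abstracted behind a `measure` callable so that the example does not assume any particular llm-awq API.

```python
import statistics
import time

CANDIDATES = (1, 2, 4, 8, 16, 32)


def time_samples(run, repeats=20):
    """Call run() repeatedly and return wall-clock latencies in seconds.

    For GPU work, run() must synchronize the device before returning,
    otherwise only the asynchronous launch is timed.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return samples


def tune_split_k(measure, candidates=CANDIDATES):
    """Return (best_value, medians), where measure(k) yields latency samples."""
    medians = {k: statistics.median(measure(k)) for k in candidates}
    best = min(medians, key=medians.get)  # first candidate wins on a tie
    return best, medians
```

In practice, `measure` would wrap `time_samples` around a call to the quantized kernel with the given candidate value. The median is used rather than the mean because a single slow outlier run does not shift it.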

The split_k_iters parameter controls how the matrix multiplication is partitioned across GPU thread blocks. The optimal value depends on the specific matrix dimensions and the GPU architecture. By profiling each unique layer shape, the tuner ensures that every quantized linear layer uses its best configuration.
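To make the partitioning idea concrete, here is a pure-Python sketch of conventional split-K matrix multiplication. It assumes that split_k_iters divides the reduction (K) dimension into slices whose partial results are summed; the actual CUDA kernel may organize the work differently, and `split_k_matmul` is a hypothetical name.

```python
import math


def split_k_matmul(a, b, split_k_iters):
    """Multiply a (M x K) by b (K x N) with the K dimension cut into slices."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = math.ceil(k / split_k_iters)
    out = [[0] * n for _ in range(m)]
    for s in range(split_k_iters):
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        # On a GPU, each slice could go to its own group of thread blocks;
        # here the slices simply run one after another.
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][p] * b[p][j] for p in range(lo, hi))
                out[i][j] += partial  # reduction of per-slice partial sums
    return out
```

The result is the same for every value of split_k_iters; only the amount of parallel work and the cost of the final reduction change, which is why the fastest setting varies with the matrix shape.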

Usage

After model loading and before the inference loop:

  • Call device warmup to initialize GPU state
  • Run kernel tuning across all WQLinear layers in the model
  • Begin the inference/chat loop with optimized kernel parameters

These steps typically add a few seconds to model loading time but can improve per-token generation speed by 10-30%; the actual gain depends on the layer shapes and the GPU architecture.
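The tuning pass over the model can be sketched as below. `LayerStub` is a hypothetical stand-in for a WQLinear layer, and `pick_best` is an assumed callable that benchmarks one shape and returns the chosen split_k_iters value. The point of the sketch is that each unique shape is benchmarked once and the result is reused for every layer that shares it.

```python
from dataclasses import dataclass


@dataclass
class LayerStub:
    """Hypothetical stand-in for a WQLinear layer (only the fields used here)."""
    in_features: int
    out_features: int
    split_k_iters: int = 8  # placeholder default, not taken from the source


def tune_layers(layers, pick_best):
    """Tune once per unique shape, then write the result to every layer."""
    best_by_shape = {}
    for layer in layers:
        shape = (layer.in_features, layer.out_features)
        if shape not in best_by_shape:
            best_by_shape[shape] = pick_best(shape)  # benchmark only new shapes
        layer.split_k_iters = best_by_shape[shape]
    return best_by_shape
```

Because transformer blocks repeat the same few layer shapes, caching by shape keeps the number of benchmark runs small even for deep models.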

Domains

  • Inference
  • Optimization
